vimalabs / VIMA

Official Algorithm Implementation of ICML'23 Paper "VIMA: General Robot Manipulation with Multimodal Prompts"
MIT License

Implementation details of the loss scaling algorithm #53

Open jiajingk opened 6 months ago

jiajingk commented 6 months ago

Thank you for sharing such great work!

I encountered the following problem while reproducing your paper and was wondering if you could offer some guidance or clarification.

I trained the 2M VIMAPolicy (with all weights initialized from their default distributions, except T5) on a small subset of the VIMA-Bench dataset (32 samples per task, 13 tasks in total) and tried to make it overfit.

The imitation loss (computed as `cross_entropy_loss(dist_dict._logits, discrete_target_action)`) of different action attributes (such as `pose0_rotation` and `pose1_position`) can evolve very differently during training, as shown in the plot below. In this experiment, the final loss is the equal-weight sum over all action attributes, normalized by the number of time steps (a minimal sketch of this aggregation is given after the plot description below).


The plot shows how the different per-step loss attributes converge.
For example, `pose0_rotation_0` means the loss associated with the first dimension of `pose0_rotation` at a single time step.
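
For concreteness, here is a minimal sketch of the aggregation I described above. The dictionary layout and tensor shapes are my own assumptions, not necessarily VIMA's exact internals:

```python
import torch
import torch.nn.functional as F

def imitation_loss(
    logits: dict[str, torch.Tensor],   # attr name -> (T, n_dims, n_bins)
    targets: dict[str, torch.Tensor],  # attr name -> (T, n_dims), discrete bin ids
) -> torch.Tensor:
    """Equal-weight sum of per-dimension cross-entropy terms,
    normalized by the number of time steps T."""
    n_steps = next(iter(targets.values())).shape[0]
    total = 0.0
    for attr, attr_logits in logits.items():
        t, n_dims, n_bins = attr_logits.shape
        # one cross-entropy term per action dimension and time step
        total = total + F.cross_entropy(
            attr_logits.reshape(t * n_dims, n_bins),
            targets[attr].reshape(t * n_dims),
            reduction="sum",
        )
    return total / n_steps
```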

Zooming in on the first and last 100 epochs of the experiment, it can be seen that all dimensions of `pose0_rotation` and the first two dimensions of `pose1_rotation` converge very quickly to zero, while the other losses converge relatively slowly. The scale difference between them also changes dynamically during training.


(Figure: first 100 epochs)


(Figure: last 100 epochs)

In the same experiment, I also measured the ratio of the average loss between different tasks and got the table below. For example, 16.745474 means that the average loss over `rearrange_then_restore` samples is about 16× larger than that over `novel_noun` samples (a sketch of how such a table can be computed follows the table).

| Task | Average loss ratio (novel_noun = 1) |
| --- | --- |
| novel_noun | 1.000000 |
| sweep_without_exceeding | 1.602642 |
| rotate | 1.857377 |
| visual_manipulation | 1.998764 |
| twist | 3.802508 |
| manipulate_old_neighbor | 4.956325 |
| scene_understanding | 5.033336 |
| follow_order | 5.132609 |
| rearrange | 5.827855 |
| pick_in_order_then_restore | 11.248917 |
| rearrange_then_restore | 16.745474 |
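
This is roughly how such a table can be computed, assuming per-sample losses have been logged as (task name, loss) pairs; `per_sample_losses` below is a hypothetical name for that log:

```python
from collections import defaultdict

def loss_ratio_table(per_sample_losses: list[tuple[str, float]]) -> dict[str, float]:
    """Average the loss per task, then normalize by the smallest
    per-task average so the easiest task maps to 1.0."""
    sums: defaultdict[str, float] = defaultdict(float)
    counts: defaultdict[str, int] = defaultdict(int)
    for task, loss in per_sample_losses:
        sums[task] += loss
        counts[task] += 1
    averages = {task: sums[task] / counts[task] for task in sums}
    baseline = min(averages.values())  # novel_noun in the table above
    return dict(sorted(
        ((task, avg / baseline) for task, avg in averages.items()),
        key=lambda kv: kv[1],
    ))
```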

I would like to know how those losses (per action attribute and per task) are balanced during training. Thank you
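
For context, one standard scheme for balancing heterogeneous loss terms is the learnable uncertainty weighting of Kendall et al. (CVPR 2018). A minimal sketch is below purely as a reference point; I don't know whether VIMA uses anything like this:

```python
import torch
import torch.nn as nn

class UncertaintyWeighting(nn.Module):
    """Balance loss terms via learned homoscedastic uncertainty
    (Kendall et al., 2018): each term L_i becomes exp(-s_i) * L_i + s_i,
    where s_i = log(sigma_i^2) is a learned scalar."""

    def __init__(self, loss_names: list[str]):
        super().__init__()
        self.log_vars = nn.ParameterDict(
            {name: nn.Parameter(torch.zeros(())) for name in loss_names}
        )

    def forward(self, losses: dict[str, torch.Tensor]) -> torch.Tensor:
        total = 0.0
        for name, loss in losses.items():
            s = self.log_vars[name]
            # terms with high learned variance are automatically down-weighted
            total = total + torch.exp(-s) * loss + s
        return total
```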

amitkparekh commented 2 months ago

I've encountered similar questions. I released everything I did here: https://github.com/amitkparekh/CoGeLoT. Maybe it has some answers to your questions?