vimalabs / VIMA

Official Algorithm Implementation of ICML'23 Paper "VIMA: General Robot Manipulation with Multimodal Prompts"
MIT License

Implementation details of the loss scaling algorithm #53

Open jiajingk opened 6 months ago

jiajingk commented 6 months ago

Thank you for sharing such great work!

I encountered the following problem while reproducing your paper and was wondering if you could offer some guidance or clarification.

I trained the 2M VIMAPolicy (with all weights initialized from their default distributions, except T5) on a small subset of the VIMA-Bench dataset (32 samples per task, 13 tasks in total) and tried to make it overfit.

The imitation loss (computed as `cross_entropy_loss(dist_dict._logits, discrete_target_action)`) of different action attributes (such as `pose0_rotation` and `pose1_position`) can evolve very differently during training, as shown in the plot below. In this experiment, the final loss is the equal-weight sum over all action attributes, normalized by the number of time steps (a minimal sketch of this aggregation is given after the plot description below).


The plot shows how the different per-step loss attributes converge.
For example, `pose0_rotation_0` means the loss associated with the first dimension of `pose0_rotation` at a single time step.
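
For concreteness, here is a minimal sketch of the aggregation I described above. The dictionary layout and tensor shapes are my own assumptions, not necessarily VIMA's exact internals:

```python
import torch
import torch.nn.functional as F

def imitation_loss(
    logits: dict[str, torch.Tensor],   # attr name -> (T, n_dims, n_bins)
    targets: dict[str, torch.Tensor],  # attr name -> (T, n_dims), discrete bin ids
) -> torch.Tensor:
    """Equal-weight sum of per-dimension cross-entropy terms,
    normalized by the number of time steps T."""
    n_steps = next(iter(targets.values())).shape[0]
    total = 0.0
    for attr, attr_logits in logits.items():
        t, n_dims, n_bins = attr_logits.shape
        # one cross-entropy term per action dimension and time step
        total = total + F.cross_entropy(
            attr_logits.reshape(t * n_dims, n_bins),
            targets[attr].reshape(t * n_dims),
            reduction="sum",
        )
    return total / n_steps
```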

Zooming in on the first and last 100 epochs of the experiment, it can be seen that all dimensions of `pose0_rotation` and the first two dimensions of `pose1_rotation` converge very quickly to zero, while the other losses converge relatively slowly. The scale difference between them also changes dynamically during training.


(Figure: first 100 epochs)


(Figure: last 100 epochs)

In the same experiment, I also measured the ratio of the average loss between different tasks and got the table below. For example, 16.745474 means that the average loss over `rearrange_then_restore` samples is about 16× larger than that over `novel_noun` samples (a sketch of how such a table can be computed follows the table).

| Task | Average loss ratio (novel_noun = 1) |
| --- | --- |
| novel_noun | 1.000000 |
| sweep_without_exceeding | 1.602642 |
| rotate | 1.857377 |
| visual_manipulation | 1.998764 |
| twist | 3.802508 |
| manipulate_old_neighbor | 4.956325 |
| scene_understanding | 5.033336 |
| follow_order | 5.132609 |
| rearrange | 5.827855 |
| pick_in_order_then_restore | 11.248917 |
| rearrange_then_restore | 16.745474 |
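
This is roughly how such a table can be computed, assuming per-sample losses have been logged as (task name, loss) pairs; `per_sample_losses` below is a hypothetical name for that log:

```python
from collections import defaultdict

def loss_ratio_table(per_sample_losses: list[tuple[str, float]]) -> dict[str, float]:
    """Average the loss per task, then normalize by the smallest
    per-task average so the easiest task maps to 1.0."""
    sums: defaultdict[str, float] = defaultdict(float)
    counts: defaultdict[str, int] = defaultdict(int)
    for task, loss in per_sample_losses:
        sums[task] += loss
        counts[task] += 1
    averages = {task: sums[task] / counts[task] for task in sums}
    baseline = min(averages.values())  # novel_noun in the table above
    return dict(sorted(
        ((task, avg / baseline) for task, avg in averages.items()),
        key=lambda kv: kv[1],
    ))
```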

I would like to know how those losses (per action attribute and per task) are balanced during training. Thank you
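
For context, one standard scheme for balancing heterogeneous loss terms is the learnable uncertainty weighting of Kendall et al. (CVPR 2018). A minimal sketch is below purely as a reference point; I don't know whether VIMA uses anything like this:

```python
import torch
import torch.nn as nn

class UncertaintyWeighting(nn.Module):
    """Balance loss terms via learned homoscedastic uncertainty
    (Kendall et al., 2018): each term L_i becomes exp(-s_i) * L_i + s_i,
    where s_i = log(sigma_i^2) is a learned scalar."""

    def __init__(self, loss_names: list[str]):
        super().__init__()
        self.log_vars = nn.ParameterDict(
            {name: nn.Parameter(torch.zeros(())) for name in loss_names}
        )

    def forward(self, losses: dict[str, torch.Tensor]) -> torch.Tensor:
        total = 0.0
        for name, loss in losses.items():
            s = self.log_vars[name]
            # terms with high learned variance are automatically down-weighted
            total = total + torch.exp(-s) * loss + s
        return total
```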

amitkparekh commented 2 months ago

I've encountered similar questions. I released everything I did here: https://github.com/amitkparekh/CoGeLoT. Maybe it has some answers to your questions?