lucidrains / robotic-transformer-pytorch

Implementation of RT1 (Robotic Transformer) in Pytorch
MIT License

A few questions about this implementation #6

Open sebbyjp opened 8 months ago

sebbyjp commented 8 months ago
  1. Are the past images in a video used to condition the hidden layers, as in https://deepimagination.cc/eDiff-I/ ?

  2. Why do you predict actions for every frame of the video (output shape (b, f, action_dim, vocab_size)) instead of the expected (b, action_dim, vocab_size) for next-action prediction? The cross-entropy loss on the final action prediction (labeled "single eval loss" in the screenshot below) seems rather high, although still an improvement over the rt1x released by Google and over Octo:

    [screenshot: eval loss curves]
  3. Additionally, the training cross-entropy loss over the full frame-wise prediction seems to saturate before reaching 0 for every LR schedule I tried:

    [screenshot: training loss curve]
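Regarding question 2, here is a minimal sketch (shapes and variable names are hypothetical, not taken from this repo) of slicing out only the final frame's logits so the loss corresponds to a single next-action prediction of shape (b, action_dim, vocab_size):

```python
import torch
import torch.nn.functional as F

# Hypothetical shapes matching the question: the model emits logits for
# every frame, i.e. (batch, frames, action_dim, vocab_size).
b, f, action_dim, vocab_size = 16, 6, 11, 256
logits = torch.randn(b, f, action_dim, vocab_size)
targets = torch.randint(0, vocab_size, (b, action_dim))  # next-action labels

# Evaluate only the last frame's prediction ("single eval loss"):
last = logits[:, -1]                   # (b, action_dim, vocab_size)
eval_loss = F.cross_entropy(
    last.reshape(-1, vocab_size),      # (b * action_dim, vocab_size)
    targets.reshape(-1),               # (b * action_dim,)
)
```

Training on all frames (as the repo's output shape suggests) amounts to a denser auxiliary signal, while the quantity reported above as "single eval loss" would be the last-frame slice only.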
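On question 3, one variable worth ruling out is the schedule itself. A common warmup-plus-cosine setup via `torch.optim.lr_scheduler.LambdaLR` (the model, peak LR, and step counts below are placeholders, not values from my runs):

```python
import math
import torch

model = torch.nn.Linear(8, 8)              # placeholder model
opt = torch.optim.AdamW(model.parameters(), lr=3e-4)

warmup_steps, total_steps = 1_000, 100_000  # hypothetical training horizon

def lr_lambda(step: int) -> float:
    # Linear warmup, then cosine decay to zero.
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * (1.0 + math.cos(math.pi * min(1.0, progress)))

sched = torch.optim.lr_scheduler.LambdaLR(opt, lr_lambda)

# Per training step: opt.step() followed by sched.step()
for _ in range(10):
    opt.step()
    sched.step()
```

If the loss plateau persists across schedules like this one, the bottleneck is more likely batch size or data augmentation than the LR curve.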

Additional info:

- I'm only able to fit a batch size of 16 on my GPUs, which may be the issue. Alternatively, the data augmentation from https://github.com/octo-models/octo/blob/main/examples/06_pytorch_oxe_dataloader.py could be the problem.

- I am using a pre-trained MaxViT from PyTorch together with your classifier_free_guidance layers, as seen here: https://github.com/kyegomez/RT-X/blob/031e6edb1734774e772f497b11fb49df634fef8d/rtx/rtx1.py#L402 (I'm happy to open a pull request to add this option here as well).

- I am using https://github.com/sebbyjp/robo_transformers for comparison against the official rt1x and Octo baselines.