Why are you predicting actions for every frame of the video (output shape (b, f, action_dim, vocab_size)) instead of the expected (b, action_dim, vocab_size) for next-action prediction? The cross-entropy loss for the final action prediction (labeled "single eval loss") seems rather high, although it is still an improvement over the RT-1-X released by Google and over Octo:
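To make the two formulations concrete, here's a minimal sketch of what I mean (plain NumPy, hypothetical shapes and names; not the repo's actual code): a loss averaged over every frame's action prediction versus a loss on the final frame only.

```python
import numpy as np

def cross_entropy(logits, targets):
    # logits: (..., vocab_size); targets: integer class indices (...)
    # numerically stable log-softmax over the last axis
    shifted = logits - logits.max(axis=-1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    return -np.take_along_axis(log_probs, targets[..., None], axis=-1).mean()

b, f, action_dim, vocab_size = 2, 6, 7, 256
rng = np.random.default_rng(0)
logits = rng.standard_normal((b, f, action_dim, vocab_size))
targets = rng.integers(0, vocab_size, size=(b, f, action_dim))

# loss over every frame's predicted action (what the (b, f, ...) output implies)
full_loss = cross_entropy(logits, targets)

# loss over only the final frame's action (the "single eval loss" I plot)
single_loss = cross_entropy(logits[:, -1], targets[:, -1])
```

With untrained (random) logits both losses sit near ln(vocab_size); the question is why the per-frame version is the training objective rather than the final-frame one.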
Additionally, the training cross-entropy loss over the entire frame prediction seems to saturate before reaching 0 for the LR schedules I tried:
Are the past images in a video used to condition the hidden layers, as in https://deepimagination.cc/eDiff-I/ ?
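For reference, by "conditioning the hidden layers" I mean something like cross-attention from the current frame's hidden states to per-frame embeddings of the past images, sketched minimally below (plain NumPy, all names and shapes are hypothetical illustrations, not the repo's implementation):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attend(hidden, past_frame_emb, d_k=32, seed=1):
    # hidden: (tokens, d) current-frame hidden states (queries)
    # past_frame_emb: (frames, d) one embedding per past frame (keys/values)
    rng = np.random.default_rng(seed)
    d = hidden.shape[-1]
    Wq = rng.standard_normal((d, d_k)) / np.sqrt(d)
    Wk = rng.standard_normal((d, d_k)) / np.sqrt(d)
    Wv = rng.standard_normal((d, d)) / np.sqrt(d)
    q, k, v = hidden @ Wq, past_frame_emb @ Wk, past_frame_emb @ Wv
    attn = softmax(q @ k.T / np.sqrt(d_k))  # (tokens, frames)
    return hidden + attn @ v                # residual conditioning on past frames

tokens, frames, d = 16, 5, 64
rng = np.random.default_rng(0)
out = cross_attend(rng.standard_normal((tokens, d)),
                   rng.standard_normal((frames, d)))
```

I'm asking whether the architecture injects the history this way at hidden layers, or only stacks the frames at the input.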
Additional info:
- I'm only able to run a batch size of 16 on my GPUs, so maybe that is the issue. Or the data augmentation from https://github.com/octo-models/octo/blob/main/examples/06_pytorch_oxe_dataloader.py could be the issue.
- I am using a pre-trained MaxViT from PyTorch with your classifier_free_guidance layers, as seen here: https://github.com/kyegomez/RT-X/blob/031e6edb1734774e772f497b11fb49df634fef8d/rtx/rtx1.py#L402 (I'm happy to make a pull request to add this option here as well).
- I am using https://github.com/sebbyjp/robo_transformers for comparison to the official RT-1-X and Octo baselines.