
OpenVLA: An open-source vision-language-action model for robotic manipulation.

Logs for training from scratch with open-x datasets #65

Closed. hyy0613 closed this issue 1 month ago.

hyy0613 commented 1 month ago

Hi, thanks for your open-source code. I'm trying to train the model from scratch with the RT-X datasets. I noticed this passage in the paper:

"We also experimented with incorporating a few additional datasets into our training mixture that were added to the OpenX dataset since the release of Octo, including the DROID dataset [11], although at a conservative mixture weight of 10%. In practice, we found that the action token accuracy on DROID remained low throughout training, suggesting a larger mixture weight or model may be required to fit its diversity in the future. To not jeopardize the quality of the final model, we removed DROID from the data mixture for the final third of training."

Thus, I think I should select the vla.type prism-dinosiglip-224px+mx-oxe-magic-soup-plus. The configuration is as follows:

torchrun --nnodes 2 --nproc-per-node 8 vla-scripts/train.py \
  --vla.type prism-dinosiglip-224px+mx-oxe-magic-soup-plus \
  --data_root_dir /mnt/dolphinfs/hdd_pool/docker/user/RTX
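
For reference, the effective global batch size implied by such a torchrun launch is nnodes × nproc-per-node × per-GPU batch size. A minimal illustrative sketch (the helper name is hypothetical, and the per-GPU batch size of 32 is taken from the numbers quoted in this thread):

# Illustrative arithmetic only, not repo code.
def effective_batch_size(nnodes: int, nproc_per_node: int, per_gpu_batch: int) -> int:
    return nnodes * nproc_per_node * per_gpu_batch

print(effective_batch_size(2, 8, 32))  # 512  -> this run (2 nodes x 8 GPUs)
print(effective_batch_size(8, 8, 32))  # 2048 -> the original 64-GPU run, per the 64*32 figure below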

Other than reducing the number of nodes and the batch size proportionally to 512 (16 GPUs × 32 per GPU), I didn't change anything else. However, after training for 8,000 steps, I find that the action token accuracy improves quickly at the beginning, but then the loss plateaus around 1.55 and the average action token accuracy stays around 40% for a long time. I don't know if there is a problem in my training process. It would be really kind of you to give me some help; maybe I need to adjust the learning rate or other hyperparameters? Thank you very much.
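
For context on the metric in question, here is a minimal sketch of how an action-token-accuracy metric like this is commonly computed: mask to the positions holding discretized action tokens and compare greedy predictions against the labels. The function name, tensor shapes, and IGNORE_INDEX sentinel are illustrative assumptions, not the repo's exact implementation (the one-position shift used in causal-LM training is also omitted for brevity):

import torch

IGNORE_INDEX = -100  # assumed sentinel marking non-action positions in the labels

def action_token_accuracy(logits: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    # logits: (batch, seq, vocab); labels: (batch, seq), IGNORE_INDEX everywhere
    # except the positions that hold discretized action tokens.
    preds = logits.argmax(dim=-1)      # greedy token predictions
    mask = labels != IGNORE_INDEX      # keep only action-token positions
    return (preds[mask] == labels[mask]).float().mean()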

kpertsch commented 1 month ago

The orange curve below is our training curve for OpenVLA at a 64*32 batch size (64 GPUs x 32 per GPU, i.e. a global batch size of 2048). It's a bit unclear whether training would work with a much smaller batch size -- you could try gradient accumulation (at the expense of runtime, of course). The hyperparameters in the released configs are exactly the parameters we used for our run.

[image: OpenVLA training curve referenced above]
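
For the gradient-accumulation workaround suggested above, a minimal PyTorch sketch: with 16 GPUs at a per-GPU batch size of 32 (global 512), accumulating over 4 micro-batches recovers the 2048 effective batch size of the original run. The toy model and data are placeholders; OpenVLA's trainer may structure this differently.

import torch
import torch.nn as nn

model = nn.Linear(10, 2)                       # stand-in for the VLA model
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
loss_fn = nn.CrossEntropyLoss()
accum_steps = 4                                # 512 x 4 = 2048 effective batch

optimizer.zero_grad()
for step in range(16):
    x = torch.randn(32, 10)                    # one micro-batch of inputs
    y = torch.randint(0, 2, (32,))             # one micro-batch of targets
    loss = loss_fn(model(x), y) / accum_steps  # scale so accumulated grads average
    loss.backward()                            # gradients sum across micro-batches
    if (step + 1) % accum_steps == 0:
        optimizer.step()                       # one update per accum_steps micro-batches
        optimizer.zero_grad()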