Closed: nikriz1 closed this issue 1 month ago.
The above experiments finished with: provided checkpoint, max 1.95 / final 1.15; re-trained encoder, max 2.5 / final 1.2.
Results on the other tasks:
Block push: max moved mean 1.55, max entered mean 0.52 (paper: 0.65)
PushT: max max-coverage mean 0.62943, max final-coverage mean 0.57945 (paper: 0.66)
LIBERO: max eval_on_env 0.85 (paper: 0.93)
These all used re-trained encoders, and the eval total loss shows overfitting.
Hey @nikriz1, thanks for checking out our work and running the experiments!
Yeah, sim kitchen shouldn't be this low. This sounds like the policy might be out of distribution. Are you rendering with glfw or egl? There might be differences between CPU and GPU rendering. We've just run some repro runs with freshly trained encoders and the uploaded checkpoint:
The highlighted run (pink) uses a randomly initialized, frozen ResNet18 (random features), and the other runs use either the provided checkpoint or a freshly trained encoder. Sim kitchen is a relatively open-loop environment, and in our experiments even random or ImageNet-pretrained features score above 3. We couldn't reproduce your 1.2 result on our machine. We're using MuJoCo 2.1.0 with MUJOCO_GL=egl and the provided conda env.
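If it helps to rule out the renderer, a minimal sanity check is to force EGL and render one offscreen frame. This is only a sketch against the standalone mujoco bindings (pinned at 2.3.7 in the conda env), not the eval script itself, and mujoco_py may manage its offscreen context differently:

```python
import os

# MUJOCO_GL must be set before the MuJoCo bindings are imported.
os.environ.setdefault("MUJOCO_GL", "egl")

import mujoco  # imported after the env var is set

model = mujoco.MjModel.from_xml_string("<mujoco><worldbody/></mujoco>")
data = mujoco.MjData(model)
mujoco.mj_forward(model, data)

renderer = mujoco.Renderer(model, height=240, width=320)
renderer.update_scene(data)
frame = renderer.render()  # succeeds only if the EGL (GPU) backend is usable
print(frame.shape)         # (240, 320, 3)
```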
The 3.64 run reported in the paper is included below.
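For reference, the "random features" baseline above is just a frozen, randomly initialized ResNet18 used as a feature extractor. A rough sketch of that kind of setup (the torchvision naming here is illustrative, not the repo's exact encoder class):

```python
import torch
import torchvision.models as models

# Randomly initialized ResNet-18 used purely as a frozen feature extractor.
resnet = models.resnet18(weights=None)   # no ImageNet weights: random features
resnet.fc = torch.nn.Identity()          # drop the classifier, keep 512-d features
resnet.eval()
for p in resnet.parameters():
    p.requires_grad = False              # frozen: never updated during policy training

with torch.no_grad():
    feats = resnet(torch.randn(2, 3, 224, 224))  # -> shape (2, 512)
```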
Yep, these results sound reasonable. PyTorch is only deterministic on identical hardware, and different machines will yield slightly different results. We fix the seed and use deterministic CUDA algorithms, so the same hardware will give you deterministic encoder training and very close to deterministic policy eval results, but results across machines won't be exactly the same. (See: https://pytorch.org/docs/stable/notes/randomness.html)
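For anyone comparing runs, here is a minimal sketch of the kind of seeding/determinism setup described above (not the repo's exact training code); even with this, kernels can still differ across GPU models:

```python
import os
import random

import numpy as np
import torch


def seed_everything(seed: int = 42) -> None:
    """Make a single-machine run as reproducible as PyTorch allows."""
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    # Required by some cuBLAS ops when deterministic algorithms are enforced.
    os.environ.setdefault("CUBLAS_WORKSPACE_CONFIG", ":4096:8")
    torch.use_deterministic_algorithms(True)
    torch.backends.cudnn.benchmark = False
```

Note that torch.use_deterministic_algorithms(True) raises an error if a nondeterministic op is hit, which makes silent divergence easier to catch.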
We did run a few repros on different cards and found the following:
Block push, blocks entered: 0.65 - 0.79
Run reported in paper: 0.65
PushT, final coverage: 0.59 - 0.66
Run reported in paper: 0.66
LIBERO, eval result: 0.78 - 0.93
Run reported in paper: 0.93
Let me know if you have any questions!
Thank you so much for your kind and detailed response!
We also have the MuJoCo 2.1.0 binary installed, along with the Python environment from conda_env.yml (gym==0.23.1, mujoco==2.3.7, mujoco_py==2.1.2.14) and the rendering setting specified in the README (MUJOCO_GL=egl).
My re-trained encoder checkpoint and completed tasks (max 1.2) for the kitchen environment are here:
I'll also try running the experiments on a different machine. Thank you!
Kitchen worked fine (3.75) on a different machine!
Hello,
Thank you for sharing this code and your research. I've been experimenting with dynamo in the Franka Kitchen environment but encountered difficulties reproducing the policy performance reported in the paper.
I've tried both the encoder trained with the code and the provided checkpoint, then trained VQ-BeT on top of each encoder. All configurations except the paths are kept the same as in the provided code, which is identical to the parameters from the paper.
I added code in models.encoder.multiview_stack.MultiviewStack to load the provided checkpoint into the encoder. Below are the wandb logs of SSL training. *Sorry for the unrecognizable values and some missing plots.
Wandb logs of BeT: orange is the re-trained encoder, pink is the provided checkpoint. *The eval total_loss reaches its minimum at 9.6, then overfits to 11+.
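(For context, checkpoint loading of this kind typically looks something like the sketch below; it is illustrative only, and the wrapper key names are assumptions rather than the exact checkpoint format.)

```python
import torch


def load_encoder_checkpoint(encoder: torch.nn.Module, ckpt_path: str) -> None:
    """Load pretrained encoder weights from a checkpoint file (illustrative sketch)."""
    ckpt = torch.load(ckpt_path, map_location="cpu")
    # Some checkpoints nest the weights under a wrapper key such as "state_dict"
    # or "model"; fall back to the raw dict otherwise.
    if isinstance(ckpt, dict):
        state_dict = ckpt.get("state_dict", ckpt.get("model", ckpt))
    else:
        state_dict = ckpt
    missing, unexpected = encoder.load_state_dict(state_dict, strict=False)
    print(f"missing keys: {missing}")
    print(f"unexpected keys: {unexpected}")
```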
To my understanding, eval_on_env should match the dynamo results in Table 1 of the paper. Any idea what could have gone wrong in my experiments?
The experiments were run on Ubuntu 20.04, Python 3.8, torch 2.4.0+cu121, CUDA 12.2, and a single H100, with all packages installed from the provided conda_env.yml except the CUDA-related ones.
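(For cross-machine comparisons, a quick way to capture the exact stack alongside the results is something like:)

```python
import platform

import torch

print("python :", platform.python_version())
print("torch  :", torch.__version__)
print("cuda   :", torch.version.cuda)
print("cudnn  :", torch.backends.cudnn.version())
print("gpu    :", torch.cuda.get_device_name(0) if torch.cuda.is_available() else "cpu")
```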
I would greatly appreciate any guidance you could offer to help me resolve this issue.