jeffacce / dynamo_ssl

DynaMo: In-Domain Dynamics Pretraining for Visuo-Motor Control
MIT License

Difficulty in reproducing results from paper #6

Closed nikriz1 closed 1 month ago

nikriz1 commented 1 month ago

Hello,

Thank you for sharing this code and your research. I've been experimenting with DynaMo in the Franka Kitchen environment, but I've had difficulty reproducing the policy performance reported in the paper.

I tried both an encoder trained with this code and the provided checkpoint, then trained VQ-BeT on top of each encoder. All configurations except paths are kept the same as in the provided code, which matches the parameters from the paper.

I added the code below in models.encoder.multiview_stack.MultiviewStack to load the provided checkpoint into the encoder.

from collections import OrderedDict
from .resnet import resnet18

...

        # If a raw state_dict (the provided checkpoint) is passed in instead of a
        # module, build a ResNet-18 and load the weights into it.
        if isinstance(encoders[0], OrderedDict):
            resnet = resnet18(
                pretrained=False,
                output_dim=output_dim,
                unit_norm=False,
            )
            resnet.load_state_dict(encoders[0])
            encoders = [resnet]
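
For context, this is roughly how the provided checkpoint reaches that patch on my side; the path below is a placeholder and the exact call site in my config differs:

    import torch

    # Load the released encoder weights as a raw state_dict (an OrderedDict);
    # the path is a placeholder for wherever the checkpoint was downloaded.
    ckpt = torch.load("path/to/provided_kitchen_encoder.pth", map_location="cpu")

    # This OrderedDict is what arrives as encoders[0] and is handled by the patch above.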

Below are the wandb logs of SSL training:

[wandb plots: SSL training]

*sorry for unrecognizable values and some missing plots

Below are the wandb logs of BeT training; orange is the re-trained encoder, pink is the provided checkpoint.

[wandb plots: BeT training]

*eval total_loss bottoms out at 9.6, then overfits to 11+

To my understanding, eval_on_env should match the DynaMo results in Table 1 of the paper. Any idea what could have gone wrong with my experiments?

The experiments were run on Ubuntu 20.04, Python 3.8, torch 2.4.0+cu121, CUDA 12.2, and a single H100, with all packages installed from the provided conda_env.yml except the CUDA-related ones.

I would greatly appreciate any guidance you could offer to help me resolve this issue.

nikriz1 commented 1 month ago

The above experiments finished with the provided checkpoint at max 1.95 / final 1.15, and the re-trained encoder at max 2.5 / final 1.2.

Results for the other tasks:

Blockpush: max moved mean 1.55, max entered mean 0.52 (paper: 0.65)
Push-T: max max coverage mean 0.62943, max final coverage mean 0.57945 (paper: 0.66)
LIBERO: max eval_on_env 0.85 (paper: 0.93)

All of these used re-trained encoders, and all show the same overfitting in eval total_loss.

jeffacce commented 1 month ago

Hey @nikriz1, thanks for checking out our work and running the experiments!

Sim kitchen

Yeah, sim kitchen shouldn't be this low. Perhaps the policy is going out of distribution?

We've just run some repro runs with freshly trained encoders and the uploaded checkpoint:

[wandb plot: sim kitchen repro runs]

The highlighted run (pink) is with a randomly initialized, frozen ResNet18 (random features), and the other runs are with either the provided checkpoint or a freshly trained encoder. Sim kitchen is a relatively open-loop environment, and in our experiments we've seen that just random or ImageNet pretrained features should do above 3. We couldn't reproduce your 1.2 result on our machine. We're using MuJoCo 2.1.0 with MUJOCO_GL=egl and the provided conda env.
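
As a sanity check, here is a minimal sketch (not the repo's code; PYOPENGL_PLATFORM is an extra assumption) of making sure EGL is selected before anything MuJoCo-related is imported on a headless machine:

    import os

    # Pick the EGL backend before mujoco / gym are imported, matching the
    # MUJOCO_GL=egl setting from the README.
    os.environ["MUJOCO_GL"] = "egl"
    # Some EGL setups also need this; harmless if it is already set.
    os.environ.setdefault("PYOPENGL_PLATFORM", "egl")

    import gym  # imported only after the rendering backend is chosen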

The 3.64 run reported in the paper is included below.

[wandb plot: sim kitchen, paper run]

Blockpush, Push-T, LIBERO

Yep these results sound reasonable. PyTorch is deterministic up to the same hardware, and different machines will yield slightly different results. We fix the seed and use deterministic CUDA algorithms, so the same hardware will give you deterministic encoder training, and very close to deterministic policy eval results, but results across machines won't be exactly the same. (See: https://pytorch.org/docs/stable/notes/randomness.html)
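
The setup is roughly the following (a generic sketch of the seeding/determinism settings described above, not the exact code in this repo; the function name is made up):

    import os
    import random

    import numpy as np
    import torch

    def seed_everything(seed: int = 42):
        # Fix every RNG source; this only makes runs repeatable on identical hardware.
        random.seed(seed)
        np.random.seed(seed)
        torch.manual_seed(seed)
        torch.cuda.manual_seed_all(seed)
        # Ask PyTorch for deterministic CUDA kernels where they exist.
        torch.use_deterministic_algorithms(True)
        torch.backends.cudnn.benchmark = False
        # Some cuBLAS ops require this when deterministic algorithms are enabled.
        os.environ.setdefault("CUBLAS_WORKSPACE_CONFIG", ":4096:8")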

We did run a few repros on different cards and found the following:

Blockpush

Blocks entered: 0.65 - 0.79

[wandb plot: blockpush repro runs]

Run reported in paper: 0.65

[wandb plot: blockpush, paper run]

Push-T

Final coverage: 0.59 - 0.66

[wandb plot: pusht repro runs]

Run reported in paper: 0.66

[wandb plot: pusht, paper run]

LIBERO Goal

Result: 0.78 - 0.93

[wandb plot: libero repro runs]

Run reported in paper: 0.93

[wandb plot: libero, paper run]

Let me know if you have any questions!

nikriz1 commented 1 month ago

Thank you so much for your kind and detailed response!

We also have the MuJoCo 2.1.0 binary installed, along with the Python environment from conda_env.yml (gym==0.23.1, mujoco==2.3.7, mujoco_py==2.1.2.14) and the rendering setting specified in the README (MUJOCO_GL=egl).

My re-trained encoder checkpoint and completed-tasks results (max 1.2) for the kitchen environment are here:

I'll also try running the experiments on a different machine. Thank you!

nikriz1 commented 1 month ago

Kitchen worked fine (3.75) on a different machine!