Diffusion Policy in vqbet paper has lower performance than the diffusion policy paper

StarCycle commented 2 months ago

Hi @jayLEE0301,

Thank you for sharing this work! I have a small question about the result...

Table 2 computes the intersection of area IoU in the pusht task (this metric is not equal to success rate, since the pusht env think IoU > 0.95 is a success). The performance of diffusion policy is quite low here, only 0.66 with the DiffPolicy-C (UNet)

But if you go to the Diffusion policy paper, DiffPolicy-C can easily achieve IoU of 80% in real world

Do you use the official implementation? Any clue about it?

jayLEE0301 commented 2 months ago

Hello @StarCycle,

Thank you for your interest in our work.

Differing metrics in PushT task: To clarify, there are two main metrics for PushT: (a) success rate, which is counted when the policy achieves a certain coverage at any point during the policy rollout and (b) final coverage, which is measured as the coverage at the end of the policy rollout. We report the final coverage metric as it is more indicative of final performance and less susceptible to arbitrary increases in coverage that a randomized policy can achieve over a long enough horizon.

The values of Diffusion Policy was obtained by running the checkpoints which are downloaded from the official repo. (https://diffusion-policy.cs.columbia.edu/data/experiments/)

StarCycle commented 2 months ago

Hello @jayLEE0301 @notmahi ,

I test diffusion policy using final IoU (following your settings). I load the official checkpoint, evaluate the Diffusion Policy with Unet for 20 episodes and get an average final IoU of 81.8%, here is my result:

The full colab notebook is diffusion_policy_vision_pusht_demo.ipynb.txt, please remove .txt in the file extension. Notice that I do several modifications to follow your settings:

This notebook is relatively old, I downgrade Jax to use it: !pip install "jax[cuda12_local]==0.4.23" -f https://storage.googleapis.com/jax-releases/jax_cuda_releases.html
I modify the reward computation from reward = np.clip(coverage / self.success_threshold, 0, 1) to reward = np.clip(coverage, 0, 1), so it directly return IoU.
I define an evaluate_single_episode() function that returns rewards[-1], i.e., the final IoU in this case.

Perhaps something gets wrong on my side...would you like to look into it?

jayLEE0301 commented 2 months ago

Hello @StarCycle

We checked the colab notebook you sent, but there are some differences between your colab notebook code and our evaluation setting, which is based on the DiffusionPolicy original github repo (https://github.com/real-stanford/diffusion_policy):

We tested with 300-step length episodes (200 in your colab notebook)
If you set reward = np.clip(coverage / self.success_threshold, 0, 1), it returns you a value coverage divided by success_threshold, not coverage. This will return a value greater than coverage.
We tested with 100 trials. Considering that the variance between each episode is very high, we recommend you run 100 trials rather than 20.
We tested with checkpoints downloaded from here (https://diffusion-policy.cs.columbia.edu/data/experiments/image/pusht/diffusion_policy_cnn/). We tried to import this checkpoint on your colab notebook, but found that it is not simply imported by replacing ckpt_path = "pusht_vision_100ep.ckpt".

Thus, we would like to show how you can run the evaluation using the original github repo and checkpoint downloaded from here (https://diffusion-policy.cs.columbia.edu/data/experiments/image/pusht/diffusion_policy_cnn/)

git clone (https://github.com/real-stanford/diffusion_policy) and install all the dependencies
You will need to make changes to eval.py, pusht_image_runner.py, and pusht_env.py. We created a github gist so that you can change each part easily. Check out this revision that shows which parts of each file you need to modify. (These modifications enable getting coverage arrays from multi-processed envs)

The average coverage values we obtained from the DiffusionPolicy-CNN's three checkpoints (which are in our manuscript) were as follows 0.6835, 0.6101, 0.6957 Today, we ran this experiment again on a new machine and got the following values (we found that there's a significant variance between conda env / machine / seeds). 0.6306, 0.6656, 0.7990

We hope this answer is helpful to you. Thank you.

StarCycle commented 2 months ago

Hello @jayLEE0301 ,

Thank you for your response!

I always use reward = np.clip(coverage, 0, 1) instead of reward = np.clip(coverage / self.success_threshold, 0, 1) in my test.
I tested with 100 300-step length episodes, and get final IoU of 78.2%

Which ckpt do you use exactly? Do you mean this one: https://diffusion-policy.cs.columbia.edu/data/experiments/image/pusht/diffusion_policy_cnn/train_0/checkpoints/epoch%3D0500-test_mean_score%3D0.884.ckpt
Great thanks for your code and I will try it later!

notmahi commented 2 months ago

Hi @StarCycle!

We evaluated and reported the result from the models present here in the official checkpoint release: https://diffusion-policy.cs.columbia.edu/data/experiments/image/pusht/diffusion_policy_cnn/ When we tried to load it in the notebook you shared, we saw a lot of error messages that seem to imply the architecture in the official repo and the notebook might be different.

RuntimeError: Error(s) in loading state_dict for ModuleDict:
Missing key(s) in state_dict: "vision_encoder.conv1.weight", "vision_encoder.bn1.weight", "vision_encoder.bn1.bias", "vision_encoder.layer1.0.conv1.weight", "vision_encoder.layer1.0.bn1.weight", "vision_encoder.layer1.0.bn1.bias", "vision_encoder.layer1.0.conv2.weight", "vision_encoder.layer1.0.bn2.weight", "vision_encoder.layer1.0.bn2.bias", "vision_encoder.layer1.1.conv1.weight", "vision_encoder.layer1.1.bn1.weight", "vision_encoder.layer1.1.bn1.bias", "vision_encoder.layer1.1.conv2.weight", "vision_encoder.layer1.1.bn2.weight", "vision_encoder.layer1.1.bn2.bias", "vision_encoder.layer2.0.conv1.weight", "vision_encoder.layer2.0.bn1.weight", "vision_encoder.layer2.0.bn1.bias", "vision_encoder.layer2.0.conv2.weight", "vision_encoder.layer2.0.bn2.weight", "vision_encoder.layer2.0.bn2.bias", "vision_encoder.layer2.0.downsample.0.weight", "vision_encoder.layer2.0.downsample.1.weight", "vision_encoder.layer2.0.downsample.1.bias", "vision_encoder.layer2.1.conv1.weight", "vision_encoder.layer2.1.bn1.weight", "vision_encoder.layer2.1.bn1.bias", "vision_encoder.layer2.1.conv2.weight", "vision_encoder.layer2.1.bn2.weight", "vision_encoder.layer2.1.bn2.bias", "vision_encoder.layer3.0.conv1.weight", "vision_encoder.layer3.0.bn1.weight", "vision_encoder.layer3.0.bn1.bias", "vision_encoder.layer3.0.conv2.weight", "vision_encoder.layer3.0.bn2.weight", "vision_encoder.layer3.0.bn2.bias", "vision_encoder.layer3.0.downsample.0.weight", "vision_encoder.layer3.0.downsample.1.weight", "visio...
Unexpected key(s) in state_dict: "_dummy_variable", "obs_encoder.obs_nets.image.backbone.nets.0.weight", "obs_encoder.obs_nets.image.backbone.nets.1.weight", "obs_encoder.obs_nets.image.backbone.nets.1.bias", "obs_encoder.obs_nets.image.backbone.nets.4.0.conv1.weight", "obs_encoder.obs_nets.image.backbone.nets.4.0.bn1.weight", "obs_encoder.obs_nets.image.backbone.nets.4.0.bn1.bias", "obs_encoder.obs_nets.image.backbone.nets.4.0.conv2.weight", "obs_encoder.obs_nets.image.backbone.nets.4.0.bn2.weight", "obs_encoder.obs_nets.image.backbone.nets.4.0.bn2.bias", "obs_encoder.obs_nets.image.backbone.nets.4.1.conv1.weight", "obs_encoder.obs_nets.image.backbone.nets.4.1.bn1.weight", "obs_encoder.obs_nets.image.backbone.nets.4.1.bn1.bias", "obs_encoder.obs_nets.image.backbone.nets.4.1.conv2.weight", "obs_encoder.obs_nets.image.backbone.nets.4.1.bn2.weight", "obs_encoder.obs_nets.image.backbone.nets.4.1.bn2.bias", "obs_encoder.obs_nets.image.backbone.nets.5.0.conv1.weight", "obs_encoder.obs_nets.image.backbone.nets.5.0.bn1.weight", "obs_encoder.obs_nets.image.backbone.nets.5.0.bn1.bias", "obs_encoder.obs_nets.image.backbone.nets.5.0.conv2.weight", "obs_encoder.obs_nets.image.backbone.nets.5.0.bn2.weight", "obs_encoder.obs_nets.image.backbone.nets.5.0.bn2.bias", "obs_encoder.obs_nets.image.backbone.nets.5.0.downsample.0.weight", "obs_encoder.obs_nets.image.backbone.nets.5.0.downsample.1.weight", "obs_encoder.obs_nets.image.backbone.nets.5.0.downsample.1.bias", "obs_encoder.obs_nets.i...RuntimeError: Error(s) in loading state_dict for ModuleDict:

We ran 100-episode evals with the three models present in the link I shared above.

jayLEE0301 / vq_bet_official

Diffusion Policy in vqbet paper has lower performance than the diffusion policy paper #4