real-stanford / diffusion_policy

[RSS 2023] Diffusion Policy: Visuomotor Policy Learning via Action Diffusion
https://diffusion-policy.cs.columbia.edu/
MIT License

Could you please clarify the evaluation code? (possible bug) #6

Closed: LostXine closed this issue 1 year ago

LostXine commented 1 year ago

Hello,

Thank you so much for your amazing work and beautiful code. However, while reading the code I got confused by this reward collection section. Could you please clarify why you use `len(self.env_fns)` in the first line of this block? https://github.com/columbia-ai-robotics/diffusion_policy/blob/27395b75008269ebac3ceb2192fadd647f288e7f/diffusion_policy/env_runner/robomimic_lowdim_runner.py#L320-L325

My understanding is that the current code takes only part of the trajectories into account when the number of running simulators is smaller than the number of trajectories to test. Please correct me if I am wrong, but I believe this line should be `for i in range(n_inits):` so that all trajectories are considered. Could you take a look? Will this issue affect the numbers reported in the paper?
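To illustrate what I mean, here is a toy sketch (hypothetical numbers and variable names, not the actual runner code; only `len(self.env_fns)` and `n_inits` come from the linked snippet):

```python
import numpy as np

# Toy illustration of the suspected issue (hypothetical numbers, not the
# actual runner code). Suppose n_inits rollouts were collected across
# len(self.env_fns) parallel simulators, here called n_envs.
rng = np.random.default_rng(0)
n_envs = 28    # stand-in for len(self.env_fns)
n_inits = 56   # number of test initial conditions
per_rollout_max_reward = rng.random(n_inits)  # stand-in for the collected rewards

# As currently written: only the first n_envs rollouts contribute to the metric.
partial_metric = per_rollout_max_reward[:n_envs].mean()

# With `for i in range(n_inits):` every rollout would contribute.
full_metric = per_rollout_max_reward.mean()

print(partial_metric, full_metric)  # the two values generally differ
```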

Similarly at:
https://github.com/columbia-ai-robotics/diffusion_policy/blob/27395b75008269ebac3ceb2192fadd647f288e7f/diffusion_policy/env_runner/robomimic_image_runner.py#L327
https://github.com/columbia-ai-robotics/diffusion_policy/blob/27395b75008269ebac3ceb2192fadd647f288e7f/diffusion_policy/env_runner/kitchen_lowdim_runner.py#L282
https://github.com/columbia-ai-robotics/diffusion_policy/blob/27395b75008269ebac3ceb2192fadd647f288e7f/diffusion_policy/env_runner/blockpush_lowdim_runner.py#L238

By the way, I'm also curious why you take the mean over ten checkpoints as your evaluation metric. Do you have a specific reason for doing so?

Thank you so much for your time.

Best regards,

Xiang Li

cheng-chi commented 1 year ago

@LostXine Thank you so much for pointing this out! Yes, this is a legitimate bug that can change the eval metrics. Fortunately, I don't think it invalidates our conclusions, because:

  1. This bug only affects the variance of the metrics (they end up averaged over half of the samples), not their mean (see the sketch after this list).
  2. All of our baselines are evaluated using the same code.
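Here is a quick numerical sketch of point 1 (a hypothetical setup with an assumed success rate, not the paper's evaluation): averaging over a subset of rollouts is still centered on the true success rate, it is just noisier.

```python
import numpy as np

# Rough numerical check of the variance-vs-mean claim (hypothetical setup).
rng = np.random.default_rng(0)
p_success = 0.7              # assumed true success rate
n_inits, n_envs = 56, 28     # rollouts collected vs. rollouts actually averaged
n_repeats = 10_000           # repeat the evaluation many times

outcomes = rng.random((n_repeats, n_inits)) < p_success
full_means = outcomes.mean(axis=1)                  # average over all rollouts
partial_means = outcomes[:, :n_envs].mean(axis=1)   # average over half of them

print(full_means.mean(), partial_means.mean())  # both close to 0.7 (same mean)
print(full_means.std(), partial_means.std())    # the partial estimate is noisier
```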

Since this bug also affects metric logging, I can't simply recompute the metrics from the logs, and rerunning the experiments would be very expensive. Therefore, I'm going to:

  1. Fix the bug in this repo so people building off of this codebase won't have the same bug.
  2. Leave comments in the code so people are aware of the bug.
  3. Update the arXiv version of the paper to reflect the correct number of samples used for the metrics.

I will leave this issue open until all of these are done.

LostXine commented 1 year ago

@cheng-chi Thank you so much for your quick response. Have a nice day.

cheng-chi commented 1 year ago

arXiv replacement submitted. PDF updated on the project website.