tomato1mule / diffusion_edf

[CVPR 2024 Highlight] Diffusion-EDFs: Bi-equivariant Denoising Generative Modeling on SE(3) for Visual Robotic Manipulation
MIT License

How to evaluate the trained models? #5

Closed. ZXP-S-works closed this issue 6 months ago

ZXP-S-works commented 6 months ago

Hi Hyunwoo,

Thanks for the great work!

I am wondering how you evaluate the trained models. I find that the trained checkpoints are saved in ./runs/..., but the evaluation notebook loads models from ./checkpoints/[task]/..., and the file names change as well. I guess there is a script that automatically moves and renames the checkpoints?

Another question: the training produces 3 models for one task (e.g., pick_ebm, pick_hires, pick_lowres). It seems you use the low-resolution model for the initial diffusion steps, the high-resolution model for the final diffusion steps, and the EBM model to evaluate the diffused trajectories. Is this correct? If so, why not use a single diffusion model and diffuse just one trajectory?

Thank you for your time!

Best, XP

tomato1mule commented 6 months ago

Hi Xupeng,

Thanks for your interest in our work!

> I am wondering how you evaluate the trained models. I find that the trained checkpoints are saved in ./runs/..., but the evaluation notebook loads models from ./checkpoints/[task]/..., and the file names change as well. I guess there is a script that automatically moves and renames the checkpoints?

They are not automatically renamed and moved. I just manually renamed them :)
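For anyone reproducing the evaluation, a minimal sketch of that manual step is below. The run directory and checkpoint file names are placeholders for illustration, not the actual names produced by the training script:

```python
import shutil
from pathlib import Path

# Hypothetical paths: adjust to the names in your own ./runs/ directory and
# to whatever the evaluation notebook expects under ./checkpoints/[task]/.
src = Path("runs/pick_lowres_2024-01-01_12-00-00/checkpoint.pt")
dst = Path("checkpoints/pick/lowres_model.pt")

dst.parent.mkdir(parents=True, exist_ok=True)  # create ./checkpoints/pick/ if missing
shutil.copy(src, dst)                          # copy and rename in one step
print(f"copied {src} -> {dst}")
```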

> Another question: the training produces 3 models for one task (e.g., pick_ebm, pick_hires, pick_lowres). It seems you use the low-resolution model for the initial diffusion steps, the high-resolution model for the final diffusion steps, and the EBM model to evaluate the diffused trajectories. Is this correct? If so, why not use a single diffusion model and diffuse just one trajectory?

Yes, this is an important point. I first tried a single-resolution model but found that it is biased toward coarse global features. When the diffusion time is large, only the coarse-grained global geometry matters and local fine-grained geometry is not very important. The problem is that, with a single-resolution model, this negatively impacts the quality of score matching at smaller diffusion times, where local fine-grained geometry becomes more important.

As an illustrative example, if one uses a single-resolution model to generate mug-picking poses, the generated orientation is always perpendicular to the table, even when the target mug is rotated. This is because only upright mugs were presented during training, so the model learned to generate gripper poses perpendicular to the table rather than poses equivariant to the mug. This is not necessarily wrong or bad; it is just an ambiguity in the demonstrations. Nevertheless, this loss of locality makes the generated poses a bit inaccurate. Therefore, I separated the high- and low-resolution models to prevent negative transfer between smaller and larger diffusion times. Such coarse-to-fine modeling is quite prevalent in 3D robotic manipulation models.
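To make the coarse-to-fine scheme concrete, here is a toy, purely illustrative sketch. It uses a simple Euclidean analogue rather than SE(3) poses, and the names (`toy_score`, `ebm_energy`, the `t_switch` threshold) are made up for this example, not part of the repo:

```python
import torch

def toy_score(x, t, sharpness):
    # Score of a zero-centered Gaussian; `sharpness` stands in for model resolution.
    return -x / (t + 1.0 / sharpness)

def ebm_energy(x):
    # Lower energy = better candidate (stands in for the pick_ebm ranker).
    return (x ** 2).sum(dim=-1)

def sample(n_candidates=32, n_steps=50, t_switch=0.5):
    x = 3.0 * torch.randn(n_candidates, 3)            # start from noise
    for i in range(n_steps):
        t = 1.0 - i / n_steps                         # diffusion time goes 1 -> 0
        sharpness = 1.0 if t > t_switch else 10.0     # low-res early, hi-res late
        x = x + 0.05 * toy_score(x, t, sharpness) + 0.05 * t * torch.randn_like(x)
    return x[ebm_energy(x).argmin()]                  # EBM picks the best sample

print(sample())
```

The point of the split is only that the score model used at large diffusion times need not be the one used at small diffusion times, so each can specialize in the level of detail that matters at its stage.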

However, I believe it is possible to prevent this negative transfer in a single model by adaptively controlling the model's bandwidth. Currently, the pooling layer in the encoder has a point-wise skip connection, so it essentially applies no low-pass filtering, unlike the average pooling in a 2D UNet. I think this might also have been part of the problem.
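For intuition about the bandwidth point, here is a small 1D analogue (the actual encoder pools point clouds, so this is only illustrative, and none of these tensors correspond to the repo's layers):

```python
import torch
import torch.nn.functional as F

x = torch.randn(1, 1, 16)                    # a toy 1D feature signal

# Average pooling acts as a low-pass filter: neighboring features are blended,
# which is roughly what a 2D UNet's pooling does.
lowpass = F.avg_pool1d(x, kernel_size=2)

# A point-wise skip connection added at the pooling stage passes the original,
# unfiltered features through, so high-frequency detail is never attenuated.
with_skip = F.avg_pool1d(x, kernel_size=2) + x[..., ::2]
```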

ZXP-S-works commented 6 months ago

Hi Hyunwoo,

Thanks for the explanation! That addressed my questions.

Best, XP