Question about evaluation metrics in PushT environment: Fixed episode length and final coverage choice

gritYCDA opened this issue 1 week ago (open)

First, thank you for sharing this impressive work and codebase. I have some questions about the evaluation methodology in the PushT environment, specifically regarding the choice of episode length and metrics.

Current implementation

From the code and from my reproduction of experiments with your codebase, episodes run for a fixed 300 steps and final coverage is the reported metric:
https://github.com/jeffacce/dynamo_ssl/blob/62874039e5d3fc663b0fccd2083b2dd2c8d9935e/envs/pusht.py#L478

Questions

1. What was the reasoning behind using fixed 300-step episodes with final coverage as the main metric?
2. Would you be willing to share comparison results with other SSL methods using additional metrics?

I'm genuinely interested in understanding the intended evaluation objectives and why this particular setup was chosen over alternatives. Your insights would be very helpful for better understanding evaluation methodology in manipulation tasks.

---

Jeff replied:
Hi @gritYCDA,
Thanks for your interest in our work!
We use the Push-T environment implementation from VQ-BeT, which runs every evaluation episode for 300 steps: https://github.com/jayLEE0301/vq_bet_official/blob/09d4851288ca5deaaa1ab367a208e520f8ee9a84/examples/pusht_env.py#L478

I believe this follows the original Push-T environment, which caps episodes at 300 steps: https://github.com/real-stanford/diffusion_policy/blob/548a52bbb105518058e27bf34dcf90bf6f73681a/diffusion_policy/config/task/pusht_image.yaml#L25

In our experiments, we observe that final coverage at T=300 and max coverage during the episode are highly correlated, so we report final coverage, consistent with the VQ-BeT paper.
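To make the comparison concrete, here is a minimal sketch of how one could compute both metrics from per-step coverage traces and check their correlation across episodes. The array shape and the random placeholder data are assumptions for illustration, not the repo's actual logging format:

```python
import numpy as np

# Minimal sketch (not the repo's logging code): given per-step coverage
# traces for a batch of evaluation episodes, compare the final-coverage
# and max-coverage metrics.
coverage_traces = np.random.rand(50, 300)  # placeholder: (n_episodes, T=300)

final_coverage = coverage_traces[:, -1]     # coverage at T=300
max_coverage = coverage_traces.max(axis=1)  # best coverage during each episode

# Correlation between the two metrics across episodes
corr = np.corrcoef(final_coverage, max_coverage)[0, 1]
print(f"mean final: {final_coverage.mean():.2f}, "
      f"mean max: {max_coverage.mean():.2f}, correlation: {corr:.3f}")
```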
We haven't tried the alternative metrics you proposed here, but here are the Push-T runs in the paper with both final and max coverage listed:
| Representation | Final Coverage | Max Coverage |
|---|---|---|
| Random | 0.07 | 0.18 |
| ImageNet | 0.41 | 0.46 |
| R3M | 0.49 | 0.52 |
| VC-1 | 0.38 | 0.44 |
| MVP | 0.20 | 0.28 |
| BYOL | 0.23 | 0.32 |
| BYOL T | 0.34 | 0.42 |
| MoCo v3 | 0.57 | 0.60 |
| RPT | 0.56 | 0.58 |
| TCN SV | 0.07 | 0.08 |
| MAE | 0.07 | 0.08 |
| DynaMo | 0.66 | 0.69 |
If you'd like to run additional experiments with alternative metrics, you can modify this part: https://github.com/jeffacce/dynamo_ssl/blob/62874039e5d3fc663b0fccd2083b2dd2c8d9935e/online_eval.py#L217-L224
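For instance, a helper along these lines could return max coverage alongside final coverage. This is a hypothetical sketch assuming a gym-style environment whose step() info dict exposes a per-step "coverage" value; the names here are illustrative, not the actual variables in online_eval.py:

```python
def rollout_coverage(env, policy, max_steps: int = 300):
    """Hypothetical sketch: run one episode, return (final, max) coverage.

    Assumes a gym-style `env` whose step() info dict exposes a per-step
    "coverage" value; these names are illustrative, not the repo's API.
    """
    obs = env.reset()
    final_cov, max_cov = 0.0, 0.0
    for _ in range(max_steps):
        obs, reward, done, info = env.step(policy(obs))
        final_cov = info.get("coverage", final_cov)  # latest coverage
        max_cov = max(max_cov, final_cov)            # running best
        if done:
            break
    return final_cov, max_cov
```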
Let us know if you have any other questions!
Best, Jeff