jeffacce / dynamo_ssl

DynaMo: In-Domain Dynamics Pretraining for Visuo-Motor Control

Question about evaluation metrics in PushT environment: Fixed episode length and final coverage choice #8

Open gritYCDA opened 1 week ago

gritYCDA commented 1 week ago

First, thank you for sharing this impressive work and codebase. I have some questions about the evaluation methodology in the PushT environment, specifically regarding the choice of episode length and metrics.

Current Implementation and Results

From the code and my reproduction of experiments using your codebase:

# Fixed episode length implementation
if self.step_idx > 300:
    done = True

# Results from my experimental runs
final coverage max: 0.97899    # Best final state coverage
final coverage mean: 0.54716   # Average final state coverage
max coverage max: 0.97953      # Best achieved coverage
max coverage mean: 0.59142     # Average max coverage achieved

https://github.com/jeffacce/dynamo_ssl/blob/62874039e5d3fc663b0fccd2083b2dd2c8d9935e/envs/pusht.py#L478

Questions

  1. What was the reasoning behind using fixed 300-step episodes with final coverage as the main metric?

    • I noticed that episodes run for exactly 300 steps, regardless of task completion
    • The final coverage at this fixed endpoint (mean: 54.7%) is used as the main evaluation metric
    • I'm curious whether this metric might be sensitive to the arbitrary choice of episode length
    • This setup might not directly reflect the policy's task completion capability
  2. Would you be willing to share comparison results with other SSL methods using additional metrics? (A rough sketch of what I mean by these is included after this list.)

    • Results using max coverage-based success rates
    • Time-to-success measurements (steps to achieve target coverage)
    • Stability metrics (duration of maintaining high coverage)
    • Any other performance metrics you might have explored
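
Here is the kind of computation I have in mind, from a per-step coverage trace of a single episode; the 0.95 threshold and the helper below are just my own illustration, not something taken from your codebase:

# Illustrative sketch (not from the repo): the alternative metrics above,
# computed from a per-step coverage trace of one episode.
import numpy as np

def episode_metrics(coverage, success_thresh=0.95):
    # coverage: per-step target coverage in [0, 1] over the episode
    coverage = np.asarray(coverage, dtype=float)
    above = coverage >= success_thresh
    return {
        "final_coverage": float(coverage[-1]),
        "max_coverage": float(coverage.max()),
        "success": bool(above.any()),  # max-coverage-based success
        # steps until the threshold is first reached (time-to-success), None if never
        "steps_to_success": int(above.argmax()) if above.any() else None,
        # number of steps spent at/above the threshold (a crude stability measure)
        "steps_above_thresh": int(above.sum()),
    }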

I'm genuinely interested in understanding the intended evaluation objectives and why this particular setup was chosen over alternatives. Your insights would be very helpful for better understanding the evaluation methodology in manipulation tasks.

jeffacce commented 5 days ago

Hi @gritYCDA ,

Thanks for your interest in our work!

  1. We use the Push-T environment implementation from VQ-BeT, which runs every episode for the full 300 steps: https://github.com/jayLEE0301/vq_bet_official/blob/09d4851288ca5deaaa1ab367a208e520f8ee9a84/examples/pusht_env.py#L478. I believe this follows the original Push-T environment, which caps episodes at 300 steps: https://github.com/real-stanford/diffusion_policy/blob/548a52bbb105518058e27bf34dcf90bf6f73681a/diffusion_policy/config/task/pusht_image.yaml#L25. In our experiments, we observe that final coverage at T=300 and max coverage during the episode are highly correlated (a quick way to check this is sketched after the table below), so we report final coverage, consistent with the VQ-BeT paper.

  2. We haven't tried the alternative metrics you proposed here, but here are the Push-T runs in the paper with both final and max coverage listed:

    | Representation | Final Coverage | Max Coverage |
    | --- | --- | --- |
    | Random | 0.07 | 0.18 |
    | ImageNet | 0.41 | 0.46 |
    | R3M | 0.49 | 0.52 |
    | VC-1 | 0.38 | 0.44 |
    | MVP | 0.20 | 0.28 |
    | BYOL | 0.23 | 0.32 |
    | BYOL T | 0.34 | 0.42 |
    | MoCo v3 | 0.57 | 0.60 |
    | RPT | 0.56 | 0.58 |
    | TCN SV | 0.07 | 0.08 |
    | MAE | 0.07 | 0.08 |
    | DynaMo | 0.66 | 0.69 |
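
For reference, the correlation mentioned in point 1 can be checked along these lines; this is only a minimal sketch over per-episode coverage traces, with an assumed data layout rather than code from our repo:

# Minimal sketch (assumed data layout, not repo code): Pearson correlation
# between final coverage at T=300 and max coverage within each episode.
import numpy as np

def final_max_correlation(per_episode_coverage):
    # per_episode_coverage: list of per-step coverage arrays, one per episode
    final = np.array([c[-1] for c in per_episode_coverage])
    best = np.array([np.max(c) for c in per_episode_coverage])
    return float(np.corrcoef(final, best)[0, 1])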

If you'd like to run additional experiments with alternative metrics, you can modify this part: https://github.com/jeffacce/dynamo_ssl/blob/62874039e5d3fc663b0fccd2083b2dd2c8d9935e/online_eval.py#L217-L224
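
For example, a max-coverage-based success rate could be aggregated there with something like the sketch below; the names are illustrative assumptions, not the actual variables in online_eval.py:

# Hypothetical sketch: success rate across evaluation episodes, defined as the
# fraction of episodes whose max coverage reaches a threshold.
import numpy as np

def success_rate(per_episode_max_coverage, thresh=0.95):
    # per_episode_max_coverage: one max-coverage value per evaluation episode
    max_cov = np.asarray(per_episode_max_coverage, dtype=float)
    return float((max_cov >= thresh).mean())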

Let us know if you have any other questions!

Best, Jeff