[Open] Jubayer-Hamid opened this issue 3 months ago
Hi there, your results look a bit below what we achieved with our pre-trained policy here, but not too far off either.
How many eval episodes did you run? Could you paste your command or a link to your wandb run so we can check your config?
@alexander-soare probably has more insights on this.
@Jubayer-Hamid thanks for raising this. VQ-BeT and Diffusion Policy should give about the same results. In fact, the models we have on the hub (DP, VQ-BeT) happen to both give 63.8% success rate with 500 evals.
If you try running evals with:
python lerobot/scripts/eval.py -p path/to/pretrained_model eval.n_episodes=500 eval.batch_size=50 eval.use_async_envs=true use_amp=true
what do you get?
The curves you are showing are likely using only 50 evaluation episodes if you kept the default settings, meaning the variance is quite high.
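For a rough sense of that noise, here is a back-of-the-envelope sketch (plain Python, not part of the repo; the 0.64 success rate is just an assumed value close to the reported one):

```python
import math

# Approximate standard error of a success rate estimated from n evaluation
# episodes, treating each episode as an independent Bernoulli trial:
# sqrt(p * (1 - p) / n).
def success_rate_std_err(p: float, n: int) -> float:
    return math.sqrt(p * (1.0 - p) / n)

p = 0.64  # assumed true success rate, near the reported 63.8%
for n in (50, 500):
    se = success_rate_std_err(p, n)
    print(f"n={n}: std err ~ {se * 100:.1f} pp, ~95% interval ~ +/-{1.96 * se * 100:.1f} pp")

# With 50 episodes the ~95% interval is roughly +/-13 percentage points, so two
# runs can easily land 5-10 points apart; with 500 it shrinks to about +/-4 points.
```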
Hi, thanks for the prompt response. After re-running with 500 evaluation episodes, VQ-BeT's success rate got much closer to Diffusion Policy's.
Hi LeRobot authors,
Thank you for your fantastic repo!
I wanted to follow up regarding the expected results of VQ-BET. My collaborator and I ran your checkpoint and config across 500 episodes using the following command:
python lerobot/scripts/eval.py -p lerobot/vqbet_pusht eval.n_episodes=500 eval.batch_size=50
However, our results consistently came out lower than what’s reported on your HF page. Here are the results we obtained on two different GPU machines:
{'avg_sum_reward': 95.71936310205473, 'avg_max_reward': 0.8872214382670427, 'pc_success': 61.0}
{'avg_sum_reward': 99.5425249914288, 'avg_max_reward': 0.8906604772845598, 'pc_success': 61.0}
We're wondering if a recent code update might have impacted the evaluation. Could you please confirm the results for the released checkpoint?
Thanks, Yuejiang
@YuejiangLIU I just ran: python lerobot/scripts/eval.py -p lerobot/vqbet_pusht eval.n_episodes=500 eval.batch_size=50 eval.use_async_envs=true
and got
{'avg_sum_reward': 97.27730768077599, 'avg_max_reward': 0.8951385362406257, 'pc_success': 63.800000000000004, 'eval_s': 89.14394330978394, 'eval_ep_s': 0.17828788709640503}
(this is as reported at https://huggingface.co/lerobot/vqbet_pusht)
I even ran it without eval.use_async_envs=true
(just to match your command exactly) and got the same result.
I'm on commit hash 2252b42337669f56522866ac6f5ec47f6aef78da.
I'm wondering if this is somehow related to system configuration or hardware. I'm using an Nvidia RTX 3090 on Ubuntu 22. @aliberts any other ideas? (For context, you only need to read @YuejiangLIU's last message and this one: their runs fall short of the reported success rate by a small margin.)
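As a quick sanity check that a 61.0% vs 63.8% gap over 500 episodes can be plain sampling noise, here is a small sketch (plain Python; it treats the two evaluations as independent binomial samples, which is an assumption, since the eval seeds may actually be shared):

```python
import math

n = 500                  # evaluation episodes per run
p1, p2 = 0.610, 0.638    # observed success rates (reported above vs the hub card)

# Standard error of the difference between two independent proportions.
se_diff = math.sqrt(p1 * (1 - p1) / n + p2 * (1 - p2) / n)
z = (p2 - p1) / se_diff
print(f"gap = {100 * (p2 - p1):.1f} pp, std err of gap ~ {100 * se_diff:.1f} pp, z ~ {z:.2f}")

# z comes out well under 2 (around 0.9), so a 2.8-point gap at n=500 is
# comfortably within what chance alone can produce for the same checkpoint.
```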
Hi,
Thank you to the LeRobot community for maintaining such a fantastic codebase. My research group and I have greatly benefited from your efforts. In my current project, I am using the repository primarily for analyzing algorithms across different environments. I wanted to raise an issue I am encountering with VQ-BeT. I have been using the model on PushT and I want to ensure that the results I am obtaining align with community expectations. If not, I might be using the VQ-BeT repository incorrectly and would appreciate any guidance.
I used the following command: python lerobot/scripts/train.py policy=vqbet env=pusht
For VQ-BeT, it seems like the maximum success rate is exactly 60%, whereas for Diffusion Policy the maximum success rate is 74%. Below, I have attached the wandb figures for the success rate vs training steps (left is for VQ-BeT and right is for Diffusion Policy):
Are these results expected for the algorithm? If not, am I running the wrong commands to reproduce the SOTA results?
Thank you for your assistance.