markusgrotz / peract_bimanual

Apache License 2.0

[Eval] How to choose the best checkpoint in the paper? #6

Open aopolin-lv opened 3 weeks ago

aopolin-lv commented 3 weeks ago

Hello, after completing the training of the model, I am not sure how to choose the right checkpoint, so I would appreciate it if you could answer the following questions.

  1. When evaluating and testing, do you run the eval.py script on the checkpoints saved every 10k steps and select the one with the highest score after the training process completes? Specifically, given 40k training steps, the checkpoints at 10k, 20k, 30k, and 40k would be evaluated one by one, and the one with the highest validation score would be used for the final test on the test set.

  2. How can I improve the speed of testing? Specifically, when I run the eval.py script, it takes 1 hour to complete 25 episodes of a single task. The hardware I'm using includes an Intel-8352V CPU with 72 cores and an A800-80G GPU with performance similar to the A100-80G. May I ask what your typical efficiency is when running eval.py?
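The selection procedure in question 1 can be sketched as follows. This is only an illustration: `eval_fn` is a stand-in for launching eval.py on the validation set, not an actual function in this repo, and the scores below are made up.

```python
def select_best_checkpoint(eval_fn, total_steps=40_000, interval=10_000):
    """Score every saved checkpoint (10k, 20k, ...) on the validation
    set and return the step with the highest score plus all scores."""
    scores = {step: eval_fn(step)
              for step in range(interval, total_steps + 1, interval)}
    best_step = max(scores, key=scores.get)
    return best_step, scores

# Example with dummy validation scores (real use: eval_fn runs eval.py)
dummy = {10_000: 0.42, 20_000: 0.55, 30_000: 0.61, 40_000: 0.58}
best_step, all_scores = select_best_checkpoint(dummy.get)
print(best_step)  # 30000
```

The best checkpoint by validation score is then evaluated once on the held-out test set.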

aopolin-lv commented 3 weeks ago

By the way, the validation sets cannot be accessed from the FTP server. Could you please upload the relevant datasets?

markusgrotz commented 3 weeks ago
  1. Training: I use SLURM for launching my jobs for training/evaluation. Everything is pretty automated. I will provide some details soon. The code / documentation is still a work in progress, and I hope I can find more time to work on it soon.
  2. Evaluation: That is too slow. Do you have more insights on the setup? Is it a headless setup?
  3. Dataset: I have been actively looking for an alternative hosting option and transferred the data today to https://dataset.cs.washington.edu/fox/bimanual/

Let me know if that helps
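For readers unfamiliar with SLURM, an automated launch like the one described above might look roughly like the sketch below. All `#SBATCH` values and eval.py arguments are placeholders, not the repo's documented interface.

```shell
#!/bin/bash
# Hypothetical SLURM batch script: evaluate one checkpoint per job.
# Submit with e.g.: sbatch eval.sbatch 30000
#SBATCH --job-name=peract2-eval
#SBATCH --gres=gpu:1
#SBATCH --cpus-per-task=8
#SBATCH --time=04:00:00

CKPT_STEP=${1:-40000}   # which saved checkpoint to evaluate

# Placeholder invocation; check the repo for the real eval.py arguments.
python eval.py --checkpoint "weights/${CKPT_STEP}" --episodes 25
```

Submitting one such job per checkpoint lets all evaluations run in parallel instead of one by one.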

aopolin-lv commented 3 weeks ago

Thank you for your reply. The data download is now very convenient! However, I still have some questions about the time required for training and evaluation.

  1. Training: I used the `bimanual_peract` configuration with a batch_size of 4, which occupied about 46 GB of GPU memory. Training for 40k iterations took approximately 15-16 hours.
  2. Evaluating: I used 25 episodes, with each task taking about 1 hour. Everything was done under the headless setting.

The paper mentions that the bimanual setting results in a total training time of about 54 hours. However, my single-task training takes 15 hours, so the total training time for all tasks would be 15 * 13 = 195 hours, which far exceeds the time reported in the paper. Is there anything I should improve? Evaluation also takes too much time; what can I do to reduce the cost?

markusgrotz commented 2 weeks ago

That's great that you're able to train the network! Just to clarify, the paper doesn't mention the total training cost; instead, Table 4 reports the average training time. To estimate your total training time, you'd multiply this average by the number of tasks you’re running. Given that your setup may differ in hardware or other configurations, it's also expected that the actual time might vary.
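The distinction above can be made concrete with a back-of-envelope calculation, using the numbers from this thread (~15 h per task, 13 tasks); these inputs come from the discussion, not from the paper's Table 4.

```python
# Total training time = per-task average * number of tasks.
per_task_hours = 15   # observed single-task training time (from this thread)
num_tasks = 13        # number of tasks being trained (from this thread)

total_hours = per_task_hours * num_tasks
print(total_hours)  # 195
```

So a per-task average of 15 h is not directly comparable to a total figure; the totals only match if the paper's average is multiplied out the same way.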

Regarding the evaluation, I assume this is due to the headless mode, but I need more information. Is this some kind of HPC system? Happy to chat to help speed things up.
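One common cause of slow headless evaluation with CoppeliaSim/RLBench-style simulators is silently falling back to CPU software rendering. A quick, environment-dependent check (this assumes `glxinfo` from mesa-utils and `xvfb-run` are installed, which may not hold on every cluster):

```shell
# If the renderer string mentions "llvmpipe", the simulator is rendering
# on the CPU via Mesa's software rasterizer, which would explain slow eval.
# A GPU name (e.g. an NVIDIA device) means hardware acceleration is active.
xvfb-run -a glxinfo | grep -i "opengl renderer"
```

If it reports software rendering, options include VirtualGL or a GPU-backed X server so the simulator can actually use the A800.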

aopolin-lv commented 2 weeks ago

I apologize for mistakenly considering the average task training time in the paper as the total training time. So far, I have only completed the training for the coordinated_lift_ball task and have not yet conducted a full test to verify the effectiveness of the training. Additionally, I am not very familiar with HPC. I am using a regular GPU computing server without any special modifications. By the way, could you please provide the specific configurations for training and validation? This would help us troubleshoot in case any issues arise.

aopolin-lv commented 2 weeks ago

Hi, could you release the model checkpoints (including ACT/RVT-LF/Peract-LF/Peract^2) for reproducing the results reported in the paper?

markusgrotz commented 4 days ago

Hi aopolin-lv,

I have the first results for multi-task training! I will update the webpage with the results soon, but I would like to finish the documentation first. I can also share my checkpoints then.

Let me know if you have any further questions.

Kind regards, Markus