nickgkan / 3d_diffuser_actor

Code for the paper "3D Diffuser Actor: Policy Diffusion with 3D Scene Representations"
https://3d-diffuser-actor.github.io/
MIT License

about training cost #15

Closed PointsCoder closed 3 months ago

PointsCoder commented 3 months ago

Hi,

Thanks for your great work!

I am trying to train your model on CALVIN and found that the training time is very long. The default setting seems to train for 600,000 iterations, which would take about 200 GPU hours on my side. Could you share the training setup (number of GPUs, per-GPU batch size, number of training iterations) used for your best CALVIN models?

PointsCoder commented 3 months ago

I found some useful explanations in prior issues #9 and #5, but one question remains: you said you used a batch size of 1080 with 6 GPUs, but it seems that 40GB cards can only fit a batch size of 30 per GPU. Could you share your exact setting for training on CALVIN?

twke18 commented 3 months ago

Hi,

The effective batch size in our experiments is in fact num_gpus * num_episodes_per_gpu * length_per_episode. You can see this line for how every episode is chunked. If an episode is shorter than the specified length, we use the whole episode. In our experiments, we use 6 GPUs, 30 episodes per GPU, and a maximum episode length of 30. That is to say, our maximum batch size would be 5400, not 1080. We didn't notice this mistake until now! Sorry for the confusion.
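To make the arithmetic concrete, here is a minimal sketch of the batching logic described above. The variable and function names (`num_gpus`, `episodes_per_gpu`, `chunk_episode`) are illustrative, not the repo's actual identifiers, and the random-crop chunking is an assumption; see the linked line in the codebase for the real implementation.

```python
# Hedged sketch of the effective-batch-size arithmetic; names are
# illustrative, not the repo's actual identifiers.
import torch

num_gpus = 6               # GPUs used in the paper's CALVIN runs
episodes_per_gpu = 30      # per-GPU batch size, counted in episodes
max_episode_length = 30    # episodes are chunked to at most this length

# Every step of a (chunked) episode is a training sample, so:
max_effective_batch = num_gpus * episodes_per_gpu * max_episode_length
assert max_effective_batch == 5400

def chunk_episode(episode: torch.Tensor, max_len: int = 30) -> torch.Tensor:
    """Crop an episode to at most `max_len` steps; shorter episodes are
    used whole. (Random cropping here is an assumption for illustration.)"""
    if episode.shape[0] <= max_len:
        return episode
    start = torch.randint(0, episode.shape[0] - max_len + 1, (1,)).item()
    return episode[start : start + max_len]
```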

On CALVIN, I believe you don't need to train for the full 600,000 iterations; we trained for only 65,000 iterations in our experiments. Also, we didn't ablate the effect of different batch sizes, since we spent most of our effort on the RLBench experiments.

PointsCoder commented 3 months ago

Thank you! That addressed my concerns.

chenfengxu714 commented 3 months ago

Hi,

I noticed that, according to your training log, training takes one day. I used 6 A6000 GPUs with your script config for 60K iterations, yet it takes ~6 days. May I know what modifications could accelerate it? I think there must be some misalignment between your setup and ours.

Best regards

twke18 commented 3 months ago

Hi,

IO is indeed the bottleneck of our codebase. We hosted the data on an SSD; without an SSD, training the model takes much longer. The number of data-loading threads also matters. Our CPU is an AMD EPYC 7502 32-Core Processor.
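For reference, the thread count lives on PyTorch's `DataLoader`. Below is a minimal sketch of the IO-related knobs worth tuning; the dataset and the parameter values are placeholders, not the repo's actual config, and only `num_workers` corresponds directly to the thread count mentioned above.

```python
# Hedged sketch of PyTorch DataLoader knobs that affect IO throughput;
# the dataset and values are placeholders to tune for your machine.
import torch
from torch.utils.data import DataLoader, TensorDataset

dataset = TensorDataset(torch.randn(1000, 8))  # stand-in for the CALVIN data

loader = DataLoader(
    dataset,
    batch_size=30,            # episodes per GPU, as discussed above
    num_workers=16,           # more worker processes hide slow disk IO
    pin_memory=True,          # faster host-to-GPU transfers
    prefetch_factor=2,        # batches each worker pre-loads
    persistent_workers=True,  # keep workers alive between epochs
)
```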

chenfengxu714 commented 3 months ago

Thanks for your reply! Yeah, I tweaked these settings a bit and now it trains in one day!