cnexah closed this issue 1 year ago
Not having cached images would make a difference for sure, especially if you're storing those disks on a hard drive.
We've included handy conversions for those images.
500 hours is a little suspicious though. Do you have enough data workers? Do you have too many? Can you check what htop says the CPU utilization is?
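If `htop` itself won't run, a rough check is possible from Python alone. This is a minimal stdlib-only sketch (POSIX systems): a 1-minute load average far below the core count suggests the CPUs are sitting idle waiting on IO rather than being the bottleneck.

```python
# Rough CPU-utilization check when htop is unavailable.
# os.getloadavg() is POSIX-only (Linux/macOS).
import os

def load_per_core():
    """Return the 1-minute load average divided by the core count.

    Values well below 1.0 suggest the CPUs are mostly idle (e.g. blocked
    on disk IO); values near or above 1.0 suggest the data workers are
    actually CPU-bound.
    """
    one_min, _, _ = os.getloadavg()
    return one_min / os.cpu_count()

if __name__ == "__main__":
    print(f"load per core: {load_per_core():.2f}")
```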
Thank you for your prompt reply! I set the number of workers to 16 and the batch size to 4 on each GPU. I cannot check the CPU utilization for some reason. As far as I remember, the training time increases when I decrease the number of workers.
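One way to sanity-check the worker count is to benchmark the per-sample loading in isolation and see where throughput stops scaling. A minimal stdlib sketch, where `load_sample` is a hypothetical stand-in for the real read-and-decode work:

```python
# Sketch: measure loading throughput at different worker counts.
# load_sample is a placeholder; swap in the real per-sample loader.
import time
from concurrent.futures import ThreadPoolExecutor

def load_sample(i):
    # Stand-in for disk read + decode; replace with the real loader.
    time.sleep(0.001)
    return i

def throughput(num_workers, n_samples=200):
    """Return samples/second when loading with num_workers threads."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        list(pool.map(load_sample, range(n_samples)))
    return n_samples / (time.perf_counter() - start)

if __name__ == "__main__":
    for w in (1, 4, 8, 16):
        print(f"{w:2d} workers: {throughput(w):8.1f} samples/s")
```

If throughput plateaus well before 16 workers, adding more workers won't help, which would also point at IO (disk) rather than worker count as the bottleneck.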
Looks like you're IO limited. Using precached images and/or SSDs should help.
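Precaching can be as simple as writing each preprocessed sample to disk the first time it is computed and reloading it afterwards. A minimal sketch, assuming each sample can be preprocessed independently; the cache directory and `compute_fn` are illustrative, not the repo's actual layout:

```python
# Sketch: cache preprocessed (e.g. downscaled) depth maps on disk so the
# expensive full-resolution read + resize happens only once per sample.
import pickle
from pathlib import Path

CACHE_DIR = Path("depth_cache")
CACHE_DIR.mkdir(exist_ok=True)

def load_depth_cached(sample_id, compute_fn):
    """Load a preprocessed depth map from cache, computing it on first use."""
    path = CACHE_DIR / f"{sample_id}.pkl"
    if path.exists():
        return pickle.loads(path.read_bytes())
    depth = compute_fn(sample_id)  # expensive: read full-res + downscale
    path.write_bytes(pickle.dumps(depth))
    return depth
```

Because the cache holds only the low-resolution maps, the disk-space cost is far smaller than caching full-resolution data, and each epoch after the first reads compact files instead of redoing the preprocessing.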
Thank you for your great work!
I have a question about the training time: How long does it take to train the network?
In my understanding, it takes 36 hours for 9 epochs on 2 A100 GPUs; is that correct? But when I run the code, it takes around 500 hours on 8 V100 GPUs. One difference is that I don't have cached low-resolution depth maps, because they take too much disk space. Can that make such a big difference? Do you have any suggestions?
Thank you for your help!