cnexah closed this issue 1 year ago
Not having cached images would make a difference for sure, especially if you're storing those disks on a hard drive.
We've included handy conversions for those images.
500 hours is a little suspicious though. Do you have enough data workers? Do you have too many? Can you check what htop says the CPU utilization is?
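If `htop` itself won't run, a rough check is possible from Python alone. This is a minimal stdlib-only sketch (POSIX systems): a 1-minute load average far below the core count suggests the CPUs are sitting idle waiting on IO rather than being the bottleneck.

```python
# Rough CPU-utilization check when htop is unavailable.
# os.getloadavg() is POSIX-only (Linux/macOS).
import os

def load_per_core():
    """Return the 1-minute load average divided by the core count.

    Values well below 1.0 suggest the CPUs are mostly idle (e.g. blocked
    on disk IO); values near or above 1.0 suggest the data workers are
    actually CPU-bound.
    """
    one_min, _, _ = os.getloadavg()
    return one_min / os.cpu_count()

if __name__ == "__main__":
    print(f"load per core: {load_per_core():.2f}")
```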
Thank you for your prompt reply! I set the number of workers to 16 and the batch size to 4 on each GPU. I cannot check the CPU utilization for some reason. As far as I remember, the training time increases when I decrease the number of workers.
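One way to sanity-check the worker count is to benchmark the per-sample loading in isolation and see where throughput stops scaling. A minimal stdlib sketch, where `load_sample` is a hypothetical stand-in for the real read-and-decode work:

```python
# Sketch: measure loading throughput at different worker counts.
# load_sample is a placeholder; swap in the real per-sample loader.
import time
from concurrent.futures import ThreadPoolExecutor

def load_sample(i):
    # Stand-in for disk read + decode; replace with the real loader.
    time.sleep(0.001)
    return i

def throughput(num_workers, n_samples=200):
    """Return samples/second when loading with num_workers threads."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        list(pool.map(load_sample, range(n_samples)))
    return n_samples / (time.perf_counter() - start)

if __name__ == "__main__":
    for w in (1, 4, 8, 16):
        print(f"{w:2d} workers: {throughput(w):8.1f} samples/s")
```

If throughput plateaus well before 16 workers, adding more workers won't help, which would also point at IO (disk) rather than worker count as the bottleneck.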
Looks like you're IO limited. Using precached images and/or SSDs should help.
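Precaching can be as simple as writing each preprocessed sample to disk the first time it is computed and reloading it afterwards. A minimal sketch, assuming each sample can be preprocessed independently; the cache directory and `compute_fn` are illustrative, not the repo's actual layout:

```python
# Sketch: cache preprocessed (e.g. downscaled) depth maps on disk so the
# expensive full-resolution read + resize happens only once per sample.
import pickle
from pathlib import Path

CACHE_DIR = Path("depth_cache")
CACHE_DIR.mkdir(exist_ok=True)

def load_depth_cached(sample_id, compute_fn):
    """Load a preprocessed depth map from cache, computing it on first use."""
    path = CACHE_DIR / f"{sample_id}.pkl"
    if path.exists():
        return pickle.loads(path.read_bytes())
    depth = compute_fn(sample_id)  # expensive: read full-res + downscale
    path.write_bytes(pickle.dumps(depth))
    return depth
```

Because the cache holds only the low-resolution maps, the disk-space cost is far smaller than caching full-resolution data, and each epoch after the first reads compact files instead of redoing the preprocessing.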
Thank you for your great work!
I have a question about the training time: How long does it take to train the network?
In my understanding, it takes 36 hours for 9 epochs on 2 A100 GPUs; is that correct? But when I run the code, it takes around 500 hours on 8 V100 GPUs. One difference is that I don't have cached low-resolution depth maps, because they take too much disk space. Can that make such a big difference? Do you have any suggestions?
Thank you for your help!