kenshohara / 3D-ResNets-PyTorch

3D ResNets for Action Recognition (CVPR 2018)
MIT License

Slow dataloading every N batches, where N=num_threads #252

Open exnx opened 3 years ago

exnx commented 3 years ago

I'm trying to train a 3D ResNet model for classification, but the training time shows a strange pattern. I have 28 CPUs and use 28 num_workers for data loading, and every 28th iteration takes about a minute, while iterations 1-27 take roughly 0 seconds. I tried different values of num_workers and the pattern is the same: every Nth iteration (where N = num_workers) is held up an order of magnitude longer, while the other iterations are very fast. Is anybody familiar with this kind of behavior?

I know it's the data loading that is held up, because the training code breaks out total batch time and data loading time separately. My guess is that the dataloader is waiting for all the workers/threads to finish before proceeding. Does anybody have any remedies for this?
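For context, here is a minimal sketch of the kind of per-iteration timing I mean, with data-loading time measured separately from total batch time (`data_loader`, `model`, `criterion`, and `optimizer` are placeholders for whatever the training script defines):

```python
import time

# Sketch: measure how long the DataLoader makes us wait vs. the full iteration.
end = time.time()
for i, (inputs, targets) in enumerate(data_loader):
    data_time = time.time() - end          # time spent waiting on the DataLoader

    outputs = model(inputs)
    loss = criterion(outputs, targets)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    batch_time = time.time() - end         # total time for this iteration
    end = time.time()
    print(f'iter {i}: data {data_time:.3f}s, batch {batch_time:.3f}s')
```

With 28 workers, the data time is near zero for 27 iterations and then spikes on the 28th, which is what made me suspect the workers.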

guilhermesurek commented 3 years ago

Hello @exnx, I do not have a definitive answer, but I can share what I went through.

First I tried to understand CPU/GPU training time for a fixed batch size, then how data loading time changes as num_workers goes up and down, and finally how training and data loading time vary with batch_size.

Generally, your goal is to make the best use of training time, since compute is the most limited resource. To do this, track the percentage of CPU/GPU usage. The more workers you have, the more input your CPU/GPU receives without having to wait for data to load (when you say the 28th iteration takes 1 minute, that is the workers loading the data). However, RAM starts to become a problem unless you have a lot of it. So you will have to balance batch_size and num_workers to achieve this within the resources you have, along with any other goals you may have for batch_size.

PS: I think you should have at least 4 workers per CPU.
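For reference, a minimal sketch of the DataLoader knobs being discussed (assuming PyTorch >= 1.7 for `persistent_workers`/`prefetch_factor`; `train_dataset` is a placeholder for the actual dataset):

```python
from torch.utils.data import DataLoader

train_loader = DataLoader(
    train_dataset,
    batch_size=32,            # trade off against num_workers and available RAM
    shuffle=True,
    num_workers=8,            # more workers hide loading latency but use more RAM
    pin_memory=True,          # faster host-to-GPU copies
    persistent_workers=True,  # keep workers alive between epochs
    prefetch_factor=4,        # batches pre-loaded per worker (default is 2)
)
```

Raising `prefetch_factor` lets workers queue more batches ahead, which can smooth out the periodic stall if individual samples are slow to decode.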

exnx commented 3 years ago

Interesting, thanks for the thoughtful response.

So I am using a cluster at school and have 28 CPUs (or cores?) available. I've been doing 1 worker per CPU, which I thought was the optimum. I can try more, but 4 per CPU sounds super high! I tried fewer workers, and more seems faster overall. I am using 2 GPUs at the moment.
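In case it helps anyone else, here is a rough sketch of how I could compare settings on this machine by timing a fixed number of batches for different worker counts (`train_dataset` is a placeholder for the video dataset):

```python
import os
import time
from torch.utils.data import DataLoader

# Quick benchmark sketch for picking num_workers on a given machine.
print('visible CPUs:', os.cpu_count())

for workers in (4, 8, 16, 28):
    loader = DataLoader(train_dataset, batch_size=32, shuffle=True,
                        num_workers=workers, pin_memory=True)
    start = time.time()
    for i, _ in enumerate(loader):
        if i == 100:              # time the first 100 batches only
            break
    print(f'num_workers={workers}: {time.time() - start:.1f}s for 100 batches')
```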