Closed — Lily-Le closed this issue 6 months ago
Hey @Lily-Le. Is this happening only for a specific method or for all of them? How are your CPU and RAM usage?
Is the 1024 batch size per GPU or in total? Maybe your CPU/RAM/disk combination is having trouble with the amount of data being cached.
Hi. Thanks for the quick reply! :D
The setting is a 256 batch size per GPU, 1024 in total. When loading data, GPU utilization is at 100% while CPU usage stays low.
I figured it has something to do with my DALI installation under CUDA 12. The same code works well on ImageNet on my original server with CUDA 11 at smaller batch sizes. On the new server with CUDA 12, the CIFAR dataset works fine, while ImageNet-100 runs into the same problem.
Thanks so much! Have a nice day. :D
Problem solved. It was caused by a communication problem in multi-GPU training: the new server does not support NVLink. In my case I needed to set NCCL_P2P_LEVEL=NVL,
and then everything went back to normal.
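For anyone hitting the same hang, NCCL_P2P_LEVEL=NVL tells NCCL to use peer-to-peer GPU transfers only between NVLink-connected pairs, so on a machine without NVLink it effectively disables direct P2P and routes traffic through host memory instead. A minimal sketch of setting it before launch (the launch command and script name below are placeholders; substitute your own entry point):

```shell
# Restrict NCCL peer-to-peer (P2P) transfers to NVLink-connected GPU pairs.
# On a server without NVLink this disables direct GPU-to-GPU P2P, which
# avoids the hang caused by broken P2P paths over PCIe.
export NCCL_P2P_LEVEL=NVL

# Then launch multi-GPU training as usual (illustrative placeholder):
# torchrun --nproc_per_node=4 main_pretrain.py --batch_size 256
```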
Hi, thanks for your work on self-supervised learning.
I tried to pretrain a model on the ImageNet dataset with a 1024 batch size, but the program gets stuck before the sanity check. It works well on some servers (e.g., 3090 GPUs with Intel 8350C CPUs), but on others it hangs. I suspect it may have something to do with the CPU type, but setting num_workers to 0 does not help.
What may be the possible reasons for it? And what are the possible solutions? Thanks a lot!
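When a multi-GPU run hangs before the sanity check, NCCL's own debug output often shows whether the ranks are stuck during communicator setup. A hedged diagnostic sketch (the environment variables are documented NCCL/CUDA settings; the launch command is a placeholder for your own script):

```shell
# Print NCCL initialization and network details; if the ranks hang while
# setting up communicators, it usually shows up in this output.
export NCCL_DEBUG=INFO
export NCCL_DEBUG_SUBSYS=INIT,NET

# Optional: make CUDA errors surface at the failing call instead of later.
export CUDA_LAUNCH_BLOCKING=1

# Re-run the hanging job (illustrative placeholder):
# torchrun --nproc_per_node=4 main_pretrain.py --batch_size 256
```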