vturrisi / solo-learn

solo-learn: a library of self-supervised methods for visual representation learning powered by PyTorch Lightning
MIT License

Code freezes before sanity checking #383

Closed Lily-Le closed 6 months ago

Lily-Le commented 6 months ago

Hi, thanks for your work on self-supervised learning.

I tried to pretrain a model on the ImageNet dataset with a batch size of 1024, but the program gets stuck before sanity checking. It works well on some servers (e.g., 3090, Intel 8350C), but on others it hangs. I think it may have something to do with the CPU type, but setting num_workers to 0 does not help.

What could be causing this, and what are the possible solutions? Thanks a lot!


vturrisi commented 6 months ago

Hey @Lily-Le. Is this happening only for a specific method or for all of them? How are your CPU and RAM usage?

Is the 1024 batch size per GPU or in total? Maybe your CPU/RAM/disk combination is having trouble with the amount of data being cached.

Lily-Le commented 6 months ago

Hi. Thanks for the quick reply! :D
The batch size is 256 per GPU and 1024 in total. While loading data, GPU utilization is 100% and CPU usage is low.

I think it has something to do with my DALI installation under CUDA 12. The same code works well on ImageNet on my original server with CUDA 11 and smaller batches. On the new server with CUDA 12, the CIFAR dataset works well, while ImageNet-100 runs into the same problem.

Thanks so much! Have a nice day. :D

Lily-Le commented 6 months ago

Problem solved. It was caused by a communication problem in multi-GPU training: the new server does not support NVLink. In my case, I needed to set NCCL_P2P_LEVEL=NVL, and then everything went back to normal.
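
For anyone hitting the same hang, here is a minimal sketch of one way to apply this workaround from Python. It assumes the variable has to be in the environment before NCCL is initialized, so the call belongs at the very top of the launch script, before the trainer starts. The helper name `configure_nccl_p2p` is hypothetical and not part of solo-learn. With `NVL`, NCCL only uses direct GPU-to-GPU (P2P) transfers between GPUs connected through NVLink, so on a machine without NVLink it falls back to other transports.

```python
import os


def configure_nccl_p2p(level: str = "NVL") -> str:
    """Set NCCL_P2P_LEVEL if the user has not set it already.

    Hypothetical helper (not part of solo-learn). Returns the value
    that is in effect after the call, so it can be logged.
    """
    # setdefault keeps a value exported in the shell and only fills
    # in the default when the variable is missing.
    return os.environ.setdefault("NCCL_P2P_LEVEL", level)


if __name__ == "__main__":
    # Call this before any distributed setup; child processes started
    # afterwards inherit the environment.
    effective = configure_nccl_p2p()
    print(f"NCCL_P2P_LEVEL={effective}")
```

Exporting the variable in the shell before launching training should have the same effect. The helper only makes the setting explicit in the script and leaves any value already exported untouched.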