Open vedantroy opened 2 years ago
cc: @ejguan @VitalyFedyunin
I am not sure if https://github.com/pytorch/pytorch/pull/85279 addresses this. I will defer to the data loader POCs.
This should be unrelated to the worker processes within DataLoader, because the worker processes on your rank 2 should not have been created by that time. Could you please try the PyTorch nightly release to see if the error still persists?
@ejguan I will try it out. One thing to note, which I'm 40% sure might be the issue: my data loaders have different lengths. This means the rank 2 data loader could (for example) finish before the rank 0 data loader. Does your intuition match mine that this could be an issue?
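For context on why per-rank lengths can differ: if samples are sharded round-robin across ranks and the dataset size is not divisible by the world size, some ranks get one extra sample. A minimal sketch (the helper name is mine, not a torchdata API):

```python
def per_rank_counts(n_samples, world_size):
    """Hypothetical helper: how many samples each rank sees under
    round-robin (stride) sharding of n_samples across world_size ranks."""
    return [len(range(rank, n_samples, world_size)) for rank in range(world_size)]

# 10 samples over 3 ranks: rank 0 gets one extra sample,
# so it also runs one extra iteration at epoch end.
print(per_rank_counts(10, 3))  # [4, 3, 3]
```

With DDP, the rank that runs extra iterations can block in a collective while the other ranks have already moved on, which is exactly the epoch-boundary hang being discussed.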
Then it happens at the beginning of the second epoch. I would recommend attaching datapipe.fullsync() at the end of your pipeline; it is newly introduced in torchdata and will synchronize the length of data across ranks.
See: https://github.com/pytorch/data/blob/9ad8efb476baab2fae4435bcb8923b6cd2c828f1/torchdata/datapipes/iter/util/prefetch.py#L108-L109
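For intuition, the effect of fullsync is roughly that every rank agrees to stop after the shortest rank's number of batches (the real datapipe does this with collective communication at iteration time; the plain-Python sketch below only mimics the resulting semantics and is not the actual implementation):

```python
def fullsync_semantics(per_rank_batches):
    """Sketch of what fullsync achieves: truncate every rank's stream to
    the global minimum length, so no rank blocks waiting for peers that
    have already exhausted their data."""
    shortest = min(len(batches) for batches in per_rank_batches)
    return [batches[:shortest] for batches in per_rank_batches]

# Three ranks with uneven batch counts: after "syncing", all yield 3 batches.
ranks = [[0, 1, 2, 3], [10, 11, 12], [20, 21, 22]]
print(fullsync_semantics(ranks))
```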
🐛 Describe the bug
On an 8xA100-40GB SXM machine, with 6 workers per training process, after a few hours I get the following error:
Here's what happens prior to this error:

- `nvidia-smi` output (screenshot)
- In `htop`, 7 cores (not always the same cores) out of my 124 cores are constantly at 100%, while the rest are barely active. It's vaguely suspicious that the number of CPU cores at 100% is equal to the number of workers + 1.
- If I send kill -s SIGUSR1 <pid> to a dataloader worker pid for the rank 2 GPU, I will get the response "no process with this pid" (or whatever the exact error is). However, if I run that command with the pid of a dataloader worker for a different rank, I won't receive an error.

Versions
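Aside: sending SIGUSR1 is not harmless if the process turns out to be alive. To probe liveness without delivering any signal, POSIX `kill` accepts signal 0, which performs only the existence/permission check (the pid below is a placeholder for a worker pid):

```shell
pid=12345   # placeholder: substitute a dataloader worker pid
if kill -0 "$pid" 2>/dev/null; then
    echo "process $pid is alive"
else
    echo "process $pid is gone (or not ours)"
fi
```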
PyTorch version: 1.12.1+cu116
Is debug build: False
CUDA used to build PyTorch: 11.6
ROCM used to build PyTorch: N/A
OS: Ubuntu 20.04.4 LTS (x86_64)
GCC version: (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0
Clang version: Could not collect
CMake version: Could not collect
Libc version: glibc-2.31
Python version: 3.8.13 (default, Mar 28 2022, 11:38:47) [GCC 7.5.0] (64-bit runtime)
Python platform: Linux-5.15.0-46-generic-x86_64-with-glibc2.17
Is CUDA available: True
CUDA runtime version: 11.6.124
GPU models and configuration:
GPU 0: NVIDIA A100-SXM4-40GB
GPU 1: NVIDIA A100-SXM4-40GB
GPU 2: NVIDIA A100-SXM4-40GB
GPU 3: NVIDIA A100-SXM4-40GB
GPU 4: NVIDIA A100-SXM4-40GB
GPU 5: NVIDIA A100-SXM4-40GB
GPU 6: NVIDIA A100-SXM4-40GB
GPU 7: NVIDIA A100-SXM4-40GB
Nvidia driver version: 510.47.03
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True
Versions of relevant libraries:
[pip3] numpy==1.22.4
[pip3] pytorch-ranger==0.1.1
[pip3] torch==1.12.1+cu116
[pip3] torch-optimizer==0.1.0
[pip3] torchdata==0.4.1
[pip3] torchmetrics==0.7.3
[pip3] torchvision==0.13.1a0+bddbd7e
[conda] Could not collect
Pillow/Pillow-SIMD version: 7.0.0.post3 (postfix means using pillow-simd)
cc @pietern @mrshenli @pritamdamania87 @zhaojuanmao @satgera @rohan-varma @gqchen @aazzolini @osalpekar @jiayisuse @SciPioneer @H-Huang @kwen2501 @SsnL @VitalyFedyunin @ejguan @NivekT