pytorch / pytorch

Tensors and Dynamic neural networks in Python with strong GPU acceleration
https://pytorch.org

Timed out receiving the shared seed from the distribtued store on Rank 2 #85775

Open vedantroy opened 2 years ago

vedantroy commented 2 years ago

🐛 Describe the bug

On an 8xA100-40GB SXM machine, with 6 workers per training process, I get the following error after a few hours:

contrastive_train-contrastive_train-1  |     trainer.fit()                                                                                                                                              [0/1942]
contrastive_train-contrastive_train-1  |   File "/root/micromamba/envs/video-rec/lib/python3.8/site-packages/composer/trainer/trainer.py", line 1386, in fit
contrastive_train-contrastive_train-1  |     self._train_loop()                                                                                                                                                 
contrastive_train-contrastive_train-1  |   File "/root/micromamba/envs/video-rec/lib/python3.8/site-packages/composer/trainer/trainer.py", line 1512, in _train_loop
contrastive_train-contrastive_train-1  |     for batch_idx, self.state.batch in enumerate(self._iter_dataloader(TrainerMode.TRAIN)):
contrastive_train-contrastive_train-1  |   File "/root/micromamba/envs/video-rec/lib/python3.8/site-packages/composer/trainer/trainer.py", line 2194, in _iter_dataloader
contrastive_train-contrastive_train-1  |     dataloader_iter = iter(self.state.dataloader)              
contrastive_train-contrastive_train-1  |   File "/root/micromamba/envs/video-rec/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 444, in __iter__
contrastive_train-contrastive_train-1  |     return self._get_iterator()                                
contrastive_train-contrastive_train-1  |   File "/root/micromamba/envs/video-rec/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 390, in _get_iterator
contrastive_train-contrastive_train-1  |     return _MultiProcessingDataLoaderIter(self)                
contrastive_train-contrastive_train-1  |   File "/root/micromamba/envs/video-rec/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1038, in __init__
contrastive_train-contrastive_train-1  |     super(_MultiProcessingDataLoaderIter, self).__init__(loader)
contrastive_train-contrastive_train-1  |   File "/root/micromamba/envs/video-rec/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 623, in __init__
contrastive_train-contrastive_train-1  |     self._shared_seed = loader._get_shared_seed()              
contrastive_train-contrastive_train-1  |   File "/root/micromamba/envs/video-rec/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 604, in _get_shared_seed
contrastive_train-contrastive_train-1  |     raise RuntimeError("Timed out receiving the shared seed from the distribtued store "
contrastive_train-contrastive_train-1  | RuntimeError: Timed out receiving the shared seed from the distribtued store on Rank 2. (world_size=8, timeout=1800)
contrastive_train-contrastive_train-1  | ----------End global rank 2 STDERR----------ERROR:composer.cli.launcher:Global rank 0 (PID 162) exited with code -15

Here's what happens prior to this error:

Versions

PyTorch version: 1.12.1+cu116
Is debug build: False
CUDA used to build PyTorch: 11.6
ROCM used to build PyTorch: N/A
OS: Ubuntu 20.04.4 LTS (x86_64)
GCC version: (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0
Clang version: Could not collect
CMake version: Could not collect
Libc version: glibc-2.31
Python version: 3.8.13 (default, Mar 28 2022, 11:38:47) [GCC 7.5.0] (64-bit runtime)
Python platform: Linux-5.15.0-46-generic-x86_64-with-glibc2.17
Is CUDA available: True
CUDA runtime version: 11.6.124
GPU models and configuration:
GPU 0: NVIDIA A100-SXM4-40GB
GPU 1: NVIDIA A100-SXM4-40GB
GPU 2: NVIDIA A100-SXM4-40GB
GPU 3: NVIDIA A100-SXM4-40GB
GPU 4: NVIDIA A100-SXM4-40GB
GPU 5: NVIDIA A100-SXM4-40GB
GPU 6: NVIDIA A100-SXM4-40GB
GPU 7: NVIDIA A100-SXM4-40GB

Nvidia driver version: 510.47.03
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

Versions of relevant libraries:
[pip3] numpy==1.22.4
[pip3] pytorch-ranger==0.1.1
[pip3] torch==1.12.1+cu116
[pip3] torch-optimizer==0.1.0
[pip3] torchdata==0.4.1
[pip3] torchmetrics==0.7.3
[pip3] torchvision==0.13.1a0+bddbd7e
[conda] Could not collect

Pillow/Pillow-SIMD version: 7.0.0.post3 (postfix means using pillow-simd)

cc @pietern @mrshenli @pritamdamania87 @zhaojuanmao @satgera @rohan-varma @gqchen @aazzolini @osalpekar @jiayisuse @SciPioneer @H-Huang @kwen2501 @SsnL @VitalyFedyunin @ejguan @NivekT

awgu commented 2 years ago

cc: @ejguan @VitalyFedyunin

I am not sure if https://github.com/pytorch/pytorch/pull/85279 addresses this. I will defer to the data loader POCs.

ejguan commented 2 years ago

This should be unrelated to the worker processes within DataLoader, because the worker processes on your rank 2 shouldn't have been created yet at that point. Could you please try the PyTorch nightly release to see if the error still persists?

vedantroy commented 2 years ago

@ejguan I will try it out. One thing to note, which I'm 40% sure might be the issue, is that my data loaders have different lengths. This means the rank 2 data loader could (for example) finish before the rank 0 data loader. Does your intuition match mine that this could be an issue?
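To make the hypothesis concrete, here is a minimal, hypothetical sketch of that failure mode (made-up lengths and pipeline, not the actual training code): the rank whose datapipe is shortest finishes its epoch first, re-enters iter(dataloader), and waits there for the shared seed while the other ranks are still training; if the skew exceeds the 1800 s timeout, the RuntimeError above is raised.

```python
import torch.distributed as dist
from torch.utils.data import DataLoader
from torchdata.datapipes.iter import IterableWrapper

dist.init_process_group("gloo")        # assumes a torchrun-style launch
rank = dist.get_rank()

n = 1000 if rank == 0 else 800         # hypothetical: this rank's shard is shorter
dp = IterableWrapper(range(n))         # stands in for this rank's real datapipe
loader = DataLoader(dp, batch_size=8, num_workers=6)

for epoch in range(2):
    for batch in loader:               # ranks with less data exhaust their loader first
        pass                           # training step would go here
    # the next iter(loader), at the top of the next epoch, performs the
    # shared-seed exchange; ranks arriving more than 30 minutes apart time out
```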

ejguan commented 2 years ago

Then it happens at the beginning of the second epoch. I would recommend attaching datapipe.fullsync() at the end of your pipeline; it was newly introduced in torchdata and will synchronize the length of the data across ranks. See: https://github.com/pytorch/data/blob/9ad8efb476baab2fae4435bcb8923b6cd2c828f1/torchdata/datapipes/iter/util/prefetch.py#L108-L109
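For reference, a minimal sketch of what attaching fullsync() might look like, assuming a torchdata version that ships FullSync; every stage other than .fullsync() is a placeholder for the real pipeline, and the code is meant to run under an already-initialized distributed process group:

```python
from torch.utils.data import DataLoader
from torchdata.datapipes.iter import IterableWrapper

samples = list(range(1000))                       # placeholder for the real data source
dp = IterableWrapper(samples).shuffle().sharding_filter()
dp = dp.fullsync()                                # stop every rank once the shortest shard
                                                  # is exhausted, so no rank out-waits the others
loader = DataLoader(dp, batch_size=8, num_workers=6)
```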