https://pytorch.org/docs/master/data.html#torch.utils.data.DataLoader
In PyTorch >= 1.7 there is a new feature called persistent_workers (default: False). If True, the data loader will not shut down the worker processes after a dataset has been consumed once.
We will support this feature, as in https://github.com/open-mmlab/mmocr/pull/459.
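For reference, this is roughly how the flag is used on a plain PyTorch DataLoader (a minimal sketch, not MMPose code; the dataset here is just a placeholder):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Toy dataset purely for illustration; any Dataset works the same way.
dataset = TensorDataset(torch.randn(1000, 3, 64, 64))

# With persistent_workers=True (PyTorch >= 1.7), the worker processes are
# kept alive between epochs instead of being re-spawned, which avoids the
# long per-epoch startup cost reported in this issue.
loader = DataLoader(
    dataset,
    batch_size=32,
    num_workers=4,
    persistent_workers=True,
)
```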
So, if I set persistent_workers to True, will this issue be fixed? Thanks.
Please have a try.
Yes, it works. Thank you very much! The time dropped from 180s to 2s. I think you should add this to MMPose; it will speed up training significantly on fast servers.
BTW, I think the flag (find_unused_parameters = cfg.get('find_unused_parameters', True)) should be set to False by default.
Setting find_unused_parameters=True prevents DDP from waiting forever for absent gradients during the backward pass. In most cases, I think it is worth introducing some extra overhead (traversing the autograd graph) to avoid getting stuck.
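A minimal, hypothetical sketch of the behaviour being described (the single-process gloo group and the toy model are only there so the example can run; they are not part of MMPose):

```python
import os
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

# Minimal single-process group so DDP can be constructed in this sketch.
os.environ.setdefault('MASTER_ADDR', '127.0.0.1')
os.environ.setdefault('MASTER_PORT', '29500')
dist.init_process_group('gloo', rank=0, world_size=1)

class Net(nn.Module):
    """Toy model where one branch never contributes to the loss."""
    def __init__(self):
        super().__init__()
        self.backbone = nn.Linear(16, 16)
        self.extra = nn.Linear(16, 16)  # parameters that end up unused

    def forward(self, x):
        return self.backbone(x)  # 'extra' is not used in this forward pass

# With find_unused_parameters=True, DDP traverses the autograd graph after
# each forward pass and marks 'extra' as unused, so the gradient reduction
# does not wait for gradients that will never arrive. The traversal is the
# extra overhead mentioned above.
model = DDP(Net(), find_unused_parameters=True)

out = model(torch.randn(4, 16))
out.sum().backward()  # completes; with find_unused_parameters=False this
                      # typically raises an error about parameters that did
                      # not receive gradients
```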
If I'm right, there will be an error when some model parameters are not in the autograd graph (with find_unused_parameters=False). If we set find_unused_parameters=True, the error disappears. However, having model parameters outside the autograd graph is usually a model bug, and in that case I think we should be told about it. Also, previously, with find_unused_parameters=True, memory consumption was larger and training was slower. (But in MMPose, the memory and speed seem to be the same.)
I found a data-loading bug. I tested it on several servers (including one with 8 A100s and a 96-core CPU), and it happened on all of them. This bug wastes about 3 minutes per epoch. I can only locate the bug, but I don't know why it happens. It seems to happen only when launching in distributed mode.
Bug location: when you launch a top-down method, e.g. topdown_heatmap/coco/res50_coco_256x192.py, go to mmcv/runner/epoch_based_runner.py, around line 48. There is a function like the following:
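The snippet itself is missing from the thread; paraphrased from memory, the relevant part of mmcv's EpochBasedRunner.train looks roughly like this (not an exact copy, and details may differ between mmcv versions):

```python
# Paraphrased sketch of mmcv's EpochBasedRunner.train(); not verbatim.
def train(self, data_loader, **kwargs):
    self.model.train()
    self.mode = 'train'
    self.data_loader = data_loader
    self.call_hook('before_train_epoch')
    # The enumerate() below is where the ~3 min per-epoch stall is observed:
    # without persistent_workers, the DataLoader re-spawns its worker
    # processes (and rebuilds the dataset in each of them) at every epoch.
    for i, data_batch in enumerate(self.data_loader):
        self._inner_iter = i
        self.call_hook('before_train_iter')
        self.run_iter(data_batch, train_mode=True, **kwargs)
        self.call_hook('after_train_iter')
        self._iter += 1
    self.call_hook('after_train_epoch')
    self._epoch += 1
```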
At the beginning of every epoch, the loop `for i, data_batch in enumerate(self.data_loader):` takes about 3 minutes, which makes training very slow.
You can modify the original code to the code below to reproduce this issue; it only happens at the very beginning of each epoch.
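The modified snippet is also missing from the thread; a sketch of the kind of timing change that reproduces the delay could look like this (it assumes the runner loop above and simply times the first batch of each epoch with time.time()):

```python
import time

# Inside EpochBasedRunner.train(): time how long the very first batch of the
# epoch takes to arrive. Without persistent_workers, this first fetch
# includes re-spawning the DataLoader workers, which is where the ~3 min
# delay shows up.
start = time.time()
for i, data_batch in enumerate(self.data_loader):
    if i == 0:
        print(f'first batch of the epoch took {time.time() - start:.1f} s')
    self._inner_iter = i
    self.call_hook('before_train_iter')
    self.run_iter(data_batch, train_mode=True, **kwargs)
    self.call_hook('after_train_iter')
    self._iter += 1
```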
Here is my system information:
Python: 3.7.10 | packaged by conda-forge | (default, Feb 19 2021, 16:07:37) [GCC 9.3.0]
CUDA available: True
GPU 0,1,2,3,4,5,6,7: A100-SXM4-40GB
CUDA_HOME: /usr/local/cuda-11.1
NVCC: Build cuda_11.1.TC455_06.29190527_0
GCC: gcc (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0
PyTorch: 1.8.1+cu111
PyTorch compiling details: PyTorch built with:
TorchVision: 0.9.1+cu111
OpenCV: 4.5.3
MMCV: 1.3.8
MMCV Compiler: GCC 7.5
MMCV CUDA Compiler: 11.1
MMPose: 0.15.0+51b4b45