scutpaul / DANet


possible deadlock in dataloader #8

Closed: nankepan closed this issue 1 year ago

nankepan commented 2 years ago

Hi, when I train a model with num_workers > 1, it sometimes gets stuck on this line:
https://github.com/scutpaul/DANet/blob/f0bc57d9b2641c4dda9ce70e2c6f240ce2789069/test_DAN.py#L137
Debugging shows that it hangs on these two lines:
https://github.com/scutpaul/DANet/blob/f0bc57d9b2641c4dda9ce70e2c6f240ce2789069/libs/dataset/YoutubeVOS.py#L156
https://github.com/scutpaul/DANet/blob/f0bc57d9b2641c4dda9ce70e2c6f240ce2789069/libs/dataset/YoutubeVOS.py#L157
When I train with num_workers=0 it runs normally, but it is very slow.

The problem looks similar to this one: https://github.com/pytorch/pytorch/issues/1355, but none of the methods suggested in that issue fix it for me.
How can I fix this?
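For reference, the workarounds usually suggested in pytorch/pytorch#1355 amount to disabling library-level threading inside the DataLoader workers and changing the tensor sharing strategy. A minimal sketch follows; whether the YoutubeVOS dataset actually uses OpenCV at the lines above is an assumption, so adapt it to whatever image library the dataset calls:

```python
# Sketch of workarounds commonly suggested in pytorch/pytorch#1355; it is an
# assumption (not confirmed here) that the dataset uses OpenCV in __getitem__.
import cv2
import torch.multiprocessing as mp
from torch.utils.data import DataLoader

# Avoid exhausting file descriptors when workers share many small tensors.
mp.set_sharing_strategy("file_system")

def worker_init_fn(worker_id):
    # Disable OpenCV's internal thread pool inside every DataLoader worker;
    # nested threading after fork is a frequent cause of these hangs.
    cv2.setNumThreads(0)

# Pass the hook when building the loader, e.g.:
# loader = DataLoader(dataset, batch_size=8, shuffle=True,
#                     num_workers=4, worker_init_fn=worker_init_fn)
```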

scutpaul commented 2 years ago

Hi, if you want to train the model, you should use train_DAN.py. The default num_workers for training is 4:
https://github.com/scutpaul/DANet/blob/f0bc57d9b2641c4dda9ce70e2c6f240ce2789069/train_DAN.py#L46
https://github.com/scutpaul/DANet/blob/f0bc57d9b2641c4dda9ce70e2c6f240ce2789069/train_DAN.py#L82
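The wiring described above is roughly the following sketch; the argument, variable, and dataset names here are illustrative, and only the default of 4 is taken from the linked lines:

```python
import argparse
import torch
from torch.utils.data import DataLoader, TensorDataset

parser = argparse.ArgumentParser()
parser.add_argument("--num_workers", type=int, default=4,
                    help="number of DataLoader worker processes")
parser.add_argument("--batch_size", type=int, default=8)
args = parser.parse_args()

# Stand-in dataset; in train_DAN.py this would be the YouTube-VOS dataset.
dataset = TensorDataset(torch.zeros(32, 3, 8, 8))

train_loader = DataLoader(
    dataset,
    batch_size=args.batch_size,
    shuffle=True,
    num_workers=args.num_workers,  # default 4, matching the linked lines
)
```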

nankepan commented 2 years ago

I did use train_DAN.py and set num_workers=4. It still sometimes gets stuck.
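One way to make the hang visible instead of a silent freeze is the DataLoader's timeout argument, which raises an error if a worker takes too long to produce a batch. This is only a diagnostic sketch, not a fix, and the 120-second threshold is arbitrary:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Stand-in dataset; substitute the dataset that train_DAN.py actually builds.
dataset = TensorDataset(torch.zeros(32, 3, 8, 8))

# Diagnostic only: if collecting a batch from a worker takes longer than
# `timeout` seconds, the DataLoader raises a RuntimeError instead of hanging
# silently, which at least pinpoints when the deadlock occurs.
loader = DataLoader(dataset, batch_size=8, shuffle=True,
                    num_workers=4, timeout=120)
```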

scutpaul commented 2 years ago

Hi, you can download our conda environment YAML and use it to create the Python environment: FSVOS.yaml.zip