microsoft / VideoX

VideoX: a collection of video cross-modal models

RuntimeError: DataLoader worker (pid 5423) is killed by signal: Segmentation fault. #62

Closed fixedwater closed 2 years ago

fixedwater commented 2 years ago

ERROR: Unexpected segmentation fault encountered in worker.
Traceback (most recent call last):
  File "main.py", line 369, in <module>
    main(config)
  File "main.py", line 121, in main
    train_one_epoch(epoch, model, criterion, optimizer, lr_scheduler, train_loader, text_labels, config, mixup_fn)
  File "main.py", line 203, in train_one_epoch
    scaled_loss.backward()
  File "/root/miniconda3/envs/env/lib/python3.7/contextlib.py", line 119, in __exit__
    next(self.gen)
  File "/root/miniconda3/envs/env/lib/python3.7/site-packages/apex-0.1-py3.7.egg/apex/amp/handle.py", line 123, in scale_loss
  File "/root/miniconda3/envs/env/lib/python3.7/site-packages/apex-0.1-py3.7.egg/apex/amp/_process_optimizer.py", line 249, in post_backward_no_master_weights
  File "/root/miniconda3/envs/env/lib/python3.7/site-packages/apex-0.1-py3.7.egg/apex/amp/_process_optimizer.py", line 135, in post_backward_models_are_masters
  File "/root/miniconda3/envs/env/lib/python3.7/site-packages/apex-0.1-py3.7.egg/apex/amp/scaler.py", line 184, in unscale_with_stashed
  File "/root/miniconda3/envs/env/lib/python3.7/site-packages/apex-0.1-py3.7.egg/apex/amp/scaler.py", line 148, in unscale_with_stashed_python
  File "/root/miniconda3/envs/env/lib/python3.7/site-packages/apex-0.1-py3.7.egg/apex/amp/scaler.py", line 22, in axpby_check_overflow_python
  File "/root/miniconda3/envs/env/lib/python3.7/site-packages/torch/utils/data/_utils/signal_handling.py", line 66, in handler
    _error_if_any_worker_fails()
RuntimeError: DataLoader worker (pid 5423) is killed by signal: Segmentation fault.
Killing subprocess 4388

Hi there, I got the DataLoader error above partway through an epoch, after some iterations. Do you have any idea what causes it? This is the command I ran:

python -m torch.distributed.launch --nproc_per_node=1 main.py -cfg configs/k600/16_8.yaml --output . --accumulation-steps 2 --resume /data/xxxx/xclip/VideoX-master/X-CLIP/pretrained_models/k600_16_8.pth

The batch size is 8.

nbl97 commented 2 years ago

This issue seems to be related to apex, so you can try reinstalling apex or adjusting the DataLoader's num_workers setting.
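One way to act on the num_workers suggestion while debugging is to load data in the main process, so that a crash is no longer hidden inside a worker. The sketch below is only an illustration, not code from this repository: `enable_crash_tracebacks` and `single_process_loader_kwargs` are hypothetical helper names, and the keyword names are those accepted by `torch.utils.data.DataLoader`.

```python
import faulthandler


def enable_crash_tracebacks():
    # Ask CPython to dump the Python stack to stderr on SIGSEGV and
    # other fatal signals. Call this once, early in the training script.
    faulthandler.enable()


def single_process_loader_kwargs(loader_kwargs):
    """Return a copy of DataLoader keyword arguments for in-process loading.

    Hypothetical helper: the caller passes the result to
    torch.utils.data.DataLoader in place of the original arguments.
    """
    debug_kwargs = dict(loader_kwargs)  # leave the caller's dict untouched
    debug_kwargs["num_workers"] = 0     # load batches in the main process
    # These options only apply when worker processes are used, and
    # DataLoader rejects them when num_workers is 0.
    debug_kwargs.pop("prefetch_factor", None)
    debug_kwargs.pop("persistent_workers", None)
    return debug_kwargs
```

With loading done in-process and `enable_crash_tracebacks()` called first, a segmentation fault inside the dataset code should produce a Python-level stack dump that points at the failing call, at the cost of slower loading while debugging.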

fixedwater commented 2 years ago

Thanks. I tried both suggestions, but the error was still raised. I found it might be related to my dataset.
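If the dataset is the suspect, one possible way to narrow it down is to load each sample in its own child process, so a segmentation fault kills only that child and the offending index can be recorded. This is a rough sketch under stated assumptions, not the method used in this thread: `find_crashing_indices` and `load_sample` are hypothetical names, and the `fork` start method assumes a POSIX system such as Linux.

```python
import multiprocessing as mp
import signal


def _load_one(load_sample, index):
    # Runs in the child process; a crash here kills only this child.
    load_sample(index)


def find_crashing_indices(load_sample, indices, timeout=60.0):
    """Return (index, reason) pairs for samples that crash, raise, or hang."""
    # Assumption: POSIX. "fork" avoids having to pickle load_sample.
    ctx = mp.get_context("fork")
    bad = []
    for index in indices:
        proc = ctx.Process(target=_load_one, args=(load_sample, index))
        proc.start()
        proc.join(timeout)
        if proc.is_alive():
            # The sample did not finish loading within the timeout.
            proc.kill()
            proc.join()
            bad.append((index, "timeout"))
        elif proc.exitcode < 0:
            # A negative exit code means the child was killed by a signal.
            bad.append((index, signal.Signals(-proc.exitcode).name))
        elif proc.exitcode != 0:
            # A positive exit code usually means an uncaught exception.
            bad.append((index, f"exit code {proc.exitcode}"))
    return bad
```

Here `load_sample` could be something like `lambda i: dataset[i]` for whatever dataset object the training script builds. Starting one process per sample is slow, so it may be worth running this over a subset of indices first.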

nbl97 commented 2 years ago

Hope your project goes well. I'm closing this issue, but pls feel free to ping me if there are further questions.