Closed: Jinming-Su closed this issue 3 years ago.
When I train this code, after some iterations the following error occurs:
2021-03-01 17:21:06,014 | callback.py | line 40 : Batch [8980] Speed: 1.95 samples/sec Train-rpn_cls_loss=0.151620, rpn_bbox_loss=0.572483, rcnn_accuracy=0.862228, cls_loss=0.421914, bbox_loss=0.215375, mask_loss=0.371646, fcn_loss=0.488014, panoptic_accuracy=0.873215, panoptic_loss=0.429976,
Traceback (most recent call last):
  File "upsnet/upsnet_end2end_train.py", line 417, in <module>
    upsnet_train()
  File "upsnet/upsnet_end2end_train.py", line 379, in upsnet_train
    output = train_model(*batch)
  File "/home/.conda/envs/mmdet2.4/lib/python3.8/site-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "upsnet/../lib/utils/data_parallel.py", line 112, in forward
    outputs = self.parallel_apply(replicas, inputs, kwargs)
  File "upsnet/../lib/utils/data_parallel.py", line 125, in parallel_apply
    return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
  File "/home/.conda/envs/mmdet2.4/lib/python3.8/site-packages/torch/nn/parallel/parallel_apply.py", line 77, in parallel_apply
    thread.join()
  File "/home/.conda/envs/mmdet2.4/lib/python3.8/threading.py", line 1011, in join
    self._wait_for_tstate_lock()
  File "/home/.conda/envs/mmdet2.4/lib/python3.8/threading.py", line 1027, in _wait_for_tstate_lock
    elif lock.acquire(block, timeout):
  File "/home/.conda/envs/mmdet2.4/lib/python3.8/site-packages/torch/utils/data/_utils/signal_handling.py", line 66, in handler
    _error_if_any_worker_fails()
I am training inside a Docker container with 4 GPUs. Do you know the cause?
Closed. Removing 'pin_memory' solves this problem.
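For reference, here is a minimal sketch of the workaround, assuming the training script builds its loader with the standard torch.utils.data.DataLoader (the dataset below is a placeholder, not UPSNet's actual dataset class or arguments):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Placeholder dataset standing in for the real training dataset.
dataset = TensorDataset(torch.randn(8, 3, 32, 32),
                        torch.zeros(8, dtype=torch.long))

# Original-style loader: multiple worker processes with pinned host memory.
# loader = DataLoader(dataset, batch_size=2, num_workers=4, pin_memory=True)

# Workaround from this issue: drop pin_memory (it defaults to False),
# so batches are no longer copied into pinned memory by a separate thread.
loader = DataLoader(dataset, batch_size=2, num_workers=4)

for images, labels in loader:
    pass  # training step would go here
```

This trades a slightly slower host-to-GPU transfer for avoiding the worker failure that aborted training.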