open-mmlab / mmtracking

OpenMMLab Video Perception Toolbox. It supports Video Object Detection (VID), Multiple Object Tracking (MOT), Single Object Tracking (SOT), Video Instance Segmentation (VIS) with a unified framework.
https://mmtracking.readthedocs.io/en/latest/
Apache License 2.0

Received the following error upon using dist_train with 8 GPUs: _pickle.UnpicklingError: pickle data was truncated. #350

Open khurramHashmi opened 2 years ago

khurramHashmi commented 2 years ago

Using any configuration with dist_train on 8 GPUs gives the following error: UnpicklingError: pickle data was truncated

GT9505 commented 2 years ago

Please use error_report.md to report the bug.

khurramHashmi commented 2 years ago

name: Error report
about: Create a report to help us improve
title: 'UnpicklingError: pickle data was truncated.'
labels: ''
assignees: ''


Checklist

  1. I have searched related issues but cannot get the expected help.
  2. The bug has not been fixed in the latest version.

Using any configuration with dist_train on 8 GPUs gives the following error: UnpicklingError: pickle data was truncated

Reproduction

  1. Command : CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 PORT=29502 tools/dist_train.sh configs/vid/temporal_roi_align/selsa_troialign_faster_rcnn_r50_dc5_7e_imagenetvid.py 8

  2. Did you make any modifications on the code or config? Did you understand what you have modified? No modifications

  3. What dataset did you use and what task did you run? ImageNet Vid val set.

Environment

sys.platform: linux
Python: 3.7.9 (default, Aug 31 2020, 12:42:55) [GCC 7.3.0]
CUDA available: True
GPU 0: GeForce GTX 1080 Ti
GPU 1: GeForce GT 1030
CUDA_HOME: /usr/local/cuda
NVCC: Cuda compilation tools, release 10.2, V10.2.89
GCC: gcc (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0
PyTorch: 1.7.0
PyTorch compiling details: PyTorch built with:
TorchVision: 0.8.0
OpenCV: 4.5.1
MMCV: 1.3.8
MMCV Compiler: GCC 7.3
MMCV CUDA Compiler: 10.2
MMTracking: 0.8.0+4d68645

Error traceback

ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -9) local_rank: 1 (pid: 2850996) of binary: /opt/conda/bin/python
ERROR:torch.distributed.elastic.agent.server.local_elastic_agent:[default] Worker group failed
INFO:torch.distributed.elastic.agent.server.api:[default] Worker group FAILED. 3/3 attempts left; will restart worker group
INFO:torch.distributed.elastic.agent.server.api:[default] Stopping worker group
INFO:torch.distributed.elastic.agent.server.api:[default] Rendezvous'ing worker group
INFO:torch.distributed.elastic.agent.server.api:[default] Rendezvous complete for workers. Result: restart_count=1 master_addr=127.0.0.1 master_port=29505 group_rank=0 group_world_size=1 local_ranks=[0, 1, 2, 3, 4, 5, 6, 7] role_ranks=[0, 1, 2, 3, 4, 5, 6, 7] global_ranks=[0, 1, 2, 3, 4, 5, 6, 7] role_world_sizes=[8, 8, 8, 8, 8, 8, 8, 8] global_world_sizes=[8, 8, 8, 8, 8, 8, 8, 8]
INFO:torch.distributed.elastic.agent.server.api:[default] Starting worker group

khurramHashmi commented 2 years ago

@GT9505 Could you please respond?

GT9505 commented 2 years ago

Please fill in the Error traceback. Also, what command did you run? In the Reproduction you use tools/dist_test.sh with 4 GPUs, while in the Checklist you say dist_train on 8 GPUs gives the error. Please make the error information clearer.

Generally, UnpicklingError: pickle data was truncated means that the file loaded with pickle.load is damaged or incomplete. You can try regenerating the file.
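If you want to rule that out, here is a minimal sketch for checking that a pickled file loads cleanly; the path below is a hypothetical placeholder, so point it at the .pkl file your config actually loads:

```python
# Minimal sketch: verify that a pickled file can be fully loaded.
import pickle

pkl_path = 'path/to/annotations.pkl'  # hypothetical placeholder path

with open(pkl_path, 'rb') as f:
    try:
        data = pickle.load(f)  # raises UnpicklingError if the file is truncated
        print('Loaded OK:', type(data))
    except pickle.UnpicklingError as err:
        print('File appears damaged; try regenerating it:', err)
```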

khurramHashmi commented 2 years ago

I have corrected the command. The pickle file seems to be fine, since it works without issue on 4 GPUs. The problem only appears with distributed training on 8 GPUs.

GT9505 commented 2 years ago

This is very weird. As mentioned here, when setting num_worker_per_gpu=1 you can run distributed training on 8 GPUs, while setting num_worker_per_gpu=2 makes distributed training on 8 GPUs raise the error.
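For reference, a minimal sketch of how the per-GPU dataloader worker count is typically lowered in an mmdetection-style config (which mmtracking configs follow); the key name workers_per_gpu and the _base_ path below are assumptions, so check what your config actually defines:

```python
# Sketch of a config override that lowers the dataloader workers per GPU.
# Assumptions: the base config uses an mmdetection-style `data` dict, and the
# relative path below is a hypothetical placeholder.
_base_ = ['./selsa_troialign_faster_rcnn_r50_dc5_7e_imagenetvid.py']

data = dict(
    workers_per_gpu=1,  # fewer dataloader worker processes per GPU
)
```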

khurramHashmi commented 2 years ago

Indeed, it is weird. The even weirder part is that it used to work with 8 GPUs before. Also, I have tried num_worker_per_gpu=1; it did not work.