UnpicklingError: pickle data was truncated.
khurramHashmi opened this issue 2 years ago
Checklist
- I have searched related issues but cannot get the expected help.
- The bug has not been fixed in the latest version.
Using any configuration with dist_train on 8 GPUs gives the following error: UnpicklingError: pickle data was truncated
Reproduction
Command : CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 PORT=29502 tools/dist_train.sh configs/vid/temporal_roi_align/selsa_troialign_faster_rcnn_r50_dc5_7e_imagenetvid.py 8
Did you make any modifications on the code or config? Did you understand what you have modified?
No modifications.

What dataset did you use and what task did you run?
ImageNet VID val set.
Environment
sys.platform: linux
Python: 3.7.9 (default, Aug 31 2020, 12:42:55) [GCC 7.3.0]
CUDA available: True
GPU 0: GeForce GTX 1080 Ti
GPU 1: GeForce GT 1030
CUDA_HOME: /usr/local/cuda
NVCC: Cuda compilation tools, release 10.2, V10.2.89
GCC: gcc (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0
PyTorch: 1.7.0
PyTorch compiling details: PyTorch built with:
TorchVision: 0.8.0
OpenCV: 4.5.1
MMCV: 1.3.8
MMCV Compiler: GCC 7.3
MMCV CUDA Compiler: 10.2
MMTracking: 0.8.0+4d68645
Error traceback
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -9) local_rank: 1 (pid: 2850996) of binary: /opt/conda/bin/python
ERROR:torch.distributed.elastic.agent.server.local_elastic_agent:[default] Worker group failed
INFO:torch.distributed.elastic.agent.server.api:[default] Worker group FAILED. 3/3 attempts left; will restart worker group
INFO:torch.distributed.elastic.agent.server.api:[default] Stopping worker group
INFO:torch.distributed.elastic.agent.server.api:[default] Rendezvous'ing worker group
INFO:torch.distributed.elastic.agent.server.api:[default] Rendezvous complete for workers. Result: restart_count=1 master_addr=127.0.0.1 master_port=29505 group_rank=0 group_world_size=1 local_ranks=[0, 1, 2, 3, 4, 5, 6, 7] role_ranks=[0, 1, 2, 3, 4, 5, 6, 7] global_ranks=[0, 1, 2, 3, 4, 5, 6, 7] role_world_sizes=[8, 8, 8, 8, 8, 8, 8, 8] global_world_sizes=[8, 8, 8, 8, 8, 8, 8, 8]
INFO:torch.distributed.elastic.agent.server.api:[default] Starting worker group
@GT9505 Could you please respond?
Please fill in the Error traceback.
Besides, what command did you run? In the Reproduction section you use tools/dist_test.sh with 4 GPUs, while in the Checklist you use dist_train on 8 GPUs to get the error.
Please make the error information clearer.
Generally, UnpicklingError: pickle data was truncated means the file is damaged or was only partially written, so pickle.load fails when loading it. You can try regenerating the file.
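For reference, here is a minimal sketch (with a hypothetical file path) of how you could check whether a pickle file loads cleanly before launching training:

```python
import pickle

# Hypothetical path -- point this at the .pkl file your config actually loads.
pkl_path = 'path/to/annotations.pkl'

try:
    with open(pkl_path, 'rb') as f:
        data = pickle.load(f)
    print('Loaded OK:', type(data))
except (pickle.UnpicklingError, EOFError) as exc:
    # A truncated or partially written file typically fails here.
    print('Pickle file appears damaged:', exc)
```

If the file loads without error, the data on disk is intact and the truncation is happening somewhere else, e.g. while worker processes exchange data during distributed training.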
I have corrected the command. The pickle file seems to be fine, since everything works with 4 GPUs. The problem only appears in distributed training with 8 GPUs.
This is very weird.
As mentioned here, when setting num_worker_per_gpu=1 you can run distributed training with 8 GPUs, while setting num_worker_per_gpu=2 makes distributed training on 8 GPUs raise the error.
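For context, this setting normally lives in the data section of the config. The sketch below follows the standard MMTracking/MMDetection config layout; the exact key name (workers_per_gpu) is an assumption based on that layout and may differ across versions:

```python
# Sketch of the relevant part of an MMTracking-style config (assumed layout).
# workers_per_gpu controls how many DataLoader worker processes each GPU spawns;
# per the report above, 2 workers per GPU triggers the UnpicklingError on 8 GPUs,
# while 1 worker per GPU runs.
data = dict(
    samples_per_gpu=1,   # images per GPU per iteration
    workers_per_gpu=1,   # DataLoader workers per GPU (drop to 1 if 2 fails)
    # train=..., val=..., test=...  -- keep the rest of the original config unchanged
)
```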
Indeed, it is weird. The even weirder part is that it used to work with GPUs before.
Also, I have tried with num_worker_per_gpu=1. It did not work.