open-mmlab / mmtracking

OpenMMLab Video Perception Toolbox. It supports Video Object Detection (VID), Multiple Object Tracking (MOT), Single Object Tracking (SOT), Video Instance Segmentation (VIS) with a unified framework.
https://mmtracking.readthedocs.io/en/latest/
Apache License 2.0

Received the following error upon using dist_train with 8 GPUs: _pickle.UnpicklingError: pickle data was truncated. #350

Open khurramHashmi opened 2 years ago

khurramHashmi commented 2 years ago

Using any configuration with dist_train on 8 GPUs gives the following error: UnpicklingError: pickle data was truncated

GT9505 commented 2 years ago

Please use error_report.md to report the bug.

khurramHashmi commented 2 years ago

name: Error report
about: Create a report to help us improve
title: 'UnpicklingError: pickle data was truncated.'
labels: ''
assignees: ''


Checklist

  1. I have searched related issues but cannot get the expected help.
  2. The bug has not been fixed in the latest version.

Using any configuration with dist_train on 8 GPUs gives the following error: UnpicklingError: pickle data was truncated

Reproduction

  1. Command : CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 PORT=29502 tools/dist_train.sh configs/vid/temporal_roi_align/selsa_troialign_faster_rcnn_r50_dc5_7e_imagenetvid.py 8

  2. Did you make any modifications on the code or config? Did you understand what you have modified? No modifications

  3. What dataset did you use and what task did you run? ImageNet Vid val set.

Environment

sys.platform: linux
Python: 3.7.9 (default, Aug 31 2020, 12:42:55) [GCC 7.3.0]
CUDA available: True
GPU 0: GeForce GTX 1080 Ti
GPU 1: GeForce GT 1030
CUDA_HOME: /usr/local/cuda
NVCC: Cuda compilation tools, release 10.2, V10.2.89
GCC: gcc (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0
PyTorch: 1.7.0
PyTorch compiling details: PyTorch built with:
TorchVision: 0.8.0
OpenCV: 4.5.1
MMCV: 1.3.8
MMCV Compiler: GCC 7.3
MMCV CUDA Compiler: 10.2
MMTracking: 0.8.0+4d68645

Error traceback

ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -9) local_rank: 1 (pid: 2850996) of binary: /opt/conda/bin/python
ERROR:torch.distributed.elastic.agent.server.local_elastic_agent:[default] Worker group failed
INFO:torch.distributed.elastic.agent.server.api:[default] Worker group FAILED. 3/3 attempts left; will restart worker group
INFO:torch.distributed.elastic.agent.server.api:[default] Stopping worker group
INFO:torch.distributed.elastic.agent.server.api:[default] Rendezvous'ing worker group
INFO:torch.distributed.elastic.agent.server.api:[default] Rendezvous complete for workers. Result: restart_count=1 master_addr=127.0.0.1 master_port=29505 group_rank=0 group_world_size=1 local_ranks=[0, 1, 2, 3, 4, 5, 6, 7] role_ranks=[0, 1, 2, 3, 4, 5, 6, 7] global_ranks=[0, 1, 2, 3, 4, 5, 6, 7] role_world_sizes=[8, 8, 8, 8, 8, 8, 8, 8] global_world_sizes=[8, 8, 8, 8, 8, 8, 8, 8]
INFO:torch.distributed.elastic.agent.server.api:[default] Starting worker group

khurramHashmi commented 2 years ago

@GT9505 Could you please respond?

GT9505 commented 2 years ago

Please fill in the Error traceback. Also, what command did you run? In the Reproduction you use tools/dist_test.sh with 4 GPUs, while in the Checklist you say dist_train on 8 GPUs gives the error. Please make the error information clearer.

Generally, UnpicklingError: pickle data was truncated means that the file loaded with pickle.load is damaged or incomplete. You can try regenerating the file.
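If you want to rule that out, here is a minimal sketch for checking that a pickled file loads cleanly; the path below is a hypothetical placeholder, so point it at the .pkl file your config actually loads:

```python
# Minimal sketch: verify that a pickled file can be fully loaded.
import pickle

pkl_path = 'path/to/annotations.pkl'  # hypothetical placeholder path

with open(pkl_path, 'rb') as f:
    try:
        data = pickle.load(f)  # raises UnpicklingError if the file is truncated
        print('Loaded OK:', type(data))
    except pickle.UnpicklingError as err:
        print('File appears damaged; try regenerating it:', err)
```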

khurramHashmi commented 2 years ago

I have corrected the command. The pickle file seems to be fine, since it works without issue on 4 GPUs. The problem only appears with distributed training on 8 GPUs.

GT9505 commented 2 years ago

This is very weird. As mentioned here, when setting num_worker_per_gpu=1 you can run distributed training on 8 GPUs, while setting num_worker_per_gpu=2 makes distributed training on 8 GPUs raise the error.
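For reference, a minimal sketch of how the per-GPU dataloader worker count is typically lowered in an mmdetection-style config (which mmtracking configs follow); the key name workers_per_gpu and the _base_ path below are assumptions, so check what your config actually defines:

```python
# Sketch of a config override that lowers the dataloader workers per GPU.
# Assumptions: the base config uses an mmdetection-style `data` dict, and the
# relative path below is a hypothetical placeholder.
_base_ = ['./selsa_troialign_faster_rcnn_r50_dc5_7e_imagenetvid.py']

data = dict(
    workers_per_gpu=1,  # fewer dataloader worker processes per GPU
)
```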

khurramHashmi commented 2 years ago

Indeed, it is weird. The even weirder part is that it used to work with 8 GPUs before. Also, I have tried num_worker_per_gpu=1; it did not work.