open-mmlab / mmocr

OpenMMLab Text Detection, Recognition and Understanding Toolbox
https://mmocr.readthedocs.io/en/dev-1.x/
Apache License 2.0

[Bug] Resume training from different GPU #1952

Open Daisy5296 opened 1 year ago

Daisy5296 commented 1 year ago

Prerequisite

Task

I have modified the scripts/configs, or I'm working on my own tasks/models/datasets.

Branch

main branch https://github.com/open-mmlab/mmocr

Environment

sys.platform: linux
Python: 3.8.13 | packaged by conda-forge | (default, Mar 25 2022, 06:04:10) [GCC 10.3.0]
CUDA available: True
numpy_random_seed: 2147483648
GPU 0,1: Tesla P100-PCIE-16GB
CUDA_HOME: /usr/local/cuda
NVCC: Cuda compilation tools, release 11.7, V11.7.99
GCC: gcc (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0
PyTorch: 1.13.0a0+340c412
PyTorch compiling details: PyTorch built with:
  - GCC 9.4
  - C++ Version: 201402
  - Intel(R) Math Kernel Library Version 2020.0.4 Product Build 20200917 for Intel(R) 64 architecture applications
  - Intel(R) MKL-DNN v2.6.0 (Git Hash N/A)
  - OpenMP 201511 (a.k.a. OpenMP 4.5)
  - LAPACK is enabled (usually provided by MKL)
  - NNPACK is enabled
  - CPU capability usage: AVX512
  - CUDA Runtime 11.7
  - NVCC architecture flags: -gencode;arch=compute_52,code=sm_52;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_61,code=sm_61;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75;-gencode;arch=compute_80,code=sm_80;-gencode;arch=compute_86,code=sm_86;-gencode;arch=compute_86,code=compute_86
  - CuDNN 8.4.1 (built against CUDA 11.6)
  - Magma 2.6.2
  - Build settings: BLAS_INFO=mkl, BUILD_TYPE=Release, CUDA_VERSION=11.7, CUDNN_VERSION=8.4.1, CXX_COMPILER=/usr/bin/c++, CXX_FLAGS=-fno-gnu-unique -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -fopenmp -DNDEBUG -DUSE_KINETO -DUSE_FBGEMM -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -DSYMBOLICATE_MOBILE_DEBUG_HANDLE -DEDGE_PROFILER_USE_KINETO -O2 -fPIC -Wno-narrowing -Wall -Wextra -Werror=return-type -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-unused-parameter -Wno-unused-function -Wno-unused-result -Wno-unused-local-typedefs -Wno-strict-overflow -Wno-strict-aliasing -Wno-error=deprecated-declarations -Wno-stringop-overflow -Wno-psabi -Wno-error=pedantic -Wno-error=redundant-decls -Wno-error=old-style-cast -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Werror=cast-function-type -Wno-stringop-overflow, LAPACK_INFO=mkl, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, TORCH_VERSION=1.13.0, USE_CUDA=ON, USE_CUDNN=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=OFF, USE_MPI=ON, USE_NCCL=ON, USE_NNPACK=ON, USE_OPENMP=ON, USE_ROCM=OFF,
TorchVision: 0.13.0a0
OpenCV: 4.2.0
MMEngine: 0.7.2
MMOCR: 1.0.0+d7c59f3

Reproduces the problem - code sample

Here is my config file:

_base_ = [
    '_base_dbnet_resnet18_fpnc.py',
    '../_base_/datasets/icdar2015.py',
    '../_base_/datasets/rects.py',
    '../_base_/default_runtime.py',
    '../_base_/schedules/schedule_sgd_1200e.py',
]

train_list = [_base_.icdar2015_textdet_train, _base_.rects_textdet_train]
test_list = [_base_.icdar2015_textdet_test, _base_.rects_textdet_test]

train_dataset = dict(
    type='ConcatDataset', datasets=train_list, pipeline=_base_.train_pipeline)
test_dataset = dict(
    type='ConcatDataset', datasets=test_list, pipeline=_base_.test_pipeline)

train_dataloader = dict(
    batch_size=32,
    num_workers=8,
    persistent_workers=True,
    sampler=dict(type='DefaultSampler', shuffle=True),
    dataset=train_dataset)

val_dataloader = dict(
    batch_size=1,
    num_workers=4,
    persistent_workers=True,
    sampler=dict(type='DefaultSampler', shuffle=False),
    dataset=test_dataset)

test_dataloader = val_dataloader

load_from = '/dfs/data/mmocr/work_dirs/epoch80.pth'
resume = False

train_cfg = dict(type='EpochBasedTrainLoop', max_epochs=3000, val_interval=20)
optim_wrapper = dict(
    optimizer=dict(type='SGD', lr=0.0007, momentum=0.9, weight_decay=0.00001))

auto_scale_lr = dict(base_batch_size=32)
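
For reference, auto_scale_lr only takes effect when automatic LR scaling is enabled for the run; as I understand it, MMEngine then scales the learning rate linearly by the ratio of the actual total batch size to base_batch_size. The snippet below is illustrative arithmetic for this config under that assumption, not output from a real run:

```python
# Illustrative arithmetic only (not MMEngine source): how linear LR auto-scaling
# would treat this config if automatic scaling is enabled for the run.
base_batch_size = 32     # auto_scale_lr['base_batch_size'] from the config
per_gpu_batch_size = 32  # train_dataloader['batch_size']
base_lr = 0.0007         # optimizer lr from the config

for num_gpus in (1, 2):
    real_batch_size = num_gpus * per_gpu_batch_size
    scaled_lr = base_lr * real_batch_size / base_batch_size
    print(f'{num_gpus} GPU(s): total batch {real_batch_size}, lr -> {scaled_lr:g}')

# 1 GPU : total batch 32, lr -> 0.0007
# 2 GPUs: total batch 64, lr -> 0.0014
```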

Reproduces the problem - command or script

I started training on a single GPU with:

CUDA_VISIBLE_DEVICES=0 python tools/train.py configs/textdet/dbnet/dbnet_resnet18_fpnc_1200e_rects_icdar2015.py

and then tried to resume training on 2 GPUs with:

tools/dist_train.sh configs/textdet/dbnet/dbnet_resnet18_fpnc_1200e_rects_icdar2015.py 2
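
The traceback in the next section fails inside MMEngine's distributed setup. Roughly, each process launched by dist_train.sh runs something like the following before training starts; this is a simplified sketch based on the traceback, not the exact MMEngine code:

```python
import os

import torch
import torch.distributed as dist

# Simplified sketch of the per-process setup when tools/train.py runs with
# --launcher pytorch (mirrors the call chain shown in the traceback below).
local_rank = int(os.environ['LOCAL_RANK'])
torch.cuda.set_device(local_rank)       # <- this is where Error 803 is raised
dist.init_process_group(backend='nccl')
```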

Reproduces the problem - error message

The following error occurs when I switch to 2 GPUs:

/opt/conda/lib/python3.8/site-packages/torch/cuda/__init__.py:83: UserWarning: CUDA initialization: Unexpected error from cudaGetDeviceCount(). Did you run some cuda functions before calling NumCudaDevices() that might have already set an error? Error 803: system has unsupported display driver / cuda driver combination (Triggered internally at /opt/pytorch/pytorch/c10/cuda/CUDAFunctions.cpp:109.)
  return torch._C._cuda_getDeviceCount() > 0
/opt/conda/lib/python3.8/site-packages/torch/distributed/launch.py:178: FutureWarning: The module torch.distributed.launch is deprecated and will be removed in future. Use torchrun. Note that --use_env is set by default in torchrun. If your script expects --local_rank argument to be set, please change it to read from os.environ['LOCAL_RANK'] instead. See https://pytorch.org/docs/stable/distributed.html#launch-utility for further instructions
  warnings.warn(
WARNING:torch.distributed.run:
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
/opt/conda/lib/python3.8/site-packages/torch/cuda/__init__.py:83: UserWarning: CUDA initialization: Unexpected error from cudaGetDeviceCount(). Did you run some cuda functions before calling NumCudaDevices() that might have already set an error? Error 803: system has unsupported display driver / cuda driver combination (Triggered internally at /opt/pytorch/pytorch/c10/cuda/CUDAFunctions.cpp:109.)
  return torch._C._cuda_getDeviceCount() > 0
/opt/conda/lib/python3.8/site-packages/mmengine/utils/dl_utils/setup_env.py:56: UserWarning: Setting MKL_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
  warnings.warn(
Traceback (most recent call last):
  File "tools/train.py", line 114, in <module>
    main()
  File "tools/train.py", line 103, in main
    runner = Runner.from_cfg(cfg)
  File "/opt/conda/lib/python3.8/site-packages/mmengine/runner/runner.py", line 439, in from_cfg
    runner = cls(
  File "/opt/conda/lib/python3.8/site-packages/mmengine/runner/runner.py", line 353, in __init__
    self.setup_env(env_cfg)
  File "/opt/conda/lib/python3.8/site-packages/mmengine/runner/runner.py", line 652, in setup_env
    init_dist(self.launcher, **dist_cfg)
  File "/opt/conda/lib/python3.8/site-packages/mmengine/dist/utils.py", line 70, in init_dist
    _init_dist_pytorch(backend, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/mmengine/dist/utils.py", line 107, in _init_dist_pytorch
    torch.cuda.set_device(local_rank)
  File "/opt/conda/lib/python3.8/site-packages/torch/cuda/__init__.py", line 314, in set_device
    torch._C._cuda_setDevice(device)
  File "/opt/conda/lib/python3.8/site-packages/torch/cuda/__init__.py", line 217, in _lazy_init
    torch._C._cuda_init()
RuntimeError: Unexpected error from cudaGetDeviceCount(). Did you run some cuda functions before calling NumCudaDevices() that might have already set an error? Error 803: system has unsupported display driver / cuda driver combination
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 544) of binary: /opt/conda/bin/python
Traceback (most recent call last):
  File "/opt/conda/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/opt/conda/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/launch.py", line 193, in <module>
    main()
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/launch.py", line 189, in main
    launch(args)
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/launch.py", line 174, in launch
    run(args)
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/run.py", line 752, in run
    elastic_launch(
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 245, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

tools/train.py FAILED

Failures:
[1]:
  time       : 2023-07-10_14:12:42
  host       : workspace
  rank       : 1 (local_rank: 1)
  exitcode   : 1 (pid: 545)
  error_file : <N/A>
  traceback  : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html

Root Cause (first observed failure):
[0]:
  time       : 2023-07-10_14:12:42
  host       : workspace
  rank       : 0 (local_rank: 0)
  exitcode   : 1 (pid: 544)
  error_file : <N/A>
  traceback  : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html

Additional information

Also, if I resume on a single GPU that is different from the one used before, training starts normally but appears to run on CPU only.
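
A quick way to check whether that single-GPU run actually sees the GPU is plain PyTorch, nothing MMOCR-specific; if CUDA is not visible here, training will end up on the CPU:

```python
import os

import torch

# Environment check before launching training: print what PyTorch can see.
print('CUDA_VISIBLE_DEVICES =', os.environ.get('CUDA_VISIBLE_DEVICES'))
print('torch.cuda.is_available() =', torch.cuda.is_available())
print('torch.cuda.device_count() =', torch.cuda.device_count())
if torch.cuda.is_available():
    for i in range(torch.cuda.device_count()):
        print(i, torch.cuda.get_device_name(i))
```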

Is it not supported to change GPUs when resuming, either the number of GPUs or the specific GPU used? Also, what do resume = True and resume = False mean, and how do they relate to the load_from setting?

gaotongxiao commented 1 year ago

Thanks for your feedback, I've forwarded this issue to the MMEngine team. Personally, though, I think resuming a task with a different number of GPUs will probably not be supported in the near future.

For the difference between resume and load_from, see https://mmocr.readthedocs.io/en/dev-1.x/user_guides/config.html#checkpoint-loading-configuration
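
In short, and only as a rough summary of the linked page rather than its exact wording: load_from specifies which checkpoint file to load, while resume controls whether the training state (epoch counter, optimizer and param scheduler states) is restored from it as well. Something like:

```python
# Option 1 - fine-tune from a checkpoint: weights are loaded, but training
# starts again from epoch 0 with a fresh optimizer and scheduler.
load_from = '/dfs/data/mmocr/work_dirs/epoch80.pth'
resume = False

# Option 2 - resume an interrupted run: the epoch counter, optimizer and param
# scheduler states are restored from the checkpoint as well. If load_from is
# None while resume is True, MMEngine picks up the latest checkpoint in work_dir.
# load_from = '/dfs/data/mmocr/work_dirs/epoch80.pth'
# resume = True
```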