Open Daisy5296 opened 1 year ago
Thanks for your feedback. I've forwarded this issue to the MMEngine team. Personally, though, I think resuming a task with a different number of GPUs will probably not be supported in the near future.
For the difference between resume and load_from, see https://mmocr.readthedocs.io/en/dev-1.x/user_guides/config.html#checkpoint-loading-configuration
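As a rough illustration of how the two settings interact in MMEngine-based configs, the sketch below summarizes the combinations described in the linked guide. The helper checkpoint_mode is hypothetical and exists only for illustration; it is not part of MMEngine or MMOCR.

```python
from typing import Optional


def checkpoint_mode(load_from: Optional[str], resume: bool) -> str:
    """Hypothetical helper: describe what a load_from/resume pair does."""
    if resume and load_from is None:
        # Auto-resume: continue from the latest checkpoint in work_dir,
        # or start fresh if no checkpoint is found there.
        return 'resume from the latest checkpoint in work_dir'
    if resume:
        # Restore weights plus optimizer state, scheduler, and epoch counter.
        return f'resume full training state from {load_from}'
    if load_from is not None:
        # Weights only: training restarts from epoch 0 (fine-tuning).
        return f'load weights only from {load_from}'
    return 'train from scratch'


# The config in this issue sets load_from and resume = False.
print(checkpoint_mode('/dfs/data/mmocr/work_dirs/epoch80.pth', False))
```

With the config from this issue, only the model weights would be loaded and the epoch counter would start again from zero, so the run is a fine-tune rather than a true resume.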
Prerequisite
Task
I have modified the scripts/configs, or I'm working on my own tasks/models/datasets.
Branch
main branch https://github.com/open-mmlab/mmocr
Environment
sys.platform: linux Python: 3.8.13 | packaged by conda-forge | (default, Mar 25 2022, 06:04:10) [GCC 10.3.0] CUDA available: True numpy_random_seed: 2147483648 GPU 0,1: Tesla P100-PCIE-16GB CUDA_HOME: /usr/local/cuda NVCC: Cuda compilation tools, release 11.7, V11.7.99 GCC: gcc (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0 PyTorch: 1.13.0a0+340c412 PyTorch compiling details: PyTorch built with:
GCC 9.4 C++ Version: 201402 Intel(R) Math Kernel Library Version 2020.0.4 Product Build 20200917 for Intel(R) 64 architecture applications Intel(R) MKL-DNN v2.6.0 (Git Hash N/A) OpenMP 201511 (a.k.a. OpenMP 4.5) LAPACK is enabled (usually provided by MKL) NNPACK is enabled CPU capability usage: AVX512 CUDA Runtime 11.7 NVCC architecture flags: -gencode;arch=compute_52,code=sm_52;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_61,code=sm_61;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75;-gencode;arch=compute_80,code=sm_80;-gencode;arch=compute_86,code=sm_86;-gencode;arch=compute_86,code=compute_86 CuDNN 8.4.1 (built against CUDA 11.6) Magma 2.6.2 Build settings: BLAS_INFO=mkl, BUILD_TYPE=Release, CUDA_VERSION=11.7, CUDNN_VERSION=8.4.1, CXX_COMPILER=/usr/bin/c++, CXX_FLAGS=-fno-gnu-unique -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -fopenmp -DNDEBUG -DUSE_KINETO -DUSE_FBGEMM -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -DSYMBOLICATE_MOBILE_DEBUG_HANDLE -DEDGE_PROFILER_USE_KINETO -O2 -fPIC -Wno-narrowing -Wall -Wextra -Werror=return-type -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-unused-parameter -Wno-unused-function -Wno-unused-result -Wno-unused-local-typedefs -Wno-strict-overflow -Wno-strict-aliasing -Wno-error=deprecated-declarations -Wno-stringop-overflow -Wno-psabi -Wno-error=pedantic -Wno-error=redundant-decls -Wno-error=old-style-cast -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Werror=cast-function-type -Wno-stringop-overflow, LAPACK_INFO=mkl, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, TORCH_VERSION=1.13.0, USE_CUDA=ON, USE_CUDNN=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=OFF, USE_MPI=ON, USE_NCCL=ON, USE_NNPACK=ON, USE_OPENMP=ON, USE_ROCM=OFF, TorchVision: 0.13.0a0 OpenCV: 4.2.0 MMEngine: 0.7.2 MMOCR: 1.0.0+d7c59f3
Reproduces the problem - code sample
Here is my config file:
_base_ = [
    '_base_dbnet_resnet18_fpnc.py',
    '../_base_/datasets/icdar2015.py',
    '../_base_/datasets/rects.py',
    '../_base_/default_runtime.py',
    '../_base_/schedules/schedule_sgd_1200e.py',
]

train_list = [_base_.icdar2015_textdet_train, _base_.rects_textdet_train]
test_list = [_base_.icdar2015_textdet_test, _base_.rects_textdet_test]

train_dataset = dict(
    type='ConcatDataset', datasets=train_list, pipeline=_base_.train_pipeline)
test_dataset = dict(
    type='ConcatDataset', datasets=test_list, pipeline=_base_.test_pipeline)

train_dataloader = dict(
    batch_size=32,
    num_workers=8,
    persistent_workers=True,
    sampler=dict(type='DefaultSampler', shuffle=True),
    dataset=train_dataset)

val_dataloader = dict(
    batch_size=1,
    num_workers=4,
    persistent_workers=True,
    sampler=dict(type='DefaultSampler', shuffle=False),
    dataset=test_dataset)

test_dataloader = val_dataloader

load_from = '/dfs/data/mmocr/work_dirs/epoch80.pth'
resume = False
train_cfg = dict(type='EpochBasedTrainLoop', max_epochs=3000, val_interval=20)
optim_wrapper = dict(
    optimizer=dict(type='SGD', lr=0.0007, momentum=0.9, weight_decay=0.00001))

auto_scale_lr = dict(base_batch_size=32)
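Since the question involves changing the number of GPUs, it may help to see how the effective learning rate would move under the linear scaling rule that auto_scale_lr is based on. This is a minimal sketch, assuming automatic scaling is actually switched on (as far as I know, MMEngine only applies it when auto_scale_lr also contains enable=True, which the config above does not set). The function scaled_lr is a hypothetical name used for illustration only.

```python
import math


def scaled_lr(base_lr: float, batch_size_per_gpu: int, num_gpus: int,
              base_batch_size: int) -> float:
    """Linear scaling rule: lr grows in proportion to the total batch size."""
    total_batch_size = batch_size_per_gpu * num_gpus
    return base_lr * total_batch_size / base_batch_size


# Values from the config above: lr=0.0007, batch_size=32, base_batch_size=32.
lr_1gpu = scaled_lr(0.0007, 32, 1, 32)
lr_2gpu = scaled_lr(0.0007, 32, 2, 32)

# One GPU keeps the configured rate; two GPUs double the total batch size.
assert math.isclose(lr_1gpu, 0.0007)
assert math.isclose(lr_2gpu, 0.0014)
```

Without scaling enabled, the learning rate stays at 0.0007 while the total batch size doubles on 2 GPUs, which is one reason a run continued on a different GPU count may not behave like the original one.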
Reproduces the problem - command or script
I started training with the following command:
CUDA_VISIBLE_DEVICES=0 python tools/train.py configs/textdet/dbnet/dbnet_resnet18_fpnc_1200e_rects_icdar2015.py
I then tried to resume training on 2 GPUs with:
tools/dist_train.sh configs/textdet/dbnet/dbnet_resnet18_fpnc_1200e_rects_icdar2015.py 2
Reproduces the problem - error message
The following error occurs when I switch to 2 GPUs:
/opt/conda/lib/python3.8/site-packages/torch/cuda/__init__.py:83: UserWarning: CUDA initialization: Unexpected error from cudaGetDeviceCount(). Did you run some cuda functions before calling NumCudaDevices() that might have already set an error? Error 803: system has unsupported display driver / cuda driver combination (Triggered internally at /opt/pytorch/pytorch/c10/cuda/CUDAFunctions.cpp:109.)
  return torch._C._cuda_getDeviceCount() > 0
/opt/conda/lib/python3.8/site-packages/torch/distributed/launch.py:178: FutureWarning: The module torch.distributed.launch is deprecated and will be removed in future. Use torchrun. Note that --use_env is set by default in torchrun. If your script expects --local_rank argument to be set, please change it to read from os.environ['LOCAL_RANK'] instead. See https://pytorch.org/docs/stable/distributed.html#launch-utility for further instructions
  warnings.warn(
WARNING:torch.distributed.run:
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
/opt/conda/lib/python3.8/site-packages/torch/cuda/__init__.py:83: UserWarning: CUDA initialization: Unexpected error from cudaGetDeviceCount(). Did you run some cuda functions before calling NumCudaDevices() that might have already set an error? Error 803: system has unsupported display driver / cuda driver combination (Triggered internally at /opt/pytorch/pytorch/c10/cuda/CUDAFunctions.cpp:109.)
  return torch._C._cuda_getDeviceCount() > 0
/opt/conda/lib/python3.8/site-packages/torch/cuda/__init__.py:83: UserWarning: CUDA initialization: Unexpected error from cudaGetDeviceCount(). Did you run some cuda functions before calling NumCudaDevices() that might have already set an error? Error 803: system has unsupported display driver / cuda driver combination (Triggered internally at /opt/pytorch/pytorch/c10/cuda/CUDAFunctions.cpp:109.)
  return torch._C._cuda_getDeviceCount() > 0
/opt/conda/lib/python3.8/site-packages/mmengine/utils/dl_utils/setup_env.py:56: UserWarning: Setting MKL_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
  warnings.warn(
/opt/conda/lib/python3.8/site-packages/mmengine/utils/dl_utils/setup_env.py:56: UserWarning: Setting MKL_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
  warnings.warn(
Traceback (most recent call last):
Traceback (most recent call last):
  File "tools/train.py", line 114, in <module>
  File "tools/train.py", line 114, in <module>
    main()
  File "tools/train.py", line 103, in main
    main()
  File "tools/train.py", line 103, in main
    runner = Runner.from_cfg(cfg)
  File "/opt/conda/lib/python3.8/site-packages/mmengine/runner/runner.py", line 439, in from_cfg
    runner = Runner.from_cfg(cfg)
  File "/opt/conda/lib/python3.8/site-packages/mmengine/runner/runner.py", line 439, in from_cfg
    runner = cls(
  File "/opt/conda/lib/python3.8/site-packages/mmengine/runner/runner.py", line 353, in __init__
    runner = cls(
  File "/opt/conda/lib/python3.8/site-packages/mmengine/runner/runner.py", line 353, in __init__
    self.setup_env(env_cfg)
  File "/opt/conda/lib/python3.8/site-packages/mmengine/runner/runner.py", line 652, in setup_env
    self.setup_env(env_cfg)
  File "/opt/conda/lib/python3.8/site-packages/mmengine/runner/runner.py", line 652, in setup_env
    init_dist(self.launcher, dist_cfg)
  File "/opt/conda/lib/python3.8/site-packages/mmengine/dist/utils.py", line 70, in init_dist
    init_dist(self.launcher, dist_cfg)
  File "/opt/conda/lib/python3.8/site-packages/mmengine/dist/utils.py", line 70, in init_dist
    _init_dist_pytorch(backend, **kwargs)
    _init_dist_pytorch(backend, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/mmengine/dist/utils.py", line 107, in _init_dist_pytorch
  File "/opt/conda/lib/python3.8/site-packages/mmengine/dist/utils.py", line 107, in _init_dist_pytorch
    torch.cuda.set_device(local_rank)
    torch.cuda.set_device(local_rank)
  File "/opt/conda/lib/python3.8/site-packages/torch/cuda/__init__.py", line 314, in set_device
  File "/opt/conda/lib/python3.8/site-packages/torch/cuda/__init__.py", line 314, in set_device
    torch._C._cuda_setDevice(device)
    torch._C._cuda_setDevice(device)
  File "/opt/conda/lib/python3.8/site-packages/torch/cuda/__init__.py", line 217, in _lazy_init
  File "/opt/conda/lib/python3.8/site-packages/torch/cuda/__init__.py", line 217, in _lazy_init
    torch._C._cuda_init()
    torch._C._cuda_init()
RuntimeError: Unexpected error from cudaGetDeviceCount(). Did you run some cuda functions before calling NumCudaDevices() that might have already set an error? Error 803: system has unsupported display driver / cuda driver combination
RuntimeError: Unexpected error from cudaGetDeviceCount(). Did you run some cuda functions before calling NumCudaDevices() that might have already set an error? Error 803: system has unsupported display driver / cuda driver combination
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 544) of binary: /opt/conda/bin/python
Traceback (most recent call last):
  File "/opt/conda/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/opt/conda/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/launch.py", line 193, in <module>
    main()
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/launch.py", line 189, in main
    launch(args)
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/launch.py", line 174, in launch
    run(args)
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/run.py", line 752, in run
    elastic_launch(
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 245, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
tools/train.py FAILED
Failures:
[1]:
  time      : 2023-07-10_14:12:42
  host      : workspace
  rank      : 1 (local_rank: 1)
  exitcode  : 1 (pid: 545)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
Root Cause (first observed failure):
[0]:
  time      : 2023-07-10_14:12:42
  host      : workspace
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 544)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
Additional information
In addition, if I resume on a single GPU that is not the one I used before, training starts normally but appears to run on the CPU only.
Is changing GPUs when resuming unsupported, whether that means a different number of GPUs or a different type of GPU? Also, what does resume = True or False mean, and how does it relate to the load_from setting?
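The Error 803 message in the log is raised while the process group is being initialized, before any checkpoint is read, which suggests the host's driver and CUDA runtime may be mismatched rather than the checkpoint being at fault. The sketch below is one way to check what a plain Python process can see. It only uses standard PyTorch calls; the function name cuda_report is made up for this example.

```python
import importlib.util
import os


def cuda_report() -> dict:
    """Collect basic facts about which GPUs this Python process can see."""
    report = {
        'CUDA_VISIBLE_DEVICES': os.environ.get('CUDA_VISIBLE_DEVICES'),
        'torch_installed': importlib.util.find_spec('torch') is not None,
        'cuda_available': False,
        'device_count': 0,
        'device_names': [],
    }
    if not report['torch_installed']:
        return report

    import torch

    # CUDA version PyTorch was built against (None for CPU-only builds).
    report['torch_cuda_version'] = torch.version.cuda
    report['cuda_available'] = torch.cuda.is_available()
    if report['cuda_available']:
        report['device_count'] = torch.cuda.device_count()
        report['device_names'] = [
            torch.cuda.get_device_name(i)
            for i in range(report['device_count'])
        ]
    return report


for key, value in cuda_report().items():
    print(f'{key}: {value}')
```

If cuda_available comes back False on the machine with 2 GPUs, that would also be consistent with the single-GPU run silently falling back to the CPU.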