open-mmlab / mmagic

OpenMMLab Multimodal Advanced, Generative, and Intelligent Creation Toolbox. Unlock the magic 🪄: Generative-AI (AIGC), easy-to-use APIs, awesome model zoo, diffusion models for text-to-image generation, image/video restoration/enhancement, etc.
https://mmagic.readthedocs.io/en/latest/
Apache License 2.0

[Bug] RealBasicVSR training Error : torch.distributed.elastic.multiprocessing.api:failed #1478

Closed · gihwan-kim closed this issue 1 year ago

gihwan-kim commented 1 year ago

Prerequisite

Task

I'm using the official example scripts/configs for the officially supported tasks/models/datasets.

Branch

master branch https://github.com/open-mmlab/mmediting

Environment

sys.platform: linux
Python: 3.7.11 (default, Jul 27 2021, 14:32:16) [GCC 7.5.0]
CUDA available: True
GPU 0: NVIDIA GeForce RTX 2080 Ti
CUDA_HOME: /usr/local/cuda
NVCC: Cuda compilation tools, release 11.2, V11.2.152
GCC: gcc (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0
PyTorch: 1.10.2
PyTorch compiling details: PyTorch built with:
TorchVision: 0.11.3
OpenCV: 4.5.4
MMCV: 1.5.0
MMCV Compiler: GCC 7.3
MMCV CUDA Compiler: 11.3
MMEditing: 0.16.0+7b3a8bd

Reproduces the problem - code sample

I just ran training again with the official config.

Reproduces the problem - command or script

./tools/dist_train.sh ./configs/restorers/real_basicvsr/realbasicvsr_wogan_c64b20_2x30x8_lr1e-4_300k_reds.py 1

Reproduces the problem - error message

  File "./tools/train.py", line 169, in <module>
    main()
  File "./tools/train.py", line 165, in main
    meta=meta)
  File "/home/gihwan/mmedit/mmedit/apis/train.py", line 104, in train_model
    meta=meta)
  File "/home/gihwan/mmedit/mmedit/apis/train.py", line 241, in _dist_train
    runner.run(data_loaders, cfg.workflow, cfg.total_iters)
  File "/home/gihwan/anaconda3/envs/openmmlab2/lib/python3.7/site-packages/mmcv/runner/iter_based_runner.py", line 134, in run
    iter_runner(iter_loaders[i], **kwargs)
  File "/home/gihwan/anaconda3/envs/openmmlab2/lib/python3.7/site-packages/mmcv/runner/iter_based_runner.py", line 59, in train
    data_batch = next(data_loader)
  File "/home/gihwan/anaconda3/envs/openmmlab2/lib/python3.7/site-packages/mmcv/runner/iter_based_runner.py", line 32, in __next__
    data = next(self.iter_loader)
  File "/home/gihwan/anaconda3/envs/openmmlab2/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 521, in __next__
    data = self._next_data()
  File "/home/gihwan/anaconda3/envs/openmmlab2/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 1203, in _next_data
    return self._process_data(data)
  File "/home/gihwan/anaconda3/envs/openmmlab2/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 1229, in _process_data
    data.reraise()
  File "/home/gihwan/anaconda3/envs/openmmlab2/lib/python3.7/site-packages/torch/_utils.py", line 434, in reraise
    raise exception
av.codec.codec.UnknownCodecError: Caught UnknownCodecError in DataLoader worker process 0.
Original Traceback (most recent call last):
  File "/home/gihwan/anaconda3/envs/openmmlab2/lib/python3.7/site-packages/torch/utils/data/_utils/worker.py", line 287, in _worker_loop
    data = fetcher.fetch(index)
  File "/home/gihwan/anaconda3/envs/openmmlab2/lib/python3.7/site-packages/torch/utils/data/_utils/fetch.py", line 49, in fetch
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/home/gihwan/anaconda3/envs/openmmlab2/lib/python3.7/site-packages/torch/utils/data/_utils/fetch.py", line 49, in <listcomp>
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/home/gihwan/mmedit/mmedit/datasets/dataset_wrappers.py", line 31, in __getitem__
    return self.dataset[idx % self._ori_len]
  File "/home/gihwan/mmedit/mmedit/datasets/base_sr_dataset.py", line 52, in __getitem__
    return self.pipeline(results)
  File "/home/gihwan/mmedit/mmedit/datasets/pipelines/compose.py", line 42, in __call__
    data = t(data)
  File "/home/gihwan/mmedit/mmedit/datasets/pipelines/random_degradations.py", line 547, in __call__
    results = degradation(results)
  File "/home/gihwan/mmedit/mmedit/datasets/pipelines/random_degradations.py", line 465, in __call__
    results[key] = self._apply_random_compression(results[key])
  File "/home/gihwan/mmedit/mmedit/datasets/pipelines/random_degradations.py", line 434, in _apply_random_compression
    stream = container.add_stream(codec, rate=1)
  File "av/container/output.pyx", line 64, in av.container.output.OutputContainer.add_stream
  File "av/codec/codec.pyx", line 184, in av.codec.codec.Codec.__cinit__
  File "av/codec/codec.pyx", line 193, in av.codec.codec.Codec._init
av.codec.codec.UnknownCodecError: libx264

ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 10741) of binary: /home/gihwan/anaconda3/envs/openmmlab2/bin/python
Traceback (most recent call last):
  File "/home/gihwan/anaconda3/envs/openmmlab2/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/home/gihwan/anaconda3/envs/openmmlab2/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/gihwan/anaconda3/envs/openmmlab2/lib/python3.7/site-packages/torch/distributed/launch.py", line 193, in <module>
    main()
  File "/home/gihwan/anaconda3/envs/openmmlab2/lib/python3.7/site-packages/torch/distributed/launch.py", line 189, in main
    launch(args)
  File "/home/gihwan/anaconda3/envs/openmmlab2/lib/python3.7/site-packages/torch/distributed/launch.py", line 174, in launch
    run(args)
  File "/home/gihwan/anaconda3/envs/openmmlab2/lib/python3.7/site-packages/torch/distributed/run.py", line 713, in run
    )(*cmd_args)
  File "/home/gihwan/anaconda3/envs/openmmlab2/lib/python3.7/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/gihwan/anaconda3/envs/openmmlab2/lib/python3.7/site-packages/torch/distributed/launcher/api.py", line 261, in launch_agent
    failures=result.failures,
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

Additional information

I'm trying to train RealBasicVSR to check whether it trains in my environment.
I have a similar issue to this one, but that issue isn't resolved yet.

LeoXing1996 commented 1 year ago

Hey @gihwan-kim, this seems to be a PyAV error, since av.codec.codec.UnknownCodecError: libx264 is raised. Which PyAV version are you using?
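
You can check whether the libx264 encoder is visible to your PyAV build with a minimal sketch like this:

    import av
    from av.codec.codec import UnknownCodecError

    try:
        av.Codec('libx264', 'w')  # 'w' asks for the encoder side
        print('libx264 encoder is available')
    except UnknownCodecError:
        print('libx264 encoder is missing from this PyAV/FFmpeg build')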

gihwan-kim commented 1 year ago

> Hey @gihwan-kim, this seems to be a PyAV error, since av.codec.codec.UnknownCodecError: libx264 is raised. Which PyAV version are you using?

>>> import av
>>> av.__version__
'8.0.2'

It's version 8.0.2.

LeoXing1996 commented 1 year ago

@gihwan-kim, can you try installing PyAV==8.0.3? By the way, what is your ffmpeg version?
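
For example (assuming the PyPI av package):

    pip install av==8.0.3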

gihwan-kim commented 1 year ago

> @gihwan-kim, can you try installing PyAV==8.0.3? By the way, what is your ffmpeg version?

Thank you for the kind reply! When I changed the PyAV version to 8.0.3, the UnknownCodecError no longer appeared. (My ffmpeg version is 4.3.) But a new issue occurred:

RuntimeError: CUDA out of memory. Tried to allocate 20.00 MiB (GPU 0; 10.75 GiB total capacity; 9.28 GiB already allocated; 18.75 MiB free; 9.34 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 12491) of binary: /home/gihwan/anaconda3/envs/openmmlab2/bin/python
Traceback (most recent call last):
  File "/home/gihwan/anaconda3/envs/openmmlab2/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/home/gihwan/anaconda3/envs/openmmlab2/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/gihwan/anaconda3/envs/openmmlab2/lib/python3.7/site-packages/torch/distributed/launch.py", line 193, in <module>
    main()
  File "/home/gihwan/anaconda3/envs/openmmlab2/lib/python3.7/site-packages/torch/distributed/launch.py", line 189, in main
    launch(args)
  File "/home/gihwan/anaconda3/envs/openmmlab2/lib/python3.7/site-packages/torch/distributed/launch.py", line 174, in launch
    run(args)
  File "/home/gihwan/anaconda3/envs/openmmlab2/lib/python3.7/site-packages/torch/distributed/run.py", line 713, in run
    )(*cmd_args)
  File "/home/gihwan/anaconda3/envs/openmmlab2/lib/python3.7/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/gihwan/anaconda3/envs/openmmlab2/lib/python3.7/site-packages/torch/distributed/launcher/api.py", line 261, in launch_agent
    failures=result.failures,
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

I think the training needs a large amount of memory. Should I change the training configuration?

This is the output of the nvidia-smi command:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.141.03   Driver Version: 470.141.03   CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  Off  | 00000000:01:00.0 Off |                  N/A |
| 35%   29C    P8     1W / 260W |    102MiB / 11011MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      1116      G   /usr/lib/xorg/Xorg                 39MiB |
|    0   N/A  N/A      1243      G   /usr/bin/gnome-shell               60MiB |
+-----------------------------------------------------------------------------+

LeoXing1996 commented 1 year ago

@gihwan-kim, training RealBasicVSR with the default config needs at least 17201MB of GPU memory; you can refer to the memory field in the training log.

I think you can try changing crop_size in the training pipeline to a smaller value to save memory.
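
For example, a hypothetical excerpt (the transform name and its position in the pipeline may differ in your config):

    train_pipeline = [
        # ... loading and augmentation transforms ...
        dict(type='Crop', keys=['gt'], crop_size=(128, 128),  # e.g. 256 -> 128
             random_crop=True),
        # ... degradation and formatting transforms ...
    ]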

gihwan-kim commented 1 year ago

> @gihwan-kim, training RealBasicVSR with the default config needs at least 17201MB of GPU memory; you can refer to the memory field in the training log.
>
> I think you can try changing crop_size in the training pipeline to a smaller value to save memory.

I was able to solve it by changing the workers_per_gpu, samples_per_gpu, and num_input_frames values in the config file (a sketch of those fields follows the traceback below). Thank you! But while training, I ran into a FileNotFoundError:

FileNotFoundError: Caught FileNotFoundError in DataLoader worker process 0.
Original Traceback (most recent call last):
  File "/home/gihwan/anaconda3/envs/openmmlab2/lib/python3.7/site-packages/torch/utils/data/_utils/worker.py", line 287, in _worker_loop
    data = fetcher.fetch(index)
  File "/home/gihwan/anaconda3/envs/openmmlab2/lib/python3.7/site-packages/torch/utils/data/_utils/fetch.py", line 49, in fetch
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/home/gihwan/anaconda3/envs/openmmlab2/lib/python3.7/site-packages/torch/utils/data/_utils/fetch.py", line 49, in <listcomp>
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/home/gihwan/mmedit/mmedit/datasets/base_sr_dataset.py", line 52, in __getitem__
    return self.pipeline(results)
  File "/home/gihwan/mmedit/mmedit/datasets/pipelines/compose.py", line 42, in __call__
    data = t(data)
  File "/home/gihwan/mmedit/mmedit/datasets/pipelines/loading.py", line 176, in __call__
    img_bytes = self.file_client.get(filepath)
  File "/home/gihwan/anaconda3/envs/openmmlab2/lib/python3.7/site-packages/mmcv/fileio/file_client.py", line 993, in get
    return self.client.get(filepath)
  File "/home/gihwan/anaconda3/envs/openmmlab2/lib/python3.7/site-packages/mmcv/fileio/file_client.py", line 518, in get
    with open(filepath, 'rb') as f:
FileNotFoundError: [Errno 2] No such file or directory: 'data/UDM10/BIx4/archpeople/00000000.png'
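
For reference, here is a hypothetical excerpt of the fields I lowered (values are examples; the exact nesting follows the master-branch config, where train wraps the dataset in a RepeatDataset):

    data = dict(
        workers_per_gpu=2,                          # fewer dataloader workers
        train_dataloader=dict(samples_per_gpu=1),   # smaller per-GPU batch
        train=dict(
            dataset=dict(num_input_frames=10)))     # fewer frames per sample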

My UDM10 data file path is 'data/UDM10/BIx4/archpeople/000.png'. Is there a naming rule or a guide for preprocessing the UDM10 files into BIx4 for RealBasicVSR? I found guides for REDS and VideoLQ in the paper and the official documentation, but I couldn't find where to download the UDM10 data or how to preprocess it. I only found a place to download it from this link: udm10

Z-Fran commented 1 year ago

@gihwan-kim For the master branch, you need to rename the images of your dataset. You can use a simple script to resolve it, like this:

    import os

    data_root = 'dataset/data/udm10/'
    save_root = 'dataset/data/UDM10/'

    # The GT/ and BDx4/ parent folders must exist before copying into them.
    os.makedirs(save_root + 'GT/', exist_ok=True)
    os.makedirs(save_root + 'BDx4/', exist_ok=True)

    # Copy each sequence into zero-padded folders: 00000000, 00000001, ...
    dirs = sorted(os.listdir(data_root), key=str.lower)
    for num, _dir in enumerate(dirs):
        sub_root1 = save_root + 'GT/' + str(num).zfill(8)
        sub_root2 = save_root + 'BDx4/' + str(num).zfill(8)
        os.system('cp -r ' + data_root + _dir + '/truth/ ' + sub_root1)
        os.system('cp -r ' + data_root + _dir + '/blur4/ ' + sub_root2)
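
After running it, each sequence lands in a zero-padded folder, e.g. dataset/data/UDM10/GT/00000000/ and dataset/data/UDM10/BDx4/00000000/ for archpeople.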

For the 1.x or dev-1.x branch, if your UDM10 data file path is 'data/UDM10/BIx4/archpeople/000.png', you can simply add a parameter like filename_tmpl='{:03d}.png'. See https://github.com/open-mmlab/mmediting/blob/dev-1.x/configs/real_basicvsr/realbasicvsr_wogan-c64b20-2x30x8_8xb2-lr1e-4-300k_reds.py#L204
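
A hypothetical sketch of where that parameter would go, following the linked config (check your local copy for the exact field names):

    val_dataloader = dict(
        dataset=dict(
            type='BasicFramesDataset',
            data_root='data/UDM10',
            data_prefix=dict(img='BIx4', gt='GT'),
            filename_tmpl='{:03d}.png'))  # matches 000.png, 001.png, ...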

gihwan-kim commented 1 year ago

> @gihwan-kim For the master branch, you need to rename the images of your dataset. You can use a simple script to resolve it, like this:
>
>     import os
>
>     data_root = 'dataset/data/udm10/'
>     save_root = 'dataset/data/UDM10/'
>
>     # The GT/ and BDx4/ parent folders must exist before copying into them.
>     os.makedirs(save_root + 'GT/', exist_ok=True)
>     os.makedirs(save_root + 'BDx4/', exist_ok=True)
>
>     # Copy each sequence into zero-padded folders: 00000000, 00000001, ...
>     dirs = sorted(os.listdir(data_root), key=str.lower)
>     for num, _dir in enumerate(dirs):
>         sub_root1 = save_root + 'GT/' + str(num).zfill(8)
>         sub_root2 = save_root + 'BDx4/' + str(num).zfill(8)
>         os.system('cp -r ' + data_root + _dir + '/truth/ ' + sub_root1)
>         os.system('cp -r ' + data_root + _dir + '/blur4/ ' + sub_root2)
>
> For the 1.x or dev-1.x branch, if your UDM10 data file path is 'data/UDM10/BIx4/archpeople/000.png', you can simply add a parameter like filename_tmpl='{:03d}.png'. See https://github.com/open-mmlab/mmediting/blob/dev-1.x/configs/real_basicvsr/realbasicvsr_wogan-c64b20-2x30x8_8xb2-lr1e-4-300k_reds.py#L204

Thank you for your kind help! As I mentioned, I have a question about the UDM10 validation dataset for RealBasicVSR. I downloaded the udm10 dataset from this udm10 download site. The site's udm10 directory structure is:

./udm10
├── archpeople
│   ├── blur4
│   └── truth
├── archwall
│   ├── blur4
│   └── truth
├── auditorium
│   ├── blur4
│   └── truth
├── band
│   ├── blur4
│   └── truth
├── caffe
│   ├── blur4
│   └── truth
├── camera
│   ├── blur4
│   └── truth
├── clap
│   ├── blur4
│   └── truth
├── lake
│   ├── blur4
│   └── truth
├── photography
│   ├── blur4
│   └── truth
└── polyflow
    ├── blur4
    └── truth

Is the blur4 data the same as BIx4? Or do I have to do pre-processing as in this link?

And does BIx4 mean bicubic interpolation x4 downsampling?

Z-Fran commented 1 year ago

Blur4 is not BIx4 or BDx4; BIx4 and BDx4 are both pre-processed using MATLAB. For BDx4, you need to use the MATLAB script https://github.com/ckkelvinchan/BasicVSR-IconVSR/blob/main/BD_degradation.m. For BIx4, you can simply use MATLAB's imresize to generate the data, or a Python implementation of it such as https://github.com/fatheral/matlab_imresize/blob/master/imresize.py. I can provide my data if you need it. And yes, BIx4 means bicubic interpolation x4 downsampling. @gihwan-kim
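
If it helps, here is a rough sketch of generating BIx4 frames with that Python port (hypothetical paths; it assumes imresize.py from the linked repo is importable, and depending on its version you may need a dtype conversion before saving):

    import os
    import cv2
    from imresize import imresize  # from fatheral/matlab_imresize

    gt_root = 'data/UDM10/GT'
    lq_root = 'data/UDM10/BIx4'
    for seq in sorted(os.listdir(gt_root)):
        os.makedirs(os.path.join(lq_root, seq), exist_ok=True)
        for name in sorted(os.listdir(os.path.join(gt_root, seq))):
            img = cv2.imread(os.path.join(gt_root, seq, name))
            # x4 bicubic downsampling, matching MATLAB's imresize behaviour
            lq = imresize(img, scalar_scale=0.25)
            cv2.imwrite(os.path.join(lq_root, seq, name), lq)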

gihwan-kim commented 1 year ago

> Blur4 is not BIx4 or BDx4; BIx4 and BDx4 are both pre-processed using MATLAB. For BDx4, you need to use the MATLAB script https://github.com/ckkelvinchan/BasicVSR-IconVSR/blob/main/BD_degradation.m. For BIx4, you can simply use MATLAB's imresize to generate the data, or a Python implementation of it such as https://github.com/fatheral/matlab_imresize/blob/master/imresize.py. I can provide my data if you need it. And yes, BIx4 means bicubic interpolation x4 downsampling. @gihwan-kim

I will check the imresize code you mentioned, thank you! :) If it's okay with you, I would be grateful if you could send your data.

Z-Fran commented 1 year ago

https://drive.google.com/file/d/1G4V4KZZhhfzUlqHiSBBuWyqLyIOvOs0W/view?usp=share_link @gihwan-kim

gihwan-kim commented 1 year ago

> https://drive.google.com/file/d/1G4V4KZZhhfzUlqHiSBBuWyqLyIOvOs0W/view?usp=share_link @gihwan-kim

Thank you!