open-mmlab / mmagic

OpenMMLab Multimodal Advanced, Generative, and Intelligent Creation Toolbox. Unlock the magic 🪄: Generative-AI (AIGC), easy-to-use APIs, awesome model zoo, diffusion models for text-to-image generation, image/video restoration/enhancement, etc.
https://mmagic.readthedocs.io/en/latest/
Apache License 2.0

[Bug] RealBasicVSR training Error : torch.distributed.elastic.multiprocessing.api:failed #1478

Closed · gihwan-kim closed this issue 1 year ago

gihwan-kim commented 1 year ago

Prerequisite

Task

I'm using the official example scripts/configs for the officially supported tasks/models/datasets.

Branch

master branch https://github.com/open-mmlab/mmediting

Environment

sys.platform: linux
Python: 3.7.11 (default, Jul 27 2021, 14:32:16) [GCC 7.5.0]
CUDA available: True
GPU 0: NVIDIA GeForce RTX 2080 Ti
CUDA_HOME: /usr/local/cuda
NVCC: Cuda compilation tools, release 11.2, V11.2.152
GCC: gcc (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0
PyTorch: 1.10.2
PyTorch compiling details: PyTorch built with:
TorchVision: 0.11.3
OpenCV: 4.5.4
MMCV: 1.5.0
MMCV Compiler: GCC 7.3
MMCV CUDA Compiler: 11.3
MMEditing: 0.16.0+7b3a8bd

Reproduces the problem - code sample

I just ran training again with the official config.

Reproduces the problem - command or script

./tools/dist_train.sh ./configs/restorers/real_basicvsr/realbasicvsr_wogan_c64b20_2x30x8_lr1e-4_300k_reds.py 1

Reproduces the problem - error message

  File "./tools/train.py", line 169, in <module>
    main()
  File "./tools/train.py", line 165, in main
    meta=meta)
  File "/home/gihwan/mmedit/mmedit/apis/train.py", line 104, in train_model
    meta=meta)
  File "/home/gihwan/mmedit/mmedit/apis/train.py", line 241, in _dist_train
    runner.run(data_loaders, cfg.workflow, cfg.total_iters)
  File "/home/gihwan/anaconda3/envs/openmmlab2/lib/python3.7/site-packages/mmcv/runner/iter_based_runner.py", line 134, in run
    iter_runner(iter_loaders[i], **kwargs)
  File "/home/gihwan/anaconda3/envs/openmmlab2/lib/python3.7/site-packages/mmcv/runner/iter_based_runner.py", line 59, in train
    data_batch = next(data_loader)
  File "/home/gihwan/anaconda3/envs/openmmlab2/lib/python3.7/site-packages/mmcv/runner/iter_based_runner.py", line 32, in __next__
    data = next(self.iter_loader)
  File "/home/gihwan/anaconda3/envs/openmmlab2/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 521, in __next__
    data = self._next_data()
  File "/home/gihwan/anaconda3/envs/openmmlab2/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 1203, in _next_data
    return self._process_data(data)
  File "/home/gihwan/anaconda3/envs/openmmlab2/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 1229, in _process_data
    data.reraise()
  File "/home/gihwan/anaconda3/envs/openmmlab2/lib/python3.7/site-packages/torch/_utils.py", line 434, in reraise
    raise exception
av.codec.codec.UnknownCodecError: Caught UnknownCodecError in DataLoader worker process 0.
Original Traceback (most recent call last):
  File "/home/gihwan/anaconda3/envs/openmmlab2/lib/python3.7/site-packages/torch/utils/data/_utils/worker.py", line 287, in _worker_loop
    data = fetcher.fetch(index)
  File "/home/gihwan/anaconda3/envs/openmmlab2/lib/python3.7/site-packages/torch/utils/data/_utils/fetch.py", line 49, in fetch
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/home/gihwan/anaconda3/envs/openmmlab2/lib/python3.7/site-packages/torch/utils/data/_utils/fetch.py", line 49, in <listcomp>
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/home/gihwan/mmedit/mmedit/datasets/dataset_wrappers.py", line 31, in __getitem__
    return self.dataset[idx % self._ori_len]
  File "/home/gihwan/mmedit/mmedit/datasets/base_sr_dataset.py", line 52, in __getitem__
    return self.pipeline(results)
  File "/home/gihwan/mmedit/mmedit/datasets/pipelines/compose.py", line 42, in __call__
    data = t(data)
  File "/home/gihwan/mmedit/mmedit/datasets/pipelines/random_degradations.py", line 547, in __call__
    results = degradation(results)
  File "/home/gihwan/mmedit/mmedit/datasets/pipelines/random_degradations.py", line 465, in __call__
    results[key] = self._apply_random_compression(results[key])
  File "/home/gihwan/mmedit/mmedit/datasets/pipelines/random_degradations.py", line 434, in _apply_random_compression
    stream = container.add_stream(codec, rate=1)
  File "av/container/output.pyx", line 64, in av.container.output.OutputContainer.add_stream
  File "av/codec/codec.pyx", line 184, in av.codec.codec.Codec.__cinit__
  File "av/codec/codec.pyx", line 193, in av.codec.codec.Codec._init
av.codec.codec.UnknownCodecError: libx264

ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 10741) of binary: /home/gihwan/anaconda3/envs/openmmlab2/bin/python
Traceback (most recent call last):
  File "/home/gihwan/anaconda3/envs/openmmlab2/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/home/gihwan/anaconda3/envs/openmmlab2/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/gihwan/anaconda3/envs/openmmlab2/lib/python3.7/site-packages/torch/distributed/launch.py", line 193, in <module>
    main()
  File "/home/gihwan/anaconda3/envs/openmmlab2/lib/python3.7/site-packages/torch/distributed/launch.py", line 189, in main
    launch(args)
  File "/home/gihwan/anaconda3/envs/openmmlab2/lib/python3.7/site-packages/torch/distributed/launch.py", line 174, in launch
    run(args)
  File "/home/gihwan/anaconda3/envs/openmmlab2/lib/python3.7/site-packages/torch/distributed/run.py", line 713, in run
    )(*cmd_args)
  File "/home/gihwan/anaconda3/envs/openmmlab2/lib/python3.7/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/gihwan/anaconda3/envs/openmmlab2/lib/python3.7/site-packages/torch/distributed/launcher/api.py", line 261, in launch_agent
    failures=result.failures,
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

Additional information

I'm trying to train RealBasicVSR to check whether it trains in my environment.
I have a similar issue to this one, but that issue isn't resolved yet.

LeoXing1996 commented 1 year ago

Hey @gihwan-kim, this seems to be a PyAV error, since av.codec.codec.UnknownCodecError: libx264 is raised. Which PyAV version are you using?
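
You can check whether the libx264 encoder is visible to your PyAV build with a minimal sketch like this:

    import av
    from av.codec.codec import UnknownCodecError

    try:
        av.Codec('libx264', 'w')  # 'w' asks for the encoder side
        print('libx264 encoder is available')
    except UnknownCodecError:
        print('libx264 encoder is missing from this PyAV/FFmpeg build')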

gihwan-kim commented 1 year ago

> Hey @gihwan-kim, this seems to be a PyAV error, since av.codec.codec.UnknownCodecError: libx264 is raised. Which PyAV version are you using?

>>> import av
>>> av.__version__
'8.0.2'

It's version 8.0.2.

LeoXing1996 commented 1 year ago

@gihwan-kim, can you try installing PyAV==8.0.3? By the way, what is your ffmpeg version?
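
For example (assuming the PyPI av package):

    pip install av==8.0.3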

gihwan-kim commented 1 year ago

> @gihwan-kim, can you try installing PyAV==8.0.3? By the way, what is your ffmpeg version?

Thank you for the kind reply! When I changed the PyAV version to 8.0.3, the UnknownCodecError no longer appeared. (My ffmpeg version is 4.3.) But a new issue occurred:

RuntimeError: CUDA out of memory. Tried to allocate 20.00 MiB (GPU 0; 10.75 GiB total capacity; 9.28 GiB already allocated; 18.75 MiB free; 9.34 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 12491) of binary: /home/gihwan/anaconda3/envs/openmmlab2/bin/python
Traceback (most recent call last):
  File "/home/gihwan/anaconda3/envs/openmmlab2/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/home/gihwan/anaconda3/envs/openmmlab2/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/gihwan/anaconda3/envs/openmmlab2/lib/python3.7/site-packages/torch/distributed/launch.py", line 193, in <module>
    main()
  File "/home/gihwan/anaconda3/envs/openmmlab2/lib/python3.7/site-packages/torch/distributed/launch.py", line 189, in main
    launch(args)
  File "/home/gihwan/anaconda3/envs/openmmlab2/lib/python3.7/site-packages/torch/distributed/launch.py", line 174, in launch
    run(args)
  File "/home/gihwan/anaconda3/envs/openmmlab2/lib/python3.7/site-packages/torch/distributed/run.py", line 713, in run
    )(*cmd_args)
  File "/home/gihwan/anaconda3/envs/openmmlab2/lib/python3.7/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/gihwan/anaconda3/envs/openmmlab2/lib/python3.7/site-packages/torch/distributed/launcher/api.py", line 261, in launch_agent
    failures=result.failures,
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

I think the training needs a large amount of memory. Should I change the training configuration?

This is the output of the nvidia-smi command:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.141.03   Driver Version: 470.141.03   CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  Off  | 00000000:01:00.0 Off |                  N/A |
| 35%   29C    P8     1W / 260W |    102MiB / 11011MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      1116      G   /usr/lib/xorg/Xorg                 39MiB |
|    0   N/A  N/A      1243      G   /usr/bin/gnome-shell               60MiB |
+-----------------------------------------------------------------------------+

LeoXing1996 commented 1 year ago

@gihwan-kim, training RealBasicVSR with the default config needs at least 17201MB of GPU memory; you can refer to the memory field in the training log.

I think you can try changing crop_size in the training pipeline to a smaller value to save memory.
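
For example, a hypothetical excerpt (the transform name and its position in the pipeline may differ in your config):

    train_pipeline = [
        # ... loading and augmentation transforms ...
        dict(type='Crop', keys=['gt'], crop_size=(128, 128),  # e.g. 256 -> 128
             random_crop=True),
        # ... degradation and formatting transforms ...
    ]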

gihwan-kim commented 1 year ago

> @gihwan-kim, training RealBasicVSR with the default config needs at least 17201MB of GPU memory; you can refer to the memory field in the training log.
>
> I think you can try changing crop_size in the training pipeline to a smaller value to save memory.

I was able to solve it by changing the workers_per_gpu, samples_per_gpu, and num_input_frames values in the config file (a sketch of those fields follows the traceback below). Thank you! But while training, I ran into a FileNotFoundError:

FileNotFoundError: Caught FileNotFoundError in DataLoader worker process 0.
Original Traceback (most recent call last):
  File "/home/gihwan/anaconda3/envs/openmmlab2/lib/python3.7/site-packages/torch/utils/data/_utils/worker.py", line 287, in _worker_loop
    data = fetcher.fetch(index)
  File "/home/gihwan/anaconda3/envs/openmmlab2/lib/python3.7/site-packages/torch/utils/data/_utils/fetch.py", line 49, in fetch
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/home/gihwan/anaconda3/envs/openmmlab2/lib/python3.7/site-packages/torch/utils/data/_utils/fetch.py", line 49, in <listcomp>
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/home/gihwan/mmedit/mmedit/datasets/base_sr_dataset.py", line 52, in __getitem__
    return self.pipeline(results)
  File "/home/gihwan/mmedit/mmedit/datasets/pipelines/compose.py", line 42, in __call__
    data = t(data)
  File "/home/gihwan/mmedit/mmedit/datasets/pipelines/loading.py", line 176, in __call__
    img_bytes = self.file_client.get(filepath)
  File "/home/gihwan/anaconda3/envs/openmmlab2/lib/python3.7/site-packages/mmcv/fileio/file_client.py", line 993, in get
    return self.client.get(filepath)
  File "/home/gihwan/anaconda3/envs/openmmlab2/lib/python3.7/site-packages/mmcv/fileio/file_client.py", line 518, in get
    with open(filepath, 'rb') as f:
FileNotFoundError: [Errno 2] No such file or directory: 'data/UDM10/BIx4/archpeople/00000000.png'
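
For reference, here is a hypothetical excerpt of the fields I lowered (values are examples; the exact nesting follows the master-branch config, where train wraps the dataset in a RepeatDataset):

    data = dict(
        workers_per_gpu=2,                          # fewer dataloader workers
        train_dataloader=dict(samples_per_gpu=1),   # smaller per-GPU batch
        train=dict(
            dataset=dict(num_input_frames=10)))     # fewer frames per sample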

My UDM10 data file path is 'data/UDM10/BIx4/archpeople/000.png'. Is there a naming rule or a guide for preprocessing the UDM10 files into BIx4 for RealBasicVSR? I found guides for REDS and VideoLQ in the paper and the official documentation, but I couldn't find where to download the UDM10 data or how to preprocess it. I only found a place to download it from this link: udm10

Z-Fran commented 1 year ago

@gihwan-kim For the master branch, you need to rename the images of your dataset. You can use a simple script to resolve it, like this:

    import os

    data_root = 'dataset/data/udm10/'
    save_root = 'dataset/data/UDM10/'

    # The GT/ and BDx4/ parent folders must exist before copying into them.
    os.makedirs(save_root + 'GT/', exist_ok=True)
    os.makedirs(save_root + 'BDx4/', exist_ok=True)

    # Copy each sequence into zero-padded folders: 00000000, 00000001, ...
    dirs = sorted(os.listdir(data_root), key=str.lower)
    for num, _dir in enumerate(dirs):
        sub_root1 = save_root + 'GT/' + str(num).zfill(8)
        sub_root2 = save_root + 'BDx4/' + str(num).zfill(8)
        os.system('cp -r ' + data_root + _dir + '/truth/ ' + sub_root1)
        os.system('cp -r ' + data_root + _dir + '/blur4/ ' + sub_root2)
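
After running it, each sequence lands in a zero-padded folder, e.g. dataset/data/UDM10/GT/00000000/ and dataset/data/UDM10/BDx4/00000000/ for archpeople.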

For the 1.x or dev-1.x branch, if your UDM10 data file path is 'data/UDM10/BIx4/archpeople/000.png', you can simply add a parameter like filename_tmpl='{:03d}.png'. See https://github.com/open-mmlab/mmediting/blob/dev-1.x/configs/real_basicvsr/realbasicvsr_wogan-c64b20-2x30x8_8xb2-lr1e-4-300k_reds.py#L204
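
A hypothetical sketch of where that parameter would go, following the linked config (check your local copy for the exact field names):

    val_dataloader = dict(
        dataset=dict(
            type='BasicFramesDataset',
            data_root='data/UDM10',
            data_prefix=dict(img='BIx4', gt='GT'),
            filename_tmpl='{:03d}.png'))  # matches 000.png, 001.png, ...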

gihwan-kim commented 1 year ago

> @gihwan-kim For the master branch, you need to rename the images of your dataset. You can use a simple script to resolve it, like this:
>
>     import os
>
>     data_root = 'dataset/data/udm10/'
>     save_root = 'dataset/data/UDM10/'
>
>     # The GT/ and BDx4/ parent folders must exist before copying into them.
>     os.makedirs(save_root + 'GT/', exist_ok=True)
>     os.makedirs(save_root + 'BDx4/', exist_ok=True)
>
>     # Copy each sequence into zero-padded folders: 00000000, 00000001, ...
>     dirs = sorted(os.listdir(data_root), key=str.lower)
>     for num, _dir in enumerate(dirs):
>         sub_root1 = save_root + 'GT/' + str(num).zfill(8)
>         sub_root2 = save_root + 'BDx4/' + str(num).zfill(8)
>         os.system('cp -r ' + data_root + _dir + '/truth/ ' + sub_root1)
>         os.system('cp -r ' + data_root + _dir + '/blur4/ ' + sub_root2)
>
> For the 1.x or dev-1.x branch, if your UDM10 data file path is 'data/UDM10/BIx4/archpeople/000.png', you can simply add a parameter like filename_tmpl='{:03d}.png'. See https://github.com/open-mmlab/mmediting/blob/dev-1.x/configs/real_basicvsr/realbasicvsr_wogan-c64b20-2x30x8_8xb2-lr1e-4-300k_reds.py#L204

Thank you for your kind help! As I mentioned, I have a question about the UDM10 validation dataset for RealBasicVSR. I downloaded the udm10 dataset from this udm10 download site. The site's udm10 directory structure is:

./udm10
├── archpeople
│   ├── blur4
│   └── truth
├── archwall
│   ├── blur4
│   └── truth
├── auditorium
│   ├── blur4
│   └── truth
├── band
│   ├── blur4
│   └── truth
├── caffe
│   ├── blur4
│   └── truth
├── camera
│   ├── blur4
│   └── truth
├── clap
│   ├── blur4
│   └── truth
├── lake
│   ├── blur4
│   └── truth
├── photography
│   ├── blur4
│   └── truth
└── polyflow
    ├── blur4
    └── truth

Is the blur4 data the same as BIx4? Or do I have to do pre-processing as in this link?

And does BIx4 mean bicubic interpolation x4 downsampling?

Z-Fran commented 1 year ago

Blur4 is not BIx4 or BDx4; BIx4 and BDx4 are both pre-processed using MATLAB. For BDx4, you need to use the MATLAB script https://github.com/ckkelvinchan/BasicVSR-IconVSR/blob/main/BD_degradation.m. For BIx4, you can simply use MATLAB's imresize to generate the data, or a Python implementation of it such as https://github.com/fatheral/matlab_imresize/blob/master/imresize.py. I can provide my data if you need it. And yes, BIx4 means bicubic interpolation x4 downsampling. @gihwan-kim
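
If it helps, here is a rough sketch of generating BIx4 frames with that Python port (hypothetical paths; it assumes imresize.py from the linked repo is importable, and depending on its version you may need a dtype conversion before saving):

    import os
    import cv2
    from imresize import imresize  # from fatheral/matlab_imresize

    gt_root = 'data/UDM10/GT'
    lq_root = 'data/UDM10/BIx4'
    for seq in sorted(os.listdir(gt_root)):
        os.makedirs(os.path.join(lq_root, seq), exist_ok=True)
        for name in sorted(os.listdir(os.path.join(gt_root, seq))):
            img = cv2.imread(os.path.join(gt_root, seq, name))
            # x4 bicubic downsampling, matching MATLAB's imresize behaviour
            lq = imresize(img, scalar_scale=0.25)
            cv2.imwrite(os.path.join(lq_root, seq, name), lq)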

gihwan-kim commented 1 year ago

> Blur4 is not BIx4 or BDx4; BIx4 and BDx4 are both pre-processed using MATLAB. For BDx4, you need to use the MATLAB script https://github.com/ckkelvinchan/BasicVSR-IconVSR/blob/main/BD_degradation.m. For BIx4, you can simply use MATLAB's imresize to generate the data, or a Python implementation of it such as https://github.com/fatheral/matlab_imresize/blob/master/imresize.py. I can provide my data if you need it. And yes, BIx4 means bicubic interpolation x4 downsampling. @gihwan-kim

I will check the imresize code you mentioned, thank you! :) If it's okay with you, I would be grateful if you could send your data.

Z-Fran commented 1 year ago

https://drive.google.com/file/d/1G4V4KZZhhfzUlqHiSBBuWyqLyIOvOs0W/view?usp=share_link @gihwan-kim

gihwan-kim commented 1 year ago

> https://drive.google.com/file/d/1G4V4KZZhhfzUlqHiSBBuWyqLyIOvOs0W/view?usp=share_link @gihwan-kim

Thank you!