kan-bayashi / ParallelWaveGAN

Unofficial Parallel WaveGAN (+ MelGAN & Multi-band MelGAN & HiFi-GAN & StyleMelGAN) with Pytorch
https://kan-bayashi.github.io/ParallelWaveGAN/
MIT License

The program crashed during training #400

Closed zzxiang closed 1 year ago

zzxiang commented 1 year ago

Crash Description

The following exception was thrown in train.log. It seems the training program crashed because some wav files in the training data are too short. (I set remove_short_samples to false in parallel_wavegan.v1.yaml.)

However, the same config and data do not crash when n_gpus is set to 1 in run.sh, so it may have something to do with distributed training.

2023-03-15 12:19:52,873 (train:317) INFO: (Steps: 5542) Finished 326 epoch training (17 steps per epoch).
[train]:   3%|▎         | 5559/200000 [1:40:28<50:52:14,  1.06it/s]/home/sshu/pwg/parallel_wavegan/bin/train.py:600: UserWarning: Creating a tensor from a list of numpy.ndarrays is extremely slow. Please consider converting the list to a single numpy.ndarray with numpy.array() before converting to a tensor. (Triggered internally at  ../torch/csrc/utils/tensor_new.cpp:210.)
  y_batch = torch.tensor(y_batch, dtype=torch.float).unsqueeze(1)  # (B, 1, T)
/home/sshu/pwg/parallel_wavegan/bin/train.py:600: UserWarning: Creating a tensor from a list of numpy.ndarrays is extremely slow. Please consider converting the list to a single numpy.ndarray with numpy.array() before converting to a tensor. (Triggered internally at  ../torch/csrc/utils/tensor_new.cpp:210.)
  y_batch = torch.tensor(y_batch, dtype=torch.float).unsqueeze(1)  # (B, 1, T)
/home/sshu/pwg/parallel_wavegan/bin/train.py:600: UserWarning: Creating a tensor from a list of numpy.ndarrays is extremely slow. Please consider converting the list to a single numpy.ndarray with numpy.array() before converting to a tensor. (Triggered internally at  ../torch/csrc/utils/tensor_new.cpp:210.)
  y_batch = torch.tensor(y_batch, dtype=torch.float).unsqueeze(1)  # (B, 1, T)
/home/sshu/pwg/parallel_wavegan/bin/train.py:600: UserWarning: Creating a tensor from a list of numpy.ndarrays is extremely slow. Please consider converting the list to a single numpy.ndarray with numpy.array() before converting to a tensor. (Triggered internally at  ../torch/csrc/utils/tensor_new.cpp:210.)
  y_batch = torch.tensor(y_batch, dtype=torch.float).unsqueeze(1)  # (B, 1, T)
2023-03-15 12:20:11,153 (train:317) INFO: (Steps: 5559) Finished 327 epoch training (17 steps per epoch).
[train]:   3%|▎         | 5575/200000 [1:40:46<61:09:19,  1.13s/it]2023-03-15 12:20:29,317 (train:1087) INFO: Successfully saved checkpoint @ 5575steps.
Traceback (most recent call last):
  File "/home/sshu/pwg/tools/venv/bin/parallel-wavegan-train", line 11, in <module>
    load_entry_point('parallel-wavegan', 'console_scripts', 'parallel-wavegan-train')()
  File "/home/sshu/pwg/parallel_wavegan/bin/train.py", line 1082, in main
    trainer.run()
  File "/home/sshu/pwg/parallel_wavegan/bin/train.py", line 96, in run
    self._train_epoch()
  File "/home/sshu/pwg/parallel_wavegan/bin/train.py", line 300, in _train_epoch
    for train_steps_per_epoch, batch in enumerate(self.data_loader["train"], 1):
  File "/home/sshu/pwg/tools/venv/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 530, in __next__
    data = self._next_data()
  File "/home/sshu/pwg/tools/venv/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1224, in _next_data
    return self._process_data(data)
  File "/home/sshu/pwg/tools/venv/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1250, in _process_data
    data.reraise()
  File "/home/sshu/pwg/tools/venv/lib/python3.8/site-packages/torch/_utils.py", line 457, in reraise
    raise exception
IndexError: Caught IndexError in DataLoader worker process 0.
Original Traceback (most recent call last):
  File "/home/sshu/pwg/tools/venv/lib/python3.8/site-packages/torch/utils/data/_utils/worker.py", line 287, in _worker_loop
    data = fetcher.fetch(index)
  File "/home/sshu/pwg/tools/venv/lib/python3.8/site-packages/torch/utils/data/_utils/fetch.py", line 52, in fetch
    return self.collate_fn(data)
  File "/home/sshu/pwg/parallel_wavegan/bin/train.py", line 601, in __call__
    c_batch = torch.tensor(c_batch, dtype=torch.float).transpose(2, 1)  # (B, C, T')
IndexError: Dimension out of range (expected to be in range of [-1, 0], but got 2)

[train]:   3%|▎         | 5575/200000 [1:40:47<58:34:56,  1.08s/it]Traceback (most recent call last):
  File "/usr/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/home/sshu/pwg/parallel_wavegan/distributed/launch.py", line 177, in <module>
    main()
  File "/home/sshu/pwg/parallel_wavegan/distributed/launch.py", line 173, in main
    raise subprocess.CalledProcessError(returncode=process.returncode, cmd=cmd)
subprocess.CalledProcessError: Command '['parallel-wavegan-train', '--local_rank=1', '--config', 'conf/parallel_wavegan.v1.yaml', '--train-dumpdir', 'dump/train_nodev/norm', '--dev-dumpdir', 'dump/dev/norm', '--outdir', 'exp/train_nodev_parallel_wavegan.v1_train_nodev_jsut_parallel_wavegan.v1', '--resume', '', '--pretrain', '/home/sshu/pwg_out/train_nodev_jsut_parallel_wavegan.v1/checkpoint-400000steps.pkl', '--verbose', '1']' returned non-zero exit status 1.
# Accounting: time=6088 threads=1
# Ended (code 1) at Wed Mar 15 12:20:30 JST 2023, elapsed time 6088 seconds
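
If I read the traceback correctly, the IndexError itself is easy to reproduce outside the trainer. Here is a minimal sketch (my own, not code from the repo) of what seems to happen when every sample in a batch is dropped for being shorter than batch_max_steps, so the collated mel list comes out empty:

import torch

# Hypothetical: the list the collater builds ends up empty because all
# samples in the batch were dropped as too short.
c_batch = []
c = torch.tensor(c_batch, dtype=torch.float)  # 1-D tensor of shape (0,)
c.transpose(2, 1)  # IndexError: Dimension out of range (expected to be in range of [-1, 0], but got 2)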

Training Description

The data contains 200 wav files: 198 are used as the training set, and the remaining 2 are used as both the dev and eval sets.

n_gpus is set to 2 since I've got 2 GPUs.

Here are the relevant parameters in run.sh:

n_gpus=2
n_jobs=2

# ...

# subset setting
shuffle=false # whether to shuffle the data to create subset
num_dev=2   # the number of development data
num_eval=0  # the number of evaluation data
              # (if set to 0, the same dev set is used as eval set)

I used the parallel_wavegan.v1.yaml that comes with the pretrained JSUT PWG model, with remove_short_samples set to false.

Environment

ParallelWaveGAN: v0.5.5
Python: 3.8.10

The output of pip list:

Package            Version      Editable project location
------------------ ------------ -------------------------
apex               0.1
appdirs            1.4.4
argcomplete        2.0.0
attrs              21.4.0
audioread          2.1.9
beautifulsoup4     4.11.1
black              22.3.0
certifi            2021.10.8
cffi               1.15.0
charset-normalizer 2.0.12
click              8.1.2
cycler             0.11.0
decorator          5.1.1
filelock           3.6.0
flake8             3.8.4
flake8-docstrings  1.6.0
fonttools          4.33.1
gdown              4.4.0
h5py               3.6.0
hacking            4.1.0
idna               3.3
iniconfig          1.1.1
joblib             1.1.0
kaldiio            2.17.2
kiwisolver         1.4.2
librosa            0.9.1
llvmlite           0.38.0
matplotlib         3.5.1
mccabe             0.6.1
mypy-extensions    0.4.3
numba              0.55.1
numpy              1.21.6
packaging          21.3
parallel-wavegan   0.5.5        /home/sshu/pwg
pathspec           0.9.0
Pillow             9.1.0
pip                22.3.1
pkg_resources      0.0.0
platformdirs       2.5.2
pluggy             1.0.0
pooch              1.6.0
protobuf           3.20.1
py                 1.11.0
pycodestyle        2.6.0
pycparser          2.21
pydocstyle         6.1.1
pyflakes           2.2.0
pyparsing          3.0.8
PySocks            1.7.1
pytest             7.1.1
python-dateutil    2.8.2
PyYAML             6.0
requests           2.27.1
resampy            0.2.2
scikit-learn       1.0.2
scipy              1.8.0
setuptools         44.0.0
six                1.16.0
snowballstemmer    2.2.0
SoundFile          0.10.3.post1
soupsieve          2.3.2.post1
tensorboardX       2.5
threadpoolctl      3.1.0
toml               0.10.2
tomli              2.0.1
torch              1.11.0+cu113
tqdm               4.64.0
typing_extensions  4.2.0
urllib3            1.26.9
xmltodict          0.12.0
yq                 2.14.0

The output of nvidia-smi:

$ nvidia-smi
Wed Mar 15 13:56:26 2023       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.85.12    Driver Version: 525.85.12    CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla T4            On   | 00000000:00:04.0 Off |                    0 |
| N/A   47C    P8    10W /  70W |      2MiB / 15360MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  Tesla T4            On   | 00000000:00:05.0 Off |                    0 |
| N/A   58C    P8    10W /  70W |      2MiB / 15360MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

The output of nvcc -V:

$ nvcc -V
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2022 NVIDIA Corporation
Built on Tue_Mar__8_18:18:20_PST_2022
Cuda compilation tools, release 11.6, V11.6.124
Build cuda_11.6.r11.6/compiler.31057947_0
kan-bayashi commented 1 year ago

Thank you for your detailed report. This crash can happen when all of the samples in a batch are shorter than batch_max_steps. When using multiple GPUs, each GPU's batch size becomes batch_size / #gpus, i.e., the actual batch size on each GPU is reduced, so the issue is more likely to occur. In conclusion, this is expected behavior.

To solve this problem, please set remove_short_samples=true or reduce batch_max_steps.
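
If it helps to decide between the two, a quick standalone check (not part of this repo; the wav directory and the batch_max_steps value below are placeholders) can list which clips are shorter than batch_max_steps samples and would therefore be at risk of being skipped:

import glob
import soundfile as sf

batch_max_steps = 25600  # placeholder: use the value from your parallel_wavegan.v1.yaml
for path in sorted(glob.glob("path/to/wavs/*.wav")):  # placeholder directory
    wav, sr = sf.read(path)
    if len(wav) < batch_max_steps:
        print(f"{path}: {len(wav)} samples (< {batch_max_steps})")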

zzxiang commented 1 year ago

Thanks for your quick reply!

I'll try reducing batch_max_steps, since the dataset is already small and I want to keep as much of the data as possible.

But I am still a little curious about two more questions.

  1. Does reducing batch_max_steps have a negative effect on training accuracy?

  2. The crash actually happened after the 326th training epoch had finished. But if a per-GPU batch made up entirely of samples shorter than batch_max_steps will definitely cause a crash, why didn't the crash happen in the first epoch?

kan-bayashi commented 1 year ago

Does reducing batch_max_steps have a negative effect on training accuracy?

In my experience, the effect is limited (I tried halving and doubling it). But making it too short may not be suitable.

But if a per-GPU batch made up entirely of samples shorter than batch_max_steps will definitely cause a crash, why didn't the crash happen in the first epoch?

This is because the data is shuffled and new batches are created in each epoch. The crash only happens when, unfortunately, a batch contains nothing but short samples.
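
As a rough illustration with made-up numbers (not measured on this dataset): if a fraction p of the clips is shorter than batch_max_steps, a per-GPU batch of B samples is all-short with probability about p ** B, which is small per batch but adds up over many epochs:

# Made-up numbers, illustrative only.
p = 0.05                 # hypothetical fraction of short clips
for B in (6, 3):         # e.g. the full batch on 1 GPU vs. half of it per GPU on 2 GPUs
    print(B, p ** B)     # prints roughly 1.6e-08 for B=6 and 1.25e-04 for B=3

With roughly 17 steps per epoch on 2 GPUs, a per-batch probability on the order of 1e-4 would typically only bite after a few hundred epochs, which is the same order of magnitude as the crash appearing around epoch 326; the real numbers depend on how many of the 198 training clips are actually short.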

zzxiang commented 1 year ago

Thank you very much! Very easy to understand.