mozilla / TTS

:robot: :speech_balloon: Deep learning for Text to Speech (Discussion forum: https://discourse.mozilla.org/c/tts)

Getting error "RuntimeError: CUDA error: device-side assert triggered" in epoch 0 on WaveGrad vocoder training #581

Closed thorstenMueller closed 3 years ago

thorstenMueller commented 3 years ago

First things first: thanks @nmstoker for providing the great Gather Up tool :+1:.

Details

While in epoch 0, I get the following error:

python ./TTS/bin/train_vocoder_wavegrad.py --config_path ./TTS/vocoder/configs/wavegrad_thorsten.json 
 > Using CUDA:  True
 > Number of GPUs:  1
   >  Mixed precision is enabled
 > Git Hash: ac46c3f
 > Experiment folder: /home/thorsten/___prj/tts/models/vocoder/wavegrad/mozilla/wavegrad-model-output/wavegrad-thorsten-November-30-2020_01+03PM-ac46c3f
 > Loading wavs from: /home/thorsten/___prj/tts/datasets/thorsten-de_v02/
 > Setting up Audio Processor...
 | > sample_rate:22050
 | > resample:False
 | > num_mels:80
 | > min_level_db:-100
 | > frame_shift_ms:None
 | > frame_length_ms:None
 | > ref_level_db:20
 | > fft_size:1024
 | > power:None
 | > preemphasis:0.0
 | > griffin_lim_iters:None
 | > signal_norm:True
 | > symmetric_norm:True
 | > mel_fmin:0
 | > mel_fmax:8000.0
 | > spec_gain:20.0
 | > stft_pad_mode:reflect
 | > max_norm:1.0
 | > clip_norm:True
 | > do_trim_silence:True
 | > trim_db:60
 | > do_sound_norm:False
 | > stats_path:/home/thorsten/___prj/tts/models/vocoder/wavegrad/mozilla/taco2-files/taco2-scale_stats.npy
 | > hop_length:256
 | > win_length:1024
 > Generator Model: wavegrad
 > WaveGrad has 15827106 parameters

 > EPOCH: 0/10000

 > TRAINING (2020-11-30 13:04:15) 
/media/nvidia/WD_NVME/PyTorch/JetPack_4.4/GA/pytorch-v1.6.0/aten/src/ATen/native/cuda/IndexKernel.cu:84: operator(): block: [0,0,0], thread: [47,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
 ! Run is removed from /home/thorsten/___prj/tts/models/vocoder/wavegrad/mozilla/wavegrad-model-output/wavegrad-thorsten-November-30-2020_01+03PM-ac46c3f
Traceback (most recent call last):
  File "./TTS/bin/train_vocoder_wavegrad.py", line 504, in <module>
    main(args)
  File "./TTS/bin/train_vocoder_wavegrad.py", line 401, in main
    epoch)
  File "./TTS/bin/train_vocoder_wavegrad.py", line 116, in train
    noise, x_noisy, noise_scale = model.compute_y_n(x)
  File "/home/thorsten/___prj/tts/models/vocoder/wavegrad/mozilla/lib/python3.6/site-packages/TTS-0.0.7+ac46c3f-py3.6-linux-aarch64.egg/TTS/vocoder/models/wavegrad.py", line 110, in compute_y_n
    noise_scale = l_a + torch.rand(y_0.shape[0]).to(y_0) * (l_b - l_a)
RuntimeError: CUDA error: device-side assert triggered
(mozilla) thorsten@nvidia-agx:~/___prj/tts/models/vocoder/wavegrad/mozilla/TTS$ 

Platform OS

Python Environment

Package Installation

- generated at 13:31 on Nov 30 2020 using Gather Up tool :gift:

erogol commented 3 years ago

It is probably an OOM error surfacing in a different form.

thorstenMueller commented 3 years ago

That would be the most obvious cause, but I tried decreasing the batch size from 96 (default) down to 8 with no effect. I also tried changing the LR from 1e-4 (default) to 2e-4 (the default value in freds0's WaveGrad implementation).

I initially set "stats_path" to the "taco2-scale_stats.npy" from our Tacotron2 model training (is this a good idea in general?). If I leave "stats_path" empty, I receive a warning before training fails with the error written above:

 > TRAINING (2020-11-30 15:32:13) 
/home/thorsten/___prj/tts/models/vocoder/wavegrad/mozilla/lib/python3.6/site-packages/torch/optim/lr_scheduler.py:123: UserWarning: Detected call of `lr_scheduler.step()` before `optimizer.step()`. In PyTorch 1.1.0 and later, you should call them in the opposite order: `optimizer.step()` before `lr_scheduler.step()`.  Failure to do this will result in PyTorch skipping the first value of the learning rate schedule. See more details at https://pytorch.org/docs/stable/optim.html#how-to-adjust-learning-rate
  "https://pytorch.org/docs/stable/optim.html#how-to-adjust-learning-rate", UserWarning)
nmstoker commented 3 years ago

I realise each case can be different, but I'm pretty sure it went okay for me using the WaveGrad in TTS with the LR at 1e-4, although I did hit some minor problems when continuing training (it looks like those aren't quite what you have here, though).

For the original error, "CUDA error: device-side assert triggered", I think you may be able to get more detail if you run it again with: CUDA_LAUNCH_BLOCKING="1"

thorstenMueller commented 3 years ago

Running with

CUDA_LAUNCH_BLOCKING="1"

didn't provide more debug output.

I ran "compute_statistics.py" with my WaveGrad vocoder config file and it produced the output below. I used the .npy file as stats_path in the vocoder config.

 > There are 22672 files.
100%|████████████████████████████████████████████████████████████████████████████████████████| 22672/22672 [15:03<00:00, 25.09it/s]
 > Avg mel spec mean: -50.53502474642339
 > Avg mel spec scale: 16.255307995251243
 > Avg linear spec mean: -36.591325638551716
 > Avg lienar spec scale: 14.408225508873507
 > stats saved to /tmp/thorsten-wavegrad-vocoder-stats.npy

I think maybe my torch version (1.6) has an influence on that. Trying torch version 1.7 next.

lexkoro commented 3 years ago

Did you run it using CUDA_LAUNCH_BLOCKING=1 python ./TTS/bin/train_vocoder_wavegrad.py --config_path ./TTS/vocoder/configs/wavegrad_thorsten.json? This makes CUDA kernel launches synchronous, so the stack trace should point at the operation that actually fails.

thorstenMueller commented 3 years ago

Yes, @SanjaESC, I ran the above command, but it didn't give any more detail in the output. I also tried "export CUDA_LAUNCH_BLOCKING=1" without success.

I'm asking myself whether I'm doing something wrong in general, since it seems to work for all of you while I'm struggling with it. Is there a recommended torch version?

(mozilla) thorsten@nvidia-agx:~/___prj/tts/models/vocoder/wavegrad/mozilla/TTS$ CUDA_LAUNCH_BLOCKING=1 python ./TTS/bin/train_vocoder_wavegrad.py --config_path ./TTS/vocoder/configs/wavegrad_thorsten.json
 > Using CUDA:  True
 > Number of GPUs:  1
   >  Mixed precision is enabled
 > Git Hash: ac46c3f
 > Experiment folder: /home/thorsten/___prj/tts/models/vocoder/wavegrad/mozilla/wavegrad-model-output/wavegrad-thorsten-December-01-2020_07+11AM-ac46c3f
thorstenMueller commented 3 years ago

Btw, this is what happens on torch 1.7:

 > EPOCH: 0/10000

 > TRAINING (2020-12-01 07:11:57) 
 ! Run is removed from /home/thorsten/___prj/tts/models/vocoder/wavegrad/mozilla/wavegrad-model-output/wavegrad-thorsten-December-01-2020_07+11AM-ac46c3f
Traceback (most recent call last):
  File "./TTS/bin/train_vocoder_wavegrad.py", line 504, in <module>
    main(args)
  File "./TTS/bin/train_vocoder_wavegrad.py", line 401, in main
    epoch)
  File "./TTS/bin/train_vocoder_wavegrad.py", line 127, in train
    raise RuntimeError(f'Detected NaN loss at step {global_step}.')
RuntimeError: Detected NaN loss at step 1.

I tried a workaround from this WaveGrad repo (https://github.com/ivanvovk/WaveGrad/issues/8#issuecomment-706800710).

erogol commented 3 years ago

It is the latest dev branch, right?

thorstenMueller commented 3 years ago

Yes. git pull: Already up to date

(Screenshot attached: Screenshot_20201201_151203)

erogol commented 3 years ago

It also happens on my side, but a smaller LR fixed the NaN / jumping losses.

thorstenMueller commented 3 years ago

Changing the LR value didn't help. I tried different values for mixed_precision (true/false), gamma, and pad_short without success.

I thought maybe it's because I have some audio files in my dataset that are too short, so I copied the 500 largest audio recordings into a mini-dataset, but that didn't change anything.

Next I'd try another dataset to check whether my setup or my dataset is the problem. If anybody is interested and has time left, here's the download for the dataset that causes the problems: https://drive.google.com/file/d/1mGWfG0s2V2TEg-AI2m85tze1m4pyeM7b/view?usp=sharing

lexkoro commented 3 years ago

Next I'd try another dataset to check whether my setup or my dataset is the problem. If anybody is interested and has time left, here's the download for the dataset that causes the problems: https://drive.google.com/file/d/1mGWfG0s2V2TEg-AI2m85tze1m4pyeM7b/view?usp=sharing

I've had the same RuntimeError: CUDA error: device-side assert triggered trying to train on your data.

I think the problem stems from here https://github.com/mozilla/TTS/blob/ac46c3ff4ce7ae85d99b1b40dde0802e31de5767/TTS/vocoder/models/wavegrad.py#L108

There is a chance of getting an index that will fail in the next line: https://github.com/mozilla/TTS/blob/ac46c3ff4ce7ae85d99b1b40dde0802e31de5767/TTS/vocoder/models/wavegrad.py#L109

So changing line 108 to s = torch.randint(0, self.num_steps, [y_0.shape[0]]) worked, and it seems to train fine.
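
For context, here's a small standalone reproduction of the indexing problem outside of training (just a sketch; num_steps, batch_size and the noise_level values are hypothetical stand-ins for what compute_y_n uses):

    import torch

    num_steps = 1000                                   # length of the noise schedule
    batch_size = 32
    noise_level = torch.rand(num_steps)                # stand-in for self.noise_level

    # the current code draws s in [1, num_steps] and then indexes noise_level[s]
    s = torch.randint(1, num_steps + 1, [batch_size])
    s[0] = num_steps                                   # force the boundary value that can occur randomly
    l_a, l_b = noise_level[s - 1], noise_level[s]      # IndexError on CPU, device-side assert on CUDA

On the GPU this surfaces as the device-side assert from the original report.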

(Attached: wavegrad_thorsten.zip)

erogol commented 3 years ago

I'm confused. What is the issue here? NaN loss or indexing problem?

thorstenMueller commented 3 years ago

Thanks @SanjaESC for figuring out the problem 👍. Did you cancel training or is it still running? @erogol, I started with torch 1.6.0 and ran into the index problem. After that I tried training with torch 1.7 and ran into the NaN problem.

If I understand @SanjaESC right, I'll switch back to torch 1.6, apply the "fix" in line 108 for my dataset, and training should work then.

lexkoro commented 3 years ago

@thorstenMueller I've stopped it. @erogol For me it was the indexing problem.

More info:

erogol commented 3 years ago

But if you do this:

s = torch.randint(0, self.num_steps, [y_0.shape[0]])

the next line could get index -1, since it generates values in a range that includes 0.

lexkoro commented 3 years ago

@erogol I haven't really investigated the code yet, but the error only appears when there is a value of 1000 inside the tensor s. I guess that's because the noise schedule only covers indices 0-999.

So s = torch.randint(1, self.num_steps, [y_0.shape[0]]) should be sufficient here, but I'm not sure how it might impact model performance.

But thinking about it, wouldn't -1 just be the last value of the list? I guess that's why it also works.
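
A quick standalone check of that wrap-around behaviour (hypothetical schedule values, just to illustrate):

    import torch

    noise_level = torch.linspace(1e-6, 1e-2, 1000)     # stand-in for self.noise_level
    s = torch.zeros(4, dtype=torch.long)               # the edge case of randint(0, num_steps, ...): s == 0
    print(noise_level[s - 1])                          # index -1 wraps to the last element, so no error is raised,
                                                       # but l_a then comes from the opposite end of the schedule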

erogol commented 3 years ago

OK, then I'll just wait for @thorstenMueller to debug it, I guess :)

lexkoro commented 3 years ago

Taking another look, this should work correctly without getting a -1 index and still cover the whole noise_level range.

    s = torch.randint(0, self.num_steps-1, [y_0.shape[0]])
    l_a, l_b = self.noise_level[s], self.noise_level[s+1]
thorstenMueller commented 3 years ago

Thanks a lot @SanjaESC. I started a training run with your first/previous workaround about an hour ago. I thought about adding some debug output when s goes below 0, to get an idea how often and at which steps this could occur.

But this doesn't work, because s is a tensor and comparing it against 0 in an if statement raises an error.

        s = torch.randint(0, self.num_steps, [y_0.shape[0]])
        if s < 0:
            print("s value is {} at step {}".format(s, self.num_steps))
tensor([645, 306, 965, 393, 343, 920, 204, 108, 845, 619, 210, 198, 117, 233,
        581, 945, 239, 423, 846, 381, 876,  77, 483, 444, 487, 966, 357, 117,
        591, 402, 994, 209])
tensor([ 64, 638, 265, 858, 772, 884, 450, 627, 821,   7, 971, 359, 157,  36,
        611, 264, 369, 260, 490, 315, 237, 105,  48, 825, 111, 371, 584,   1,
        969, 286, 853, 163])
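
(To do that check over a whole tensor, the element-wise comparison would need a reduction such as .any(); just a sketch based on the same snippet, so self.num_steps and y_0 are as above:)

        # s is a whole tensor, so `if s < 0:` raises
        # "Boolean value of Tensor with more than one element is ambiguous";
        # reduce the element-wise comparison first, e.g. to spot the samples where s - 1 wraps to -1:
        s = torch.randint(0, self.num_steps, [y_0.shape[0]])
        if (s == 0).any():
            print("s hits 0 in {} of {} samples".format(int((s == 0).sum()), s.shape[0]))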

So what do you think: should I cancel the current training, apply your newest code change and restart, or keep training running to get an idea of whether the first workaround fails at some point?

erogol commented 3 years ago

Taking another look, this should work correctly without getting a -1 index and still cover the whole noise_level range.

    s = torch.randint(0, self.num_steps-1, [y_0.shape[0]])
    l_a, l_b = self.noise_level[s], self.noise_level[s+1]

This is the same as

s = torch.randint(1, self.num_steps + 1, [y_0.shape[0]])
 l_a, l_b = self.noise_level[s-1], self.noise_level[s] 
lexkoro commented 3 years ago

Well, it shouldn't be the same.

The current implementation returns a tensor with values between 1 and 1000: s = torch.randint(1, self.num_steps + 1, [y_0.shape[0]])

self.noise_level is of length 1000.

Trying to index self.noise_level at 1000 will be out of bounds and result in an error: l_a, l_b = self.noise_level[s-1], self.noise_level[s]

while this, s = torch.randint(0, self.num_steps - 1, [y_0.shape[0]]), returns a tensor with values between 0 and 998,

so there won't be such a problem when using l_a, l_b = self.noise_level[s], self.noise_level[s+1]
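
A quick standalone range check of the two variants (num_steps hard-coded to 1000 here just for illustration):

    import torch

    num_steps = 1000

    # randint's upper bound is exclusive:
    current = torch.randint(1, num_steps + 1, [100000])   # values in [1, 1000] -> noise_level[s] can hit index 1000
    fixed = torch.randint(0, num_steps - 1, [100000])     # values in [0, 998]  -> noise_level[s + 1] tops out at 999

    print(int(current.max()))                             # 1000 (with a sample this large, almost surely)
    print(int(fixed.max()), int(fixed.max()) + 1)         # 998 999 (worst-case indices, both in bounds)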

thorstenMueller commented 3 years ago

Just a brief update: my training with the first bugfix from @SanjaESC (https://github.com/mozilla/TTS/issues/581#issuecomment-737239569) is still running (12 hours / step 28k). The TensorBoard graphs look good, and the GT audio sample in TensorBoard sounds quite reasonable.

erogol commented 3 years ago

Well, it shouldn't be the same.

The current implementation returns a tensor with values between 1 and 1000: s = torch.randint(1, self.num_steps + 1, [y_0.shape[0]])

self.noise_level is of length 1000.

Trying to index self.noise_level at 1000 will be out of bounds and result in an error: l_a, l_b = self.noise_level[s-1], self.noise_level[s]

while this, s = torch.randint(0, self.num_steps - 1, [y_0.shape[0]]), returns a tensor with values between 0 and 998,

so there won't be such a problem when using l_a, l_b = self.noise_level[s], self.noise_level[s+1]

Yes, right, I missed the self.num_steps - 1 part.

Would you send a PR for it?

thorstenMueller commented 3 years ago

If it's okay with all participants, I'd close the issue soon. Thank you guys for your amazing support 👍