Closed thorstenMueller closed 3 years ago
It is probably an OOM error in a different guise.
That would be the most obvious cause, but I tried decreasing the batch size from 96 (default) down to 8 with no effect. I also tried changing the LR from 1e-4 (default) to 2e-4 (the default value in freds0's WaveGrad implementation).
I initially set "stats_path" to the "taco2-scale_stats.npy" from our Taco2 model training (is this a good idea in general?). If I leave "stats_path" empty, I receive a warning before training fails with the previously posted error:
> TRAINING (2020-11-30 15:32:13)
/home/thorsten/___prj/tts/models/vocoder/wavegrad/mozilla/lib/python3.6/site-packages/torch/optim/lr_scheduler.py:123: UserWarning: Detected call of `lr_scheduler.step()` before `optimizer.step()`. In PyTorch 1.1.0 and later, you should call them in the opposite order: `optimizer.step()` before `lr_scheduler.step()`. Failure to do this will result in PyTorch skipping the first value of the learning rate schedule. See more details at https://pytorch.org/docs/stable/optim.html#how-to-adjust-learning-rate
"https://pytorch.org/docs/stable/optim.html#how-to-adjust-learning-rate", UserWarning)
I realise each case can be different, but I'm pretty sure it went okay for me using WaveGrad in TTS with the LR at 1e-4, although I did hit some minor problems when continuing training (though those don't look quite like what you have here).
For the original error, "CUDA error: device-side assert triggered", I think you may be able to get more detail if you run it again with: CUDA_LAUNCH_BLOCKING="1"
Running with
CUDA_LAUNCH_BLOCKING="1"
didn't provide more debug output.
I ran "compute_statistics.py" with my WaveGrad vocoder config file and it produced the output below. I used the .npy file as stats_path in the vocoder config.
> There are 22672 files.
100%|████████████████████████████████████████████████████████████████████████████████████████| 22672/22672 [15:03<00:00, 25.09it/s]
> Avg mel spec mean: -50.53502474642339
> Avg mel spec scale: 16.255307995251243
> Avg linear spec mean: -36.591325638551716
> Avg linear spec scale: 14.408225508873507
> stats saved to /tmp/thorsten-wavegrad-vocoder-stats.npy
I think maybe my torch version (1.6) has an influence here. Trying torch 1.7 next.
Did you run it using CUDA_LAUNCH_BLOCKING=1 python ./TTS/bin/train_vocoder_wavegrad.py --config_path ./TTS/vocoder/configs/wavegrad_thorsten.json
? This makes CUDA kernel launches synchronous (it does not force CPU execution), so the stack trace should point at the op that actually failed.
Yes, @SanjaESC, I ran the above command, but the output shows that CUDA is still used. I also tried "export CUDA_LAUNCH_BLOCKING=1" without success.
I'm asking myself whether I'm doing something wrong in general, since it seems to work for all of you, but I'm struggling with it. Is there a recommended torch version?
(mozilla) thorsten@nvidia-agx:~/___prj/tts/models/vocoder/wavegrad/mozilla/TTS$ CUDA_LAUNCH_BLOCKING=1 python ./TTS/bin/train_vocoder_wavegrad.py --config_path ./TTS/vocoder/configs/wavegrad_thorsten.json
> Using CUDA: True
> Number of GPUs: 1
> Mixed precision is enabled
> Git Hash: ac46c3f
> Experiment folder: /home/thorsten/___prj/tts/models/vocoder/wavegrad/mozilla/wavegrad-model-output/wavegrad-thorsten-December-01-2020_07+11AM-ac46c3f
By the way, this is what happens on torch 1.7:
> EPOCH: 0/10000
> TRAINING (2020-12-01 07:11:57)
! Run is removed from /home/thorsten/___prj/tts/models/vocoder/wavegrad/mozilla/wavegrad-model-output/wavegrad-thorsten-December-01-2020_07+11AM-ac46c3f
Traceback (most recent call last):
File "./TTS/bin/train_vocoder_wavegrad.py", line 504, in <module>
main(args)
File "./TTS/bin/train_vocoder_wavegrad.py", line 401, in main
epoch)
File "./TTS/bin/train_vocoder_wavegrad.py", line 127, in train
raise RuntimeError(f'Detected NaN loss at step {global_step}.')
RuntimeError: Detected NaN loss at step 1.
I tried a workaround from this WaveGrad repo (https://github.com/ivanvovk/WaveGrad/issues/8#issuecomment-706800710).
It is the latest dev branch, right?
Yes. git pull says: Already up to date.
It also happens on my side, but a smaller LR fixed the NaN / jumping losses for me.
Changing the LR value didn't help. I tried different values for mixed_precision (true/false), gamma, and pad_short without success.
I thought maybe it's because I have some too-short audio files in my dataset, so I copied the 500 longest recordings into a mini-dataset, but that didn't change anything.
Next I'd try with another dataset to check whether my setup or my dataset is the problem. If anybody is interested and has time left, here's the download link for my dataset which causes the problems: https://drive.google.com/file/d/1mGWfG0s2V2TEg-AI2m85tze1m4pyeM7b/view?usp=sharing
I've had the same RuntimeError: CUDA error: device-side assert triggered
trying to train on your data.
I think the problem stems from here https://github.com/mozilla/TTS/blob/ac46c3ff4ce7ae85d99b1b40dde0802e31de5767/TTS/vocoder/models/wavegrad.py#L108
There is a chance to get an index which will fail in the next line https://github.com/mozilla/TTS/blob/ac46c3ff4ce7ae85d99b1b40dde0802e31de5767/TTS/vocoder/models/wavegrad.py#L109
So changing line 108 to s = torch.randint(0, self.num_steps, [y_0.shape[0]])
worked, and it seems to train fine.
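For anyone following along: `torch.randint(low, high, size)` samples from `[low, high)` (the upper bound is exclusive), so the original sampling could produce `s == num_steps`, which overruns a noise schedule of length `num_steps`. The off-by-one can be reproduced without torch, using a plain list as a stand-in for `self.noise_level` (assuming `num_steps = 1000`, as mentioned elsewhere in the thread):

```python
import random

num_steps = 1000  # length of the noise schedule (assumed from the thread)
noise_level = [i / num_steps for i in range(num_steps)]  # stand-in for self.noise_level

# The original code sampled s from [1, num_steps] inclusive
# (torch.randint(1, num_steps + 1, ...)); s == num_steps then overruns the list:
try:
    noise_level[num_steps]  # index 1000 into a length-1000 list
except IndexError as e:
    print("out of bounds:", e)

# The workaround samples s from [0, num_steps - 1] (randint's upper bound
# being exclusive), so noise_level[s] is always a valid index:
s = random.randrange(0, num_steps)
assert 0 <= s < len(noise_level)
```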
I'm confused. What is the issue here? NaN loss or indexing problem?
Thanks @SanjaESC for figuring out the problem 👍. Did you cancel training, or is it still running? @erogol I started with torch 1.6.0 and ran into the index problem. After that I tried training with torch 1.7 and ran into the NaN problem.
If I understand @SanjaESC right, I'll switch back to torch 1.6, apply the "fix" for my dataset in line 108, and training should work then.
@thorstenMueller I've stopped it. @erogol For me it was the indexing problem.
More info:
but if you do this
s = torch.randint(0, self.num_steps, [y_0.shape[0]])
The next line could then get index -1, since this generates values in a range that includes 0.
@erogol I have not really investigated the code yet. But the error only appears when there is a value of 1000 inside the tensor s. I guess that's because the noise schedule only covers indices 0-999.
So s = torch.randint(1, self.num_steps, [y_0.shape[0]])
should be sufficient here, but not sure how it might impact the model performance here.
But thinking about it, wouldn't -1 just index the last value of the list? I guess that's why it also works.
ok then I just wait for @thorstenMueller to debug it I guess :)
Taking another look. This should work correctly without getting an -1 index and cover the whole noise_level range.
s = torch.randint(0, self.num_steps-1, [y_0.shape[0]])
l_a, l_b = self.noise_level[s], self.noise_level[s+1]
Thanks a lot @SanjaESC. I just started a training run using your first/previous workaround about an hour ago. I thought about adding some debug output whenever s is lower than 0, to get an idea how often and at which steps this can occur.
But this doesn't work, because s is a tensor, which can't simply be compared against 0 with a plain if.
s = torch.randint(0, self.num_steps, [y_0.shape[0]])
if s < 0:
print("s value is {} at step {}".format(s, self.num_steps))
tensor([645, 306, 965, 393, 343, 920, 204, 108, 845, 619, 210, 198, 117, 233,
581, 945, 239, 423, 846, 381, 876, 77, 483, 444, 487, 966, 357, 117,
591, 402, 994, 209])
tensor([ 64, 638, 265, 858, 772, 884, 450, 627, 821, 7, 971, 359, 157, 36,
611, 264, 369, 260, 490, 315, 237, 105, 48, 825, 111, 371, 584, 1,
969, 286, 853, 163])
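(The `if s < 0:` check fails because `s` is a tensor with more than one element, and Python's `if` needs a single boolean; in torch the elementwise form would be something like `(s < 0).any()`. The same idea in plain Python, with a hypothetical batch of sampled indices:)

```python
# Hypothetical batch of sampled indices (stand-in for the tensor s)
s = [645, 306, -1, 393]

# Equivalent of torch's (s < 0).any(): test elementwise, then reduce
if any(v < 0 for v in s):
    print("negative indices in batch:", [v for v in s if v < 0])
```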
So what do you think: should I cancel the current training, apply your newest code change and restart, or keep the training running to see whether the first workaround fails at some point?
Taking another look. This should work correctly without getting an -1 index and cover the whole noise_level range.
s = torch.randint(0, self.num_steps-1, [y_0.shape[0]])
l_a, l_b = self.noise_level[s], self.noise_level[s+1]
This is the same as
s = torch.randint(1, self.num_steps + 1, [y_0.shape[0]])
l_a, l_b = self.noise_level[s-1], self.noise_level[s]
Well it shouldn't be the same
The current implementation will return a tensor with values between 1 and 1000
s = torch.randint(1, self.num_steps + 1, [y_0.shape[0]])
self.noise_level being of length 1000
Trying to load self.noise_level at the index 1000 will be out of bounds and result in an error.
l_a, l_b = self.noise_level[s-1], self.noise_level[s]
while this
s = torch.randint(0, self.num_steps - 1, [y_0.shape[0]])
will return a tensor with values between 0 and 998
so there won't be such a problem using
l_a, l_b = self.noise_level[s], self.noise_level[s+1]
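To make the difference concrete, here is a plain-Python check (assuming num_steps = 1000) of the index pairs each variant can feed into self.noise_level:

```python
num_steps = 1000  # assumed noise-schedule length

# Variant A: s = torch.randint(1, num_steps + 1, ...) -> s in [1, 1000],
# then indexing noise_level[s-1] and noise_level[s]
pairs_a = [(s - 1, s) for s in range(1, num_steps + 1)]

# Variant B: s = torch.randint(0, num_steps - 1, ...) -> s in [0, 998],
# then indexing noise_level[s] and noise_level[s+1]
pairs_b = [(s, s + 1) for s in range(0, num_steps - 1)]

max_index_a = max(b for _, b in pairs_a)
max_index_b = max(b for _, b in pairs_b)
assert max_index_a == num_steps        # 1000: one past the end of noise_level
assert max_index_b == num_steps - 1    # 999: last valid index

# With the + 1 removed (s in [1, 999]), variant A yields exactly variant B's pairs:
pairs_a_fixed = [(s - 1, s) for s in range(1, num_steps)]
assert pairs_a_fixed == pairs_b
```

So the two formulations only coincide once variant A's upper bound is lowered by one; as written, variant A can index one past the end of the schedule.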
Just a brief update: my training with the first bugfix from @SanjaESC (https://github.com/mozilla/TTS/issues/581#issuecomment-737239569) is still running (12 hours / step 28k). The TB graphs look good, and the GT audio sample in TensorBoard sounds quite reasonable.
Well it shouldn't be the same
The current implementation will return a tensor with values between 1 and 1000
s = torch.randint(1, self.num_steps + 1, [y_0.shape[0]])
self.noise_level being of length 1000
Trying to load self.noise_level at the index 1000 will be out of bounds and result in an error.
l_a, l_b = self.noise_level[s-1], self.noise_level[s]
while this
s = torch.randint(0, self.num_steps - 1, [y_0.shape[0]])
will return a tensor with values between 0 and 998, so there won't be such a problem using
l_a, l_b = self.noise_level[s], self.noise_level[s+1]
Yes, right, I missed the self.num_steps - 1 part.
Would you send a PR for it ?
If it's okay for all participants, I'd close the issue soon. Thank you guys for your amazing support 👍
First things first: Thanks @nmstoker for providing great gatherup tool :+1: .
Details
While in epoch 0, I get the following error:
Platform OS
Python Environment
Python 3.6.9
Virtual env: Venv / virtualenv
Package Installation
TTS installed from source on GitHub
Package list (package count: 116)
:package: Package list from Pip | Package | Version | | ------------- | ------------- | |absl-py|0.11.0| |astroid|2.4.2| |astunparse|1.6.3| |attrdict|2.0.1| |attrs|20.3.0| |audioread|2.1.9| |bokeh|1.4.0| |cachetools|4.1.1| |cardboardlint|1.3.0| |certifi|2020.11.8| |cffi|1.14.4| |chardet|3.0.4| |click|7.1.2| |clldutils|3.5.4| |colorama|0.4.4| |colorlog|4.6.2| |commonmark|0.9.1| |confuse|1.3.0| |csvw|1.8.1| |cycler|0.10.0| |Cython|0.29.21| |dataclasses|0.7| |decorator|4.4.2| |docopt|0.6.2| |filelock|3.0.12| |Flask|1.1.2| |future|0.18.2| |gast|0.3.3| |gatherup|0.0.4| |gdown|3.12.2| |german-transliterate|0.1.3| |google-auth|1.23.0| |google-auth-oauthlib|0.4.2| |google-pasta|0.2.0| |grpcio|1.33.2| |h5py|2.10.0| |idna|2.10| |importlib-metadata|3.1.0| |importlib-resources|3.3.0| |inflect|5.0.2| |isodate|0.6.0| |isort|4.3.21| |itsdangerous|1.1.0| |Jinja2|2.11.2| |joblib|0.17.0| |Keras-Preprocessing|1.1.2| |kiwisolver|1.3.1| |lazy-object-proxy|1.4.3| |librosa|0.7.2| |llvmlite|0.31.0| |Markdown|3.3.3| |MarkupSafe|1.1.1| |matplotlib|3.3.3| |mccabe|0.6.1| |nose|1.3.7| |num2words|0.5.10| |numba|0.48.0| |numpy|1.18.5| |oauthlib|3.1.0| |opt-einsum|3.3.0| |packaging|20.7| |phonemizer|2.2.1| |Pillow|8.0.1| |pip|20.2.4| |pkg-resources|0.0.0| |prompt-toolkit|3.0.8| |protobuf|3.14.0| |pyasn1|0.4.8| |pyasn1-modules|0.2.8| |pycparser|2.20| |Pygments|2.7.2| |pylint|2.5.3| |pyparsing|2.4.7| |pysbd|0.3.3| |PySocks|1.7.1| |python-dateutil|2.8.1| |pyworld|0.2.12| |PyYAML|5.3.1| |questionary|1.8.1| |regex|2020.11.13| |requests|2.25.0| |requests-oauthlib|1.3.0| |resampy|0.2.2| |rfc3986|1.4.0| |rich|8.0.0| |rsa|4.6| |scikit-learn|0.23.2| |scipy|1.4.1| |segments|2.1.3| |setuptools|50.3.2| |six|1.15.0| |SoundFile|0.10.3.post1| |tabulate|0.8.7| |tensorboard|2.4.0| |tensorboard-plugin-wit|1.7.0| |tensorboardX|2.1| |tensorflow|2.3.0+nv20.9| |tensorflow-estimator|2.3.0| |termcolor|1.1.0| |threadpoolctl|2.1.0| |toml|0.10.2| |torch|1.6.0| |tornado|6.1| |tqdm|4.54.0| |TTS|0.0.7+ac46c3f| 
|typed-ast|1.4.1| |typing-extensions|3.7.4.3| |umap-learn|0.4.6| |Unidecode|0.4.20| |uritemplate|3.0.1| |urllib3|1.26.2| |wcwidth|0.2.5| |Werkzeug|1.0.1| |wheel|0.35.1| |wrapt|1.12.1| |zipp|3.4.0|
- generated at 13:31 on Nov 30 2020 using Gather Up tool :gift: