r9y9 / deepvoice3_pytorch

PyTorch implementation of convolutional neural networks-based text-to-speech synthesis models
https://r9y9.github.io/deepvoice3_pytorch/

Running train.py (binary_criterion stage) getting repeating errors with "Assertion `input >= 0. && input <= 1.` failed" #115

Closed nmstoker closed 5 years ago

nmstoker commented 6 years ago

I'm running train.py on some data I created, and after a variable number of iterations I keep seeing these assertion failures; it then crashes during the binary_criterion (BCELoss) stage.

Is there a recommended approach to debug my data or see more detail about what's going wrong to cause this issue?

My command line is:

python train.py --preset=presets/nyanko_ljspeech.json --data-root=./datasets/processed_neil7 --speaker-id=0

When it crashes, it first prints many repetitions of this error:

/opt/conda/conda-bld/pytorch_1532579245307/work/aten/src/THCUNN/BCECriterion.cu:42: Acctype bce_functor<Dtype, Acctype>::operator()(Tuple) [with Tuple = thrust::detail::tuple_of_iterator_references<thrust::device_reference<float>, thrust::device_reference<float>, thrust::null_type, thrust::null_type, thrust::null_type, thrust::null_type, thrust::null_type, thrust::null_type, thrust::null_type, thrust::null_type>, Dtype = float, Acctype = float]: block: [0,0,0], thread: [192,0,0] Assertion `input >= 0. && input <= 1.` failed.

And the trace then shows:

Traceback (most recent call last):
  File "train.py", line 992, in <module>
    train_seq2seq=train_seq2seq, train_postnet=train_postnet)
  File "train.py", line 689, in train
    done_loss = binary_criterion(done_hat, done)
  File "/home/neil/.conda/envs/deepvoice3/lib/python3.6/site-packages/torch/nn/modules/module.py", line 477, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/neil/.conda/envs/deepvoice3/lib/python3.6/site-packages/torch/nn/modules/loss.py", line 486, in forward
    return F.binary_cross_entropy(input, target, weight=self.weight, reduction=self.reduction)
  File "/home/neil/.conda/envs/deepvoice3/lib/python3.6/site-packages/torch/nn/functional.py", line 1603, in binary_cross_entropy
    return torch._C._nn.binary_cross_entropy(input, target, weight, reduction)
RuntimeError: reduce failed to synchronize: device-side assert triggered

The number of iterations before it goes wrong varies (typically 50 to 500).

I suspect something may be wrong in my data (somehow producing values outside the 0 to 1 range the assert is checking for).
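
As a first sanity check I have been scanning the preprocessed features for NaNs or empty arrays, roughly as in the sketch below (I'm assuming the features are the .npy files that preprocessing wrote under --data-root, which is what my directory contains):

import glob
import os

import numpy as np

data_root = "./datasets/processed_neil7"  # same --data-root as in my command line above

# Scan every .npy feature file under data_root and flag anything obviously bad.
for path in sorted(glob.glob(os.path.join(data_root, "*.npy"))):
    arr = np.load(path)
    if arr.size == 0:
        print("Empty array:", path)
    elif np.isnan(arr).any():
        print("NaN values:", path)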

I tried cutting the data down until the erroneous entries were removed, but that doesn't seem possible. I have 19k entries, and over a long period I systematically removed chunks of data, but that only seems to delay the error (i.e. it shows up after a greater number of iterations); unfortunately it does not stop the errors completely.

The audio was aligned with gentle; a few clips couldn't be aligned, but I made sure to remove those.

Any suggestions or things I could check?

nmstoker commented 6 years ago

I need to look into this further, but it seems like it may be due to something in the front-end processing, possibly in part an effect of #104.

After a range of other adjustments, I ended up sorting my train.txt file by sentence length and then systematically cut the shortest and longest entries until it stopped producing the error for a sustained period. I further narrowed it down to removing just the shortest samples of four characters or fewer (i.e. certain short words said alone). I have yet to narrow it down to specific words, but once I do that I should be able to establish the cause with confidence.

My working theory is that when the short words are converted to phonemes, if they are of the type that apparently gets ignored (according to #104), that may somehow be creating an empty input for that entry. This would explain why it wasn't occurring consistently (since the phoneme conversion is probabilistic). However, this might not be the sole cause, because I still observed errors when I turned the phoneme probability down to zero (with the short samples present).
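
To test that theory I plan to check what the frontend actually returns for these short words, along the lines of the sketch below (I'm assuming text_to_sequence(text, p=...) on the English frontend is the right entry point, with p being the replace-with-pronunciation probability):

from deepvoice3_pytorch.frontend import en as frontend

# See whether very short words ever come back as empty (or near-empty) sequences
# once phoneme replacement kicks in. The word list and p value are just examples.
for word in ["a", "ah", "no", "oh"]:
    for _ in range(20):
        seq = frontend.text_to_sequence(word, p=0.5)
        if len(seq) <= 1:  # nothing left apart from (I think) an end-of-sequence id
            print("Suspicious sequence for %r: %r" % (word, seq))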

r9y9 commented 6 years ago

I'm afraid I don't have time to spend on OSS right now and can't help you much. That said, one debugging tip (sorry if it's obvious to you) is to use the CPU instead of the GPU. That could give you more informative error messages, and it's usually my first step when debugging in Python.
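
For example, here is a toy sketch of the difference (hypothetical tensors, just to illustrate the idea; on the GPU the same bad input only shows up as the opaque device-side assert). Setting CUDA_LAUNCH_BLOCKING=1 before a GPU run can also make the asynchronous CUDA error appear closer to the real call site.

import torch
import torch.nn.functional as F

# Hypothetical "done" predictions containing a NaN and an out-of-range value,
# the kind of input that trips the BCE assert on the GPU.
done_hat = torch.tensor([0.2, float("nan"), 1.5])
done = torch.tensor([0.0, 1.0, 1.0])

try:
    loss = F.binary_cross_entropy(done_hat, done)
    print("loss:", loss.item())  # may simply come out as nan on the CPU
except RuntimeError as e:
    print("readable CPU-side error:", e)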

Hope this will help you a bit.

nmstoker commented 6 years ago

Thanks @r9y9 - I'll give that a go.

And I quite understand, I'm really grateful for the efforts you put into this already. I'll write up what I find here, in case it's useful for others later.

nmstoker commented 6 years ago

I've been short on time, but I narrowed the cause down to some specific audio clips that were short and had particularly low values for n_frames in the train.txt file. As audio they weren't actually corrupted; they were simply very brief clips.

Removing them meant that training proceeded with no errors.
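
For anyone hitting the same thing, the filter I applied was roughly the sketch below (the column index of n_frames and the threshold are assumptions based on my own train.txt; check yours before using it):

# Drop entries whose n_frames value falls below a threshold and write a new list.
# N_FRAMES_COLUMN and MIN_FRAMES match my file; adjust them to your own layout.
N_FRAMES_COLUMN = 2
MIN_FRAMES = 50

with open("datasets/processed_neil7/train.txt", encoding="utf-8") as f:
    lines = f.readlines()

kept = [l for l in lines if int(l.split("|")[N_FRAMES_COLUMN]) >= MIN_FRAMES]
print("kept %d of %d entries" % (len(kept), len(lines)))

with open("datasets/processed_neil7/train_filtered.txt", "w", encoding="utf-8") as f:
    f.writelines(kept)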

When those particular items are kept in the training data, the values in done_hat (in the section of train.py shown in the traceback) seemed to come back as NaN, which in turn results in the crash during the BCELoss stage.
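
As a stop-gap I have been thinking about guarding the loss with a check along these lines (a rough sketch only; done_hat and binary_criterion are the names from the traceback above, and the check would go just before the done_loss line in train.py):

import torch

def done_hat_is_valid(done_hat):
    """Return True if done_hat can safely go into BCELoss: no NaNs, all values in [0, 1]."""
    if torch.isnan(done_hat).any():
        return False
    return bool((done_hat >= 0).all() and (done_hat <= 1).all())

# In the training loop, just before `done_loss = binary_criterion(done_hat, done)`:
#     if not done_hat_is_valid(done_hat):
#         print("bad done_hat values, skipping this batch")
#         continue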

I need to re-check where the NaN values are coming from to better understand what is going on, but this may be of help if anyone else sees this kind of thing with their own custom data.

stale[bot] commented 5 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.