Closed: nmstoker closed this issue 5 years ago.
I need to look into this further, but it seems like it may be due to something in the front-end processing, possibly partially an effect from #104.
After a range of other adjustments, I ended up sorting my training.txt file by sentence length and then systematically cutting the shortest and longest entries until the error stopped appearing for a sustained period. I further narrowed it down to removing just the shortest samples, those of four characters or fewer (i.e. certain short words said alone). I have yet to narrow it down to specific words, but once I do, that should let me establish the cause with confidence.
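The filtering step above can be sketched roughly as follows. This is a hypothetical helper, not code from the repo, and the `audio_path|n_frames|text` column layout is an assumption about the metadata format:

```python
# Hypothetical sketch: drop metadata rows whose text is 4 characters or fewer.
# Assumes a pipe-separated "audio_path|n_frames|text" layout with text last.

def filter_short_texts(lines, min_chars=5):
    """Keep only rows whose text column has at least min_chars characters."""
    kept = []
    for line in lines:
        fields = line.rstrip("\n").split("|")
        text = fields[-1]  # assume the text is the last column
        if len(text.strip()) >= min_chars:
            kept.append(line)
    return kept

rows = [
    "clip_0001.wav|412|hello there everyone",
    "clip_0002.wav|38|no",    # 2 characters, dropped
    "clip_0003.wav|55|stop",  # 4 characters, dropped
]
print(filter_short_texts(rows))  # -> ["clip_0001.wav|412|hello there everyone"]
```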
My working theory is that when these short words are converted to phonemes, if they are of the type that apparently gets ignored (according to #104), that may somehow create an empty input for the entry. This would explain why the error wasn't occurring consistently (since the phoneme conversion is probabilistic). However, this seems unlikely to be the sole cause, because I still observed errors when I set the phoneme probability to zero (with the short samples present).
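The theory could be tested by flagging entries whose phoneme conversion comes back empty. A minimal sketch, where `to_phonemes` is a stand-in for whatever front-end the repo actually uses (not its real API), and the toy front-end only mimics the suspected word-dropping behaviour from #104:

```python
# Hedged sketch: find texts whose phoneme conversion yields an empty sequence,
# which would produce an empty model input for that entry.

def find_empty_phoneme_entries(texts, to_phonemes):
    """Return texts for which the (caller-supplied) phonemizer returns nothing."""
    return [t for t in texts if len(to_phonemes(t)) == 0]

# Toy front-end that drops words of 4 characters or fewer, mimicking the
# behaviour suspected in #104. The real phonemizer is different.
toy = lambda text: [w for w in text.split() if len(w) > 4]

print(find_empty_phoneme_entries(["hello there", "no", "stop"], toy))  # -> ["no", "stop"]
```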
I'm afraid that I don't have time to spend for OSS right now and cannot help you much. That said, one debugging tip (sorry if it's obvious for you) is to use CPU instead of GPU. That could give you more informative error messages. What I usually do for debugging in python is:
```python
import ipdb; ipdb.set_trace()
```

where you want to debug.

```shell
CUDA_VISIBLE_DEVICES="-1" train.py ${args}
```

Hope this will help you a bit.
Thanks @r9y9 - I'll give that a go.
And I quite understand, I'm really grateful for the efforts you put into this already. I'll write up what I find here, in case it's useful for others later.
Have been short on time, but I narrowed the cause down to some specific audio clips which were short and had particularly low values for n_frames in the train.txt file. As audio they weren't actually corrupted, they were simply very brief clips.
Removing them meant that training proceeds with no error.
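The clean-up described above could be sketched like this. Again a hypothetical helper: the minimum-frame threshold and the `path|n_frames|text` column order are assumptions, not values from the repo:

```python
# Hedged sketch: split train.txt-style rows into kept/dropped by an assumed
# minimum n_frames threshold, mirroring the removal of very brief clips.

def filter_by_n_frames(lines, min_frames=30):
    """Return (kept, dropped) rows based on the assumed n_frames column."""
    kept, dropped = [], []
    for line in lines:
        path, n_frames, text = line.split("|", 2)
        (kept if int(n_frames) >= min_frames else dropped).append(line)
    return kept, dropped

rows = ["a.wav|412|hello there", "b.wav|12|hi"]
kept, dropped = filter_by_n_frames(rows)
print(kept)     # -> ["a.wav|412|hello there"]
print(dropped)  # -> ["b.wav|12|hi"]
```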
When those particular items were kept in the training data, the values within done_hat in this section of train.py seemed to come back as NaN, which in turn caused it to crash here during the BCELoss stage.
I need to re-check the return of the NaN value to better understand what is going on, but this may be of help if anyone else sees this kind of thing with their own custom data.
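One way to catch this earlier is to validate the predicted probabilities before they reach BCELoss. The real model works on torch tensors; this is a stdlib-only illustration of the check, not code from train.py:

```python
import math

# Illustrative check: report indices of predicted "done" probabilities that
# would trip the BCE assertion (NaN, or outside [0, 1]), so a bad batch can
# be caught with a useful message instead of a CUDA-side assert.

def check_done_hat(values):
    """Return indices of values that are NaN or outside the [0, 1] range."""
    bad = []
    for i, v in enumerate(values):
        if math.isnan(v) or v < 0.0 or v > 1.0:
            bad.append(i)
    return bad

print(check_done_hat([0.1, float("nan"), 0.9, 1.5]))  # -> [1, 3]
```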
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
I'm running train.py on some data I created, and after a variable number of iterations, I keep seeing these assertion fails and it crashes during the binary_criterion (BCELoss) stage.
Is there a recommended approach to debug my data or see more detail about what's going wrong to cause this issue?
My command line is:
When it crashes it starts with a load of these errors repeating:
```
/opt/conda/conda-bld/pytorch_1532579245307/work/aten/src/THCUNN/BCECriterion.cu:42: Acctype bce_functor<Dtype, Acctype>::operator()(Tuple) [with Tuple = thrust::detail::tuple_of_iterator_references<thrust::device_reference<float>, thrust::device_reference<float>, thrust::null_type, thrust::null_type, thrust::null_type, thrust::null_type, thrust::null_type, thrust::null_type, thrust::null_type, thrust::null_type>, Dtype = float, Acctype = float]: block: [0,0,0], thread: [192,0,0] Assertion `input >= 0. && input <= 1.` failed.
```
And the trace then shows:
The number of iterations before it goes wrong varies (typically 50 to 500 iterations).
I suspect something may be wrong in my data (somehow causing the values to be outside the 0 - 1 range the assert is checking for).
I tried cutting the data down to isolate the erroneous entries, but that doesn't seem possible. I have 19k entries, and over a long period I systematically removed chunks of data, but that only seems to delay the error (i.e. it shows up after a greater number of iterations); unfortunately it does not stop the errors completely.
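For what it's worth, the chunk-removal approach above amounts to a bisection search, which can be sketched as below. Note the caveat: this only works if the crash is deterministic for a given subset, which (as described) may not hold here, so it is offered only as an illustration:

```python
# Hypothetical bisection helper for localizing a single bad training entry.
# `crashes` is a caller-supplied predicate: "does training on this subset
# crash?" Assumes the failure reproduces deterministically per subset.

def bisect_bad_entry(entries, crashes):
    """Narrow a crashing dataset down to one offending entry by halving."""
    while len(entries) > 1:
        mid = len(entries) // 2
        left, right = entries[:mid], entries[mid:]
        entries = left if crashes(left) else right
    return entries[0]

# Toy predicate: any subset containing "bad.wav" crashes.
files = ["a.wav", "b.wav", "bad.wav", "c.wav"]
print(bisect_bad_entry(files, lambda subset: "bad.wav" in subset))  # -> "bad.wav"
```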
The audio was aligned with Gentle. A few clips couldn't be aligned (but I made sure to remove those).
Any suggestions or things I could check?