Open yyt233 opened 6 years ago
One of the approaches is to train an extra stop token target to decide when to stop decoding, rather than checking whether the current time step has reached the length of the targets.
@begeekmyfriend Did you successfully solve the echo problem with this method? What are the detailed steps? Could you please explain it in detail? I tried trimming the silence at the beginning and end, and adding 1s of silence at the end of the audio, but neither works.
That is quite simple. We can synchronize the stop token targets with the length of the mel targets and pad them with _token_pad to indicate the stop time step.
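A minimal sketch of what this could look like, assuming NumPy-style batch preparation; the function name, `max_len` parameter, and the concrete value used for `_token_pad` are illustrative, not the repository's actual code:

```python
import numpy as np

# Assumption: padded/stop steps are marked with this value.
_token_pad = 1.0

def stop_token_targets(mel_lengths, max_len):
    """Build a (batch, max_len) stop-token target aligned with the mel targets:
    0.0 while the decoder should keep emitting frames, _token_pad from the
    final frame onward (including all zero-padded steps)."""
    targets = np.full((len(mel_lengths), max_len), _token_pad, dtype=np.float32)
    for i, length in enumerate(mel_lengths):
        # Steps before the last real frame are "keep decoding".
        targets[i, :length - 1] = 0.0
    return targets

# Usage: two utterances of 4 and 6 mel frames, padded to 6 decoder steps.
print(stop_token_targets([4, 6], 6))
```

At inference time, decoding stops as soon as the predicted stop probability crosses a threshold, instead of running to a fixed maximum length.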
@begeekmyfriend Thank you very much! I'll have a try.
As the paper says: "We train using a batch size of 32, where all sequences are padded to a max length. It's a common practice to train sequence models with a loss mask, which masks loss on zero-padded frames. However, we found that models trained this way don't know when to stop emitting outputs, causing repeated sounds towards the end. One simple trick to get around this problem is to also reconstruct the zero-padded frames."
This seems to be the author's method of eliminating the echo. So, do you have any ideas for reconstructing the zero-padded frames? @keithito Thank you!
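As I understand the quote, "reconstructing the zero-padded frames" just means not applying a loss mask, so the padded region contributes to the reconstruction loss and the model learns to emit (near-)zero frames after the utterance ends. A hypothetical sketch contrasting the two losses, with NumPy standing in for the actual framework and all names illustrative:

```python
import numpy as np

def masked_loss(pred, target, lengths):
    """Common practice: average L1 loss only over real (unpadded) frames."""
    batch, max_len, n_mels = pred.shape
    # mask[i, t] is True for real frames, False for zero-padded ones.
    mask = np.arange(max_len)[None, :] < np.array(lengths)[:, None]
    diff = np.abs(pred - target) * mask[:, :, None]
    return diff.sum() / (mask.sum() * n_mels)

def unmasked_loss(pred, target):
    """The paper's trick: also reconstruct the zero-padded frames by
    averaging the loss over every step, padding included."""
    return np.abs(pred - target).mean()

# Usage: one utterance with 2 real frames padded to 4 steps. The model keeps
# emitting non-zero frames into the padded region ("echo"): the masked loss
# cannot see that, while the unmasked loss penalizes it.
pred = np.ones((1, 4, 2), dtype=np.float32)
target = np.array([[[1, 1], [1, 1], [0, 0], [0, 0]]], dtype=np.float32)
print(masked_loss(pred, target, [2]))   # padded frames ignored
print(unmasked_loss(pred, target))      # padded frames penalized
```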