ycat3 closed this issue 4 years ago
Sometimes the duration predictor fails to predict durations, which may be the cause of this failure. With the conformer, the convolution module captures local context better than the standard Transformer does, so conformer-fastspeech2 is more robust to the input text content.
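To illustrate how a failed duration prediction could drop a sound: in FastSpeech-style models, a length regulator repeats each token's hidden state according to its predicted duration, so a token predicted to last 0 frames contributes nothing to the output. The sketch below is only a toy illustration of that mechanism. The tokens and durations are made up, and `length_regulate` and `silent_tokens` are hypothetical helpers, not ESPnet functions.

```python
from typing import List, Sequence, Tuple


def length_regulate(states: Sequence[str], durations: Sequence[int]) -> List[str]:
    """Repeat each per-token state durations[i] times (FastSpeech-style expansion)."""
    if len(states) != len(durations):
        raise ValueError("states and durations must have the same length")
    frames: List[str] = []
    for state, duration in zip(states, durations):
        # A duration of 0 (or less) yields no frames for this token.
        frames.extend([state] * max(int(duration), 0))
    return frames


def silent_tokens(tokens: Sequence[str], durations: Sequence[int]) -> List[Tuple[int, str]]:
    """Return (index, token) pairs whose predicted duration is not positive."""
    return [(i, t) for i, (t, d) in enumerate(zip(tokens, durations)) if d <= 0]


# Hypothetical tokens and durations, NOT taken from the actual model.
tokens = ["i", "t", "a"]
durations = [3, 2, 0]  # the last token is predicted to last 0 frames

frames = length_regulate(tokens, durations)
print(frames)
print(silent_tokens(tokens, durations))
```

If the real model's predicted durations were inspected in the same way, a zero for the final token would be consistent with the missing last character, but that is an assumption and has not been verified here.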
I also confirmed the above behavior with jsut_fastspeech2.
Interestingly,
この谿谷の、最も深いところには、木曽福島の関所も、隠れていた
fails, while
この谿谷の、最も深いところには、福島の関所も、隠れていた
is OK, and
この木曽の、最も深いところには、木曽福島の関所も、隠れていた
is OK.
The Transformer behavior is mysterious :(
If you would like to discuss this further, please re-open the issue.
espnet2_tts_demo.ipynb shows a strange problem, both on Google Colab and in my local notebook. When I input the following sentence, "この谿谷の、最も深いところには、木曽福島の関所も、隠れていた。", the last character is missing (no sound).
tag = "kan-bayashi/jsut_fastspeech2"
vocoder_tag = "jsut_multi_band_melgan.v2"
Other tags, including "kan-bayashi/jsut_conformer_fastspeech2", do not show the problem.