Closed: rild closed this issue 6 years ago
I usually work with a toy dataset of roughly the same size as yours for sanity checks, and the alignment shows up quickly. Try two things: 1. injecting the speaker ID, as in issue #18, and 2. monotonic attention.
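For reference, hard monotonic attention can be sketched roughly as below. This is an illustrative NumPy sketch, not code from this repository: `hard_monotonic_attend` and the greedy 0.5 threshold are my own assumptions. The key property is that the attention position can only stay put or move right, never jump backwards.

```python
import numpy as np

def hard_monotonic_attend(p_choose):
    """Greedy hard monotonic decoding.

    p_choose is a (T_out, T_in) matrix of 'select this input position'
    probabilities. For each output step we scan rightward from the
    previous position and attend at the first index whose selection
    probability crosses 0.5, so the alignment is monotonic by
    construction.
    """
    T_out, T_in = p_choose.shape
    alignment = np.zeros((T_out, T_in))
    j = 0
    for i in range(T_out):
        # Scan right from the previous position; never move backwards.
        while j < T_in - 1 and p_choose[i, j] < 0.5:
            j += 1
        alignment[i, j] = 1.0
    return alignment
```

In a real model the selection probabilities come from an energy function over encoder/decoder states; the point here is only the left-to-right decoding constraint.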
@AzamRabiee, thank you for your reply!
I didn't know about the monotonic attention paper.
I'd like to read the issue and the monotonic attention paper, and try both.
Same problem here..?
If you are building multi-speaker models (like DeepVoice2 or 3), it should be okay to use data from multiple speakers. However, I'm pretty sure the reason you got a non-monotonic alignment is that you don't have sufficient data. https://sites.google.com/site/shinnosuketakamichi/publication/jsut is a freely available Japanese dataset that might be useful for you. It consists of 10 hours of audio recordings of a single female speaker. I just started to explore the dataset today with the DeepVoice3 architecture and can get nearly monotonic attention very quickly, as follows:
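As a quick sanity check on whether an alignment is "nearly monotonic", one can measure how often the attention peak moves backwards across decoder steps. `monotonicity_score` is a hypothetical helper for illustration, not part of deepvoice3_pytorch:

```python
import numpy as np

def monotonicity_score(attention):
    """Fraction of consecutive decoder steps whose argmax attention
    position does not move backwards; close to 1.0 for a clean
    monotonic alignment, lower when attention jumps around.

    attention: (T_out, T_in) alignment matrix from one utterance.
    """
    peaks = attention.argmax(axis=1)  # attended input index per output step
    return float(np.mean(np.diff(peaks) >= 0))
```

Logging this number during training gives a cheap scalar to watch instead of eyeballing alignment plots every few thousand steps.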
Code is available at https://github.com/r9y9/deepvoice3_pytorch. If you are interested, feel free to contact me.
@r9y9, thank you!
I was just trying to build the simplest possible TTS system, so I have no intention of building a multi-speaker model right now.
I got the training dataset from here.
I feel the utterances in the corpus may be too long to train Tacotron on.
Yesterday, I ran train.py with 100 utterances (a single-speaker, non-emotional speech dataset), but I didn't get a good result.
The dataset (arxiv) looks good for training!
@r9y9

> If you are interested, feel free to contact me.

I feel very encouraged to hear that. Thank you so much. I'd like to send you an email later.
TL;DR
Is it better to use voice data from the same speaker for training?
Question
I am a student studying speech synthesis at a university in Japan.
I read the Tacotron paper and wanted to try it myself, so I am trying to train Tacotron on Japanese. (I was able to confirm that training on English works as intended when using the training data available in the repository.)
The problem is that the encoder/decoder alignment for Japanese is not learned well.
Here is the alignment at step 23000 (attached image: step-23000-align.png).
For training, I used speech data from three speakers, each reading 100 sentences.
I suppose the reasons why alignment fails are as follows:
Thank you.