jaywalnut310 / glow-tts

A Generative Flow for Text-to-Speech via Monotonic Alignment Search
MIT License

Trained with Russian dataset, results do not sound as good as demo samples #24

Open hadaev8 opened 4 years ago

hadaev8 commented 4 years ago

Here are some generated samples: https://drive.google.com/drive/folders/1e4xHQ3XX180QFF2aDBEDwu-lVE9e47_g?usp=sharing The voice does not sound natural. Do you think training on 8 GPUs could make it worse?

I added a stress embedding because stress is very important here. These are my changes:

         self.emb = nn.Embedding(n_vocab, hidden_channels, padding_idx=0)
         nn.init.normal_(self.emb.weight[1:], 0.0, hidden_channels**-0.5)
         self.stress_emb = nn.Embedding(3, hidden_channels, padding_idx=0)
         nn.init.normal_(self.stress_emb.weight[1:], 0.0, hidden_channels**-0.5)
         ...
         x = self.emb(x) + self.stress_emb(stress)
         x = x * math.sqrt(self.hidden_channels)  # [b, t, h]
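
For readers who want to try the change in isolation, here is a minimal self-contained sketch of the same idea. The class name `StressAwareEmbedding` is hypothetical, and so is the reading of the three stress classes as 0 = padding, 1 = unstressed, 2 = stressed. The re-initialisation is wrapped in `torch.no_grad()`, which should be equivalent to the `nn.init.normal_` calls above.

```python
import math

import torch
from torch import nn


class StressAwareEmbedding(nn.Module):
    """Symbol embedding plus stress embedding (hypothetical wrapper, not from the repo)."""

    def __init__(self, n_vocab: int, hidden_channels: int, n_stress: int = 3):
        super().__init__()
        self.hidden_channels = hidden_channels
        # Index 0 is padding in both tables, so padded positions embed to zeros.
        self.emb = nn.Embedding(n_vocab, hidden_channels, padding_idx=0)
        self.stress_emb = nn.Embedding(n_stress, hidden_channels, padding_idx=0)
        # Re-initialise every row except the padding row.
        with torch.no_grad():
            self.emb.weight[1:].normal_(0.0, hidden_channels ** -0.5)
            self.stress_emb.weight[1:].normal_(0.0, hidden_channels ** -0.5)

    def forward(self, x: torch.Tensor, stress: torch.Tensor) -> torch.Tensor:
        # x and stress are integer id tensors of the same shape [b, t].
        h = self.emb(x) + self.stress_emb(stress)
        return h * math.sqrt(self.hidden_channels)  # [b, t, h]


if __name__ == "__main__":
    layer = StressAwareEmbedding(n_vocab=40, hidden_channels=192)
    x = torch.tensor([[5, 7, 9, 0]])       # last position is padding
    stress = torch.tensor([[1, 2, 1, 0]])  # assumed: 1 = unstressed, 2 = stressed
    print(layer(x, stress).shape)          # torch.Size([1, 4, 192])
```

Because both tables use `padding_idx=0`, the stress ids have to be padded with 0 at exactly the positions where the symbol ids are padded.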

Any advice?

marlon-br commented 4 years ago

Could you please share the steps you followed for Russian? How many hours of speech did you use?

hadaev8 commented 4 years ago

Well, I changed the tokenization to Cyrillic symbols and added the stress embedding as above. I used 40 hours of data.
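
The tokenizer itself is not shown in this thread, so the following is only a rough sketch of what "Cyrillic symbols plus stress ids" could look like. Everything in it is an assumption for illustration: the symbol table, the `text_to_ids` helper, the convention that a `+` directly precedes the stressed vowel, and the stress classes 0 = padding, 1 = unstressed, 2 = stressed.

```python
# Hypothetical symbol table; id 0 ("_") is reserved for padding.
CYRILLIC = "абвгдеёжзийклмнопрстуфхцчшщъыьэюя"
SYMBOLS = ["_"] + list(CYRILLIC) + list(" .,!?-")
SYMBOL_TO_ID = {s: i for i, s in enumerate(SYMBOLS)}

PAD, UNSTRESSED, STRESSED = 0, 1, 2  # assumed meaning of the 3 stress classes
STRESS_MARK = "+"                    # assumed: "+" directly precedes the stressed vowel


def text_to_ids(text):
    """Return parallel lists of symbol ids and stress ids for one utterance."""
    symbol_ids, stress_ids = [], []
    stressed_next = False
    for ch in text.lower():
        if ch == STRESS_MARK:
            # The mark itself is not a symbol; it only flags the next character.
            stressed_next = True
            continue
        if ch not in SYMBOL_TO_ID:
            # Drop unknown characters and clear any pending stress flag.
            stressed_next = False
            continue
        symbol_ids.append(SYMBOL_TO_ID[ch])
        stress_ids.append(STRESSED if stressed_next else UNSTRESSED)
        stressed_next = False
    return symbol_ids, stress_ids


if __name__ == "__main__":
    ids, stress = text_to_ids("прив+ет")
    print(ids)
    print(stress)  # [1, 1, 1, 1, 2, 1]
```

Keeping the two lists the same length means they can be padded together with 0 and passed as `x` and `stress` to the embedding layers.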