Gibberish audio on preliminary training

sh-lee-prml / HierSpeechpp

The official implementation of HierSpeech++

MIT License


Gibberish audio on preliminary training #30

Closed by Pranjalya 8 months ago

Pranjalya commented 8 months ago

Thanks for the repo. I have around 30 hours of custom Hindi data that I wanted to train and test the model on. I trained only the TTV part on 4x A6000 GPUs with a batch size of 64, then ran inference with the provided VC checkpoint, but the results were unintelligible. The audio was not distorted; the speech was just unintelligible. What do you reckon could be the cause? Is it because I need a VC model for Hindi as well, or were my training steps too few to get results? And after how many steps do you think we can expect a checkpoint that gives a decent preliminary result? Thanks again!

hayeong0 commented 8 months ago

What does 'unintelligible speech' mean? Can I see your training logs or Tensorboard?

We have used about 60 hours of Hindi data from LIMMITS and have experience using phonemizer. Have you checked if tokens were properly extracted during the training and inference stages?
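One way to run the token check suggested above is to compare the phonemized text against the model's symbol table and count any characters the table does not cover. The sketch below is a minimal, standard-library-only illustration: the `symbols` list and the `phoneme_lines` strings are hypothetical placeholders, not the repository's actual symbol set or real phonemizer output, and `find_unknown_symbols` is a helper name invented here.

```python
from collections import Counter


def find_unknown_symbols(phoneme_lines, symbols):
    """Count characters in phonemized text that are missing from the symbol table."""
    known = set(symbols)
    unknown = Counter()
    for line in phoneme_lines:
        for ch in line:
            if ch not in known:
                unknown[ch] += 1
    return unknown


# Hypothetical symbol table and phonemized strings, for illustration only.
symbols = list("_ ,.?!abdeghijklmnoprstuvzɑəɪʊː")
phoneme_lines = ["nəməsteː", "ɖʰənjəʋɑːd"]

unknown = find_unknown_symbols(phoneme_lines, symbols)
print(dict(unknown))
```

If the result is non-empty for either the training transcripts or the inference text, those characters may be dropped or mapped incorrectly when the text is converted to token IDs, which is one possible cause worth ruling out.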

sh-lee-prml commented 8 months ago

I have attached the TensorBoard loss curves for our TTV v1 model, which was trained on the LibriTTS-960 dataset. We used 4x GPUs with a batch size of 128 (32 per GPU).

[Image: TensorBoard loss curves for the TTV v1 model]

How does the CTC loss curve from your training look? Our checkpoint is from 930k steps.

Also, I do not know Hindi well, but I think Phonemizer may not work well for Hindi. If so, how about trying another tokenizer?
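For a rough sense of scale between the two runs mentioned in this thread, the arithmetic can be sketched as follows. This assumes the batch size of 64 reported for the Hindi run is the global batch size across all 4 GPUs (the thread does not say whether it is global or per GPU), and it uses the 20k-step checkpoint that was tested. The helper name `samples_seen` is made up for illustration.

```python
def samples_seen(steps, global_batch_size):
    """Number of training samples processed after the given number of steps."""
    return steps * global_batch_size


NUM_GPUS = 4

# Reference run from the thread: 930k steps, global batch size 128.
reference_run = samples_seen(930_000, 128)
reference_per_gpu = 128 // NUM_GPUS

# Hindi run: 20k steps, batch size 64 (assumed here to be the global batch size).
custom_run = samples_seen(20_000, 64)
custom_per_gpu = 64 // NUM_GPUS

ratio = reference_run / custom_run

print(f"reference: {reference_run:,} samples ({reference_per_gpu} per GPU per step)")
print(f"custom:    {custom_run:,} samples ({custom_per_gpu} per GPU per step)")
print(f"reference run has seen {ratio:.0f}x as many samples")
```

The gap in samples seen is large, but it only describes how far the released checkpoint was trained, not the minimum number of steps needed before speech becomes intelligible.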

Pranjalya commented 8 months ago

We used phonemizer as well, and in our past experience it works decently for Hindi. Here are my logs:

By "unintelligible" I mean that it sounded like it was speaking clearly, but the speech was unrelated to the text and not in the language. Then again, this was only the 20k-step checkpoint.

rishikksh20 commented 8 months ago

@hayeong0 After how many steps do we start getting audible speech when training TTV from scratch?

Pranjalya commented 8 months ago

Just for reference, here is the audio from 20k steps.

https://github.com/sh-lee-prml/HierSpeechpp/assets/36627085/25cc5bc9-b262-4150-a673-39f6bc6ebca8

sh-lee-prml commented 8 months ago

Here are our results from 10k, 20k, 50k, 100k, 200k, and 950k steps (with hierspeech synthesizer v1).

I have attached audio for some texts and speakers from libritts-test-clean.

Link

With the LibriTTS dataset, the 10k-step model can already synthesize audible speech.

Thanks!

Pranjalya commented 8 months ago

Thank you very much, it helped.