shivammehta25 / Neural-HMM

Neural HMMs are all you need (for high-quality attention-free TTS)
MIT License

How to train a new model with a different language? #7

Open Ctibor67 opened 2 years ago

Ctibor67 commented 2 years ago

I would like to know if it is possible to train a Neural-HMM for another language. What is needed for this? Is a cmudict in the new language required, or can it be bypassed? Is there any tutorial on how to do so?

ghenter commented 2 years ago

Issue #8 makes it sound like you have already gotten started on training your systems, but I will try to answer the question in this issue nonetheless:

For TTS, you generally need a so-called front-end to first "normalise" the text (e.g., to convert things like "$100" into speakable words like "one hundred dollars") and then convert the normalised text into phones using a phonetic dictionary. Front-ends differ from language to language, and a good phonetic dictionary is often expensive to produce and can be accent-specific. CMUdict is a US English phonetic dictionary that has the advantage of being free. :)
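As a rough sketch of what the dictionary-lookup part of a front-end does (the file name, parsing, and fallback behaviour below are illustrative assumptions, not code from this repo; normalisation is assumed to have happened already):

```python
# Minimal sketch of the dictionary-lookup step of a TTS front-end,
# assuming a local plain-text copy of cmudict-0.7b ("WORD  PH1 PH2 ..." per line).

def load_cmudict(path="cmudict-0.7b"):
    lexicon = {}
    with open(path, encoding="latin-1") as f:
        for line in f:
            if line.startswith(";;;"):        # skip comment lines
                continue
            word, phones = line.strip().split("  ", 1)
            lexicon[word] = phones.split()    # alternate prons like "WORD(1)" are left as-is
    return lexicon

def text_to_phones(text, lexicon):
    phones = []
    for word in text.upper().split():
        # Crude fallback: spell out graphemes for out-of-vocabulary words.
        phones.extend(lexicon.get(word, list(word)))
    return phones

# lexicon = load_cmudict()
# print(text_to_phones("HELLO WORLD", lexicon))
```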

It is possible that speech synthesis can work OK without a dictionary, i.e., on graphemes instead of phones as input (to my knowledge, this was first shown for English by the Tacotron paper), but I would be at least moderately surprised if it does, especially with monotonic alignments as used by, e.g., our current neural HMMs. You should look around on the Internet to see what resources are available for the language that you are interested in.

Ctibor67 commented 2 years ago

I recently trained with this repo: https://github.com/NVIDIA/tacotron2, completely without cmudict, in the Czech language. And it learned to speak Czech :-) I just wrote the Czech alphabet in symbols.py. Probably each letter acted like a separate phoneme... I am already at global_step 225000 in training and still nothing. I use the same audio dataset. But in Tacotron 2 I had batch size 20, now in Neural-HMM only 1. Any ideas? (Tacotron 2 training started with "warm_start", and only then did I continue from my own checkpoint.)

ghenter commented 2 years ago

This is a known issue. The so-called forward algorithm used by neural HMM training is less memory efficient than Tacotron 2 training, especially for long utterances with many states (phones or graphemes). Synthesis-time memory consumption is not affected.

We are working on implementing approximate maximum-likelihood training using the Viterbi algorithm, which we believe should make the memory consumption of the neural HMMs comparable to Tacotron 2. Watch this space. :)
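To give an intuition for where the memory goes, here is a schematic of the forward recursion (not this repo's actual implementation); the point is that every per-frame alpha vector has to stay in the autograd graph until backpropagation:

```python
import torch

def forward_log_likelihood(log_emission, log_transition):
    """Schematic forward algorithm for a left-to-right HMM.
    log_emission: (T, N) per-frame, per-state log-likelihoods;
    log_transition: (N, N) log transition matrix."""
    T, N = log_emission.shape
    start = torch.full((N,), float("-inf"))
    start[0] = 0.0                                  # must start in the first state
    log_alpha = [log_emission[0] + start]
    for t in range(1, T):
        prev = log_alpha[-1].unsqueeze(1) + log_transition   # (N, N): from-state x to-state
        log_alpha.append(log_emission[t] + torch.logsumexp(prev, dim=0))
    # All T alpha vectors are retained for backprop, so training memory grows with
    # both utterance length T and number of states N; synthesis does not backprop.
    return log_alpha[-1][-1]                        # must end in the last state
```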

Ctibor67 commented 2 years ago

I wanted to try continuing training from your Neural-HMM model with the command `python train.py -c Neural-HMM.ckpt`. But it gave an error: `size mismatch for model.embedding.weight: copying a param with shape torch.Size([150, 512]) from checkpoint, the shape in current model is torch.Size([179, 512])`. When I looked closer, the size of Neural-HMM.ckpt is 185,813 kB and the size of my training checkpoint.ckpt is 185,987 kB. What do I need to do to continue training from your model?

ghenter commented 2 years ago

Just to check, what codebase was used to create the checkpoint that you want to continue training from? This repo (Neural HMM TTS), Nvidia's Tacotron 2 implementation, or some other codebase?

Ctibor67 commented 2 years ago

This repo, and I downloaded the Neural HMM checkpoint from this link: https://kth.app.box.com/v/neural-hmm-pretrained

ghenter commented 2 years ago

OK. And are you trying to continue training on LJ Speech or on another dataset?

To use another dataset, I think the phone set (or at least the number of input symbols) has to match.

Ctibor67 commented 2 years ago

I'm trying to train on my own Czech dataset. My idea was this: when I trained on Tacotron 2, there was a "warm start" option and training went well with my Czech dataset. I have now trained Neural-HMM from scratch, but even after 400k iterations there was no sign of correct speech. So I wanted to do the same as with Tacotron: start training from your model with my dataset...

ghenter commented 2 years ago

The pre-trained model was trained on LJ Speech, which is a US English dataset. In general, pre-training on one language and then fine-tuning on another is not a standard TTS use case at the moment, so your approach is likely to require some hacking to get it working, regardless of what codebase you use. Good results are not guaranteed.

One issue you will face is that Tacotron 2 and neural HMMs both contain a step that maps the input symbols to learned embedding vectors that then are fed to the encoder. The list of input symbols that the model accepts is called the phone set. The easiest way to get this step to work is to first map each possible input symbol (graphemes?) for your Czech data onto the US English phone set used by the pre-trained model. This may not be straightforward, however, since there might be sounds that exist in English but not in Czech and vice versa. For the record, I believe our pre-trained model uses the same phone set as CMUdict 0.7b, which I think is given by this file.
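Purely as an illustration of what such a mapping could look like (the entries below are linguistically crude guesses for a handful of symbols, not a vetted Czech-to-ARPAbet mapping):

```python
# Illustrative only: a few Czech graphemes mapped to CMUdict/ARPAbet phones.
# A real mapping needs care from someone who knows Czech phonology, and some
# sounds (e.g. "ř") have no English counterpart at all.
CZECH_TO_ARPABET = {
    "a": ["AA"], "e": ["EH"], "i": ["IY"], "o": ["OW"], "u": ["UW"],
    "m": ["M"], "n": ["N"], "t": ["T"], "k": ["K"], "s": ["S"],
    "č": ["CH"], "š": ["SH"], "ž": ["ZH"], "c": ["T", "S"],
}

def map_czech_text(text):
    phones = []
    for ch in text.lower():
        if ch.isspace():
            continue
        phones.extend(CZECH_TO_ARPABET.get(ch, []))  # unmapped symbols are silently dropped here
    return phones
```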

If you do not apply a mapping step to turn your text into a series of symbols that the pre-trained model understands, I think there are likely to be dimension mismatches somewhere when you attempt to train the model. You may optionally choose to also introduce your own, new symbols to the phone set, but in that case you will have to create new embedding vectors for these symbols somehow (probably initialised from scratch), and you will also need to do a bit of manual "surgery" on the system to get these vectors into the existing network(s) and make everything match up.
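If you do add new symbols, the embedding "surgery" could in principle look something like the sketch below. The checkpoint key names are assumptions based on the error message you quoted, the initialisation scale is arbitrary, and this only lines up if the new symbols are appended at the end of the symbol list:

```python
import torch

# Hypothetical sketch: append freshly initialised embedding rows for new symbols.
ckpt = torch.load("Neural-HMM.ckpt", map_location="cpu")
state_dict = ckpt["state_dict"]                        # assumes a Lightning-style checkpoint

old_emb = state_dict["model.embedding.weight"]         # (150, 512) in the released checkpoint
num_new_symbols = 29                                    # 179 - 150 in your error message
new_rows = torch.randn(num_new_symbols, old_emb.shape[1]) * 0.02   # arbitrary init scale

state_dict["model.embedding.weight"] = torch.cat([old_emb, new_rows], dim=0)
torch.save(ckpt, "Neural-HMM-resized.ckpt")
```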

I don't know for certain if the above problem is the reason for the specific error that you received, but it is highly likely that you will have to address problems of the above type in order to be able to take a pre-trained system from one language and fine-tune it on another.

Ctibor67 commented 2 years ago

I installed an NVIDIA Tesla K80 for multi-GPU training. But when I start training, only one of the two GPUs is used. When I set gpus = [1] in hparams, only the second GPU works. When I set gpus = [0], only the first GPU works. Can you advise me what to do to make both work?

ghenter commented 2 years ago

Did you try setting gpus = [0, 1]?

Ctibor67 commented 2 years ago

When I set gpus=[0,1], I get: Distributed package doesn't have NCCL built in

ghenter commented 2 years ago

Are you running on Windows? A quick Google search shows that PyTorch on Windows does not support the NCCL backend for distributed communications. As written in the torch.distributed documentation:

As of PyTorch v1.8, Windows supports all collective communications backend but NCCL

You can use the instructions here to change to another backend. All other PyTorch-supported backends should work on Windows.
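For example, something along these lines should select the gloo backend instead; I am assuming here that the training loop is driven by PyTorch Lightning (which I believe this codebase uses), where the environment variable below is one way to pick the backend:

```python
import os

# Select the gloo backend before training starts; NCCL is not available on Windows.
os.environ["PL_TORCH_DISTRIBUTED_BACKEND"] = "gloo"

# With plain torch.distributed, the equivalent would be to initialise explicitly:
# import torch.distributed as dist
# dist.init_process_group(backend="gloo", rank=rank, world_size=world_size)
```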

ghenter commented 2 years ago

I trained at Neural-HMM from the beginning now, but even after 400k iterations there was no sign of correct speaking.

My impression is that you are training on graphemes. If I am honest, I think such a long training time without getting the system to speak suggests that starting from a pre-trained system might not help either. As I wrote earlier in this thread:

It is possible that speech synthesis can work OK (...) on graphemes (...) but I would be at least moderately surprised if it does, especially with monotonic alignments as used by, e.g., our current neural HMMs.

When I wrote that, I was not aware of anyone having tried to train a system like neural HMM TTS, with its strictly monotonic alignments, on graphemic input. To my knowledge, you are the first to try that.

Why can there be a difference here between Tacotron 2 and our neural HMM TTS? Even though phonemic input usually gives better results, Tacotron 2 was built to be able to handle graphemic input (though it still tends to fail on logographic writing systems). The reason Tacotron 2 takes so long to train is that it does not know that the input-output alignments will be nearly monotonic (or strictly monotonic, if using phones), so it has to spend a long time learning that monotonicity first. By starting from a pre-trained model (even one from a different language), your model already knows about near-monotonicity from the start and doesn't have to learn it all over again, which is one important reason that fine-tuning tends to be a quicker way to obtain good TTS with Tacotron 2 and similar systems.

Our neural HMMs are restricted to strictly monotonic alignments by design, so they train much faster on phonemic input also from scratch. The downside to this is that they might generalise worse to graphemic input, since they cannot break monotonicity and create those nearly monotonic alignments that graphemic input tends to require or prefer.
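To make the structural difference concrete, here is a toy left-to-right transition matrix of the kind a neural HMM is restricted to (the state count and "stay" probability are arbitrary); each state can only stay or advance by one, so any path through the states is monotonic by construction, whereas attention can in principle jump anywhere:

```python
import torch

# Toy 5-state left-to-right transition matrix: stay with probability 0.6,
# otherwise move exactly one state to the right.
N = 5
stay = torch.full((N,), 0.6)
transition = torch.diag(stay) + torch.diag(1.0 - stay[:-1], diagonal=1)
transition[-1, -1] = 1.0          # final state is absorbing
print(transition)
```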

It is definitely possible that your issues are due to something else, but in all circumstances I don't think starting from a pre-trained neural HMM will resolve the issues that caused your previous 400k-updates-long neural HMM TTS training to fail.

ghenter commented 2 years ago

What's the best way to move forward, then? Whether you use neural HMM TTS or Tacotron 2, I think you are likely to get better results if you use (part of) a TTS front-end to phonetise the Czech text into Czech phones first. I did a quick search, and there are free and open front-ends for other TTS systems in the Czech language; one example is eSpeak NG.

eSpeak NG can do complete TTS, but it is based on formant synthesis and is not likely to sound great (you can try it out in any language here). My recommendation is to first use eSpeak NG, but only to convert text to phones, and then feed the resulting phones into either neural HMM TTS or Tacotron 2. This will again require a bit of manual work to convert files to the right format, make the codebases use the right phone set, etc., but it is likely to be worth the effort.
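As a rough illustration of the "phones only" usage (the exact flags and post-processing you need may differ, and you would still have to map the resulting symbols onto whatever phone set the model's symbols.py expects):

```python
import subprocess

def czech_text_to_phones(text):
    # -q: no audio output, -v cs: Czech voice, --ipa: print IPA phones to stdout
    result = subprocess.run(
        ["espeak-ng", "-q", "-v", "cs", "--ipa", text],
        capture_output=True, text=True, check=True,
    )
    return result.stdout.strip()

# print(czech_text_to_phones("Dobrý den"))
```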