lingjzhu / charsiu

Charsiu: A neural phonetic aligner.
MIT License

About training #34

Open dourcer opened 1 year ago

dourcer commented 1 year ago

The README says of experiments: "Those were original research code for training the model." Good job. I want to pre-train on my own Chinese dataset, but I don't know whether the code in experiments is usable as is. If it is, could you write up rough training instructions?

lingjzhu commented 1 year ago

Sorry for the late reply! I really appreciate your interest in our work!

This script is the original training script I used. English training data is available here and here, but you might need to deduplicate it to train an alignment model. The audio is not distributed, but the LibriSpeech and TIMIT audio is easy to find.

  1. You need to modify the data collator and the tokenizer to suit your data.
  2. Please first train the model using a dataset with short sentences (1-2 seconds). After that, visually inspect whether diagonal alignment patterns have formed. If not, you might need to change some hyperparameters and try again.
  3. Then train the model on longer utterances to improve performance.
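The short-to-long ordering in steps 2 and 3 can be sketched as a duration-based split of the training set. This is only an illustration, not the code in experiments: the utterance records (dicts with "id" and "duration" keys), the function name curriculum_stages, and the 5 and 15 second caps are all assumptions made for the example. Only the 2 second cap comes from the advice above. The sketch also assumes each later stage keeps the shorter utterances, which is one possible reading of step 3.

```python
def curriculum_stages(utterances, max_durations=(2.0, 5.0, 15.0)):
    """Return one training subset per duration cap, shortest cap first.

    `utterances` is a list of dicts with a "duration" key in seconds
    (a hypothetical record format). Each stage is cumulative: it holds
    every utterance no longer than that stage's cap.
    """
    stages = []
    for cap in sorted(max_durations):
        stages.append([u for u in utterances if u["duration"] <= cap])
    return stages


# Toy metadata standing in for a real corpus.
utts = [
    {"id": "a", "duration": 1.2},
    {"id": "b", "duration": 4.0},
    {"id": "c", "duration": 1.8},
    {"id": "d", "duration": 12.5},
]
stages = curriculum_stages(utts)
for cap, stage in zip((2.0, 5.0, 15.0), stages):
    print(cap, [u["id"] for u in stage])
```

Training would start on the first stage alone and move to the next stage only once diagonal alignments are visible.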

The hardest part is the curriculum training needed to foster the formation of diagonal alignment patterns. It might be easier if you start from short utterances. If no diagonal alignment pattern has formed after 100 iterations, it is usually very hard to correct in subsequent iterations. After the paper was published, we found that initializing the model from a CTC-based ASR model or a frame classification model can help the diagonal pattern form. According to others, adding a diagonal prior also helps, but I didn't have any luck with that. I think the shortcoming of this method is that training is very unstable and can be frustrating, so I am currently working on other methods to make it easier.
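As a rough complement to visual inspection, the "is it diagonal yet?" check can be turned into a number. The sketch below is not part of Charsiu. It builds a Gaussian band around the diagonal of a frames-by-phones grid (the same shape of weighting used by guided-attention style diagonal priors) and measures how much of an alignment matrix's mass falls inside it. The function names, the nested-list matrix format, and sigma=0.2 are assumptions for illustration.

```python
import math


def diagonal_prior(n_frames, n_phones, sigma=0.2):
    """Gaussian band around the diagonal of an n_frames x n_phones grid."""
    prior = []
    for t in range(n_frames):
        row = []
        for p in range(n_phones):
            # Compare relative positions so the band follows the diagonal
            # even when the two axes have different lengths.
            d = (t + 0.5) / n_frames - (p + 0.5) / n_phones
            row.append(math.exp(-d * d / (2 * sigma * sigma)))
        prior.append(row)
    return prior


def diagonality(attn, sigma=0.2):
    """Prior-weighted share of alignment mass, between 0 and 1."""
    n_frames, n_phones = len(attn), len(attn[0])
    prior = diagonal_prior(n_frames, n_phones, sigma)
    total = sum(sum(row) for row in attn)
    near = sum(a * w
               for a_row, w_row in zip(attn, prior)
               for a, w in zip(a_row, w_row))
    return near / total


# Toy matrices: a clean monotonic alignment and a flat, uninformative one.
n_frames, n_phones = 8, 4
monotonic = [[1.0 if p == t * n_phones // n_frames else 0.0
              for p in range(n_phones)] for t in range(n_frames)]
uniform = [[1.0 / n_phones] * n_phones for _ in range(n_frames)]

print(diagonality(monotonic), diagonality(uniform))
```

A score that stays close to the uniform baseline after roughly 100 iterations would match the failure case described above. The same prior matrix could in principle be used as a training-time penalty, but whether that helps is exactly the open question in this thread.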