resemble-ai / Resemblyzer

A python package to analyze and compare voices with deep learning
Apache License 2.0

Re-train on my language domain #41

Open v-nhandt21 opened 3 years ago

v-nhandt21 commented 3 years ago

Thank you very much for the amazing repo!

I am working on my thesis. Can you recommend a way to train this for my specific language? I have tried it and got 72% accuracy, but I cannot improve it any further.

I mean, how can I train the model to produce a pretrained.pt for my own domain? Thank you!

riccardomusmeci commented 3 years ago

Hi,

I actually re-trained the model with a custom dataset, but it's hard to say which technique to pursue. You can either continue training from the checkpoint where the original Resemblyzer stopped, or start from scratch. In either case, you need a very large dataset so the model can generalize to your use case. I tried with 20 speakers and 2 hours of total recordings, and the model is still not good.

tranctan commented 3 years ago

You can start by running the pretrained model on your own dataset to see how well it works. As the author mentioned, it is trained on VoxCeleb1, VoxCeleb2 and LibriSpeech-other, which already contain multiple languages, so it is expected to work for other languages to some extent.

In case the pretrained model is not good enough, then as @riccardomusmeci already mentioned, there are two possible ways: fine-tuning the pretrained model or training from scratch.

If your dataset is large enough (hundreds of speakers with thousands of hours), you can try training from scratch. Otherwise, fine-tuning is a decent way to go.
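
For reference, checking the pretrained model on your own data can be as simple as embedding a few clips and comparing cosine similarities. A minimal sketch (the wav paths are just placeholders; Resemblyzer's embeddings are unit-length, so a dot product is the cosine similarity):

```python
import numpy as np
from resemblyzer import VoiceEncoder, preprocess_wav

encoder = VoiceEncoder()  # loads the bundled pretrained weights

# Placeholder paths: two clips from speaker A, one clip from speaker B
emb_a1 = encoder.embed_utterance(preprocess_wav("speaker_a_1.wav"))
emb_a2 = encoder.embed_utterance(preprocess_wav("speaker_a_2.wav"))
emb_b1 = encoder.embed_utterance(preprocess_wav("speaker_b_1.wav"))

# Embeddings are L2-normalized, so the dot product is the cosine similarity
print("same speaker:     ", np.dot(emb_a1, emb_a2))
print("different speakers:", np.dot(emb_a1, emb_b1))
```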

michaelgfeldman commented 3 years ago

Hi @tranctan, can you tell me approximately how many speakers and how many hours I need to fine-tune it for my language (Russian)? Thanks in advance!

tranctan commented 3 years ago

I believe there is no optimal answer to this. You can start fine-tuning with everything you have and see how well the model converges. If you don't have enough data, you may encounter NaN loss. It's also worth testing the pretrained model on your language first, so that you have a baseline solution and something to compare against.

For more information, you can follow this issue thread; you may find a lot of valuable information from the author there.

I fine-tuned on a Vietnamese dataset (10k+ utterances and 500+ speakers) and the model performed quite well.

Finally, the more you train, the more clustered the embeddings become, which gives higher confidence.
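
If you want a single number for that baseline comparison, the EER over a set of verification trials is a common choice. A rough sketch with scikit-learn (the scores and labels here are dummy values; in practice you build them from same-speaker and different-speaker pairs of your own data):

```python
import numpy as np
from sklearn.metrics import roc_curve

# Dummy trial data: cosine similarity per pair, label 1 = same speaker
scores = np.array([0.92, 0.85, 0.78, 0.51, 0.40, 0.33])
labels = np.array([1, 1, 1, 0, 0, 0])

fpr, tpr, _ = roc_curve(labels, scores)
fnr = 1 - tpr
# EER: the point where the false accept rate ~= the false reject rate
eer = fpr[np.nanargmin(np.abs(fnr - fpr))]
print(f"EER ~ {eer:.3f}")
```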

ngocanh2162 commented 3 years ago

@tranctan I retrained the synthesizer on Vietnamese (one dataset with >50 h of good-quality audio from ~150 speakers, and one with ~400 h of poor-quality audio from nearly 150 speakers). The generated voice is fairly good, except that some words come out with a trembling, hoarse-in-the-throat sound. Did you run into a similar problem?

tranctan commented 3 years ago

@ngocanh2162 When working with generative models for speech, I run into hoarse-sounding words quite often, but the cause differs for every task and every dataset, so I can't say for sure.

In your case, I think the cause may be the low quality of the data (even though there is quite a lot of it). Try filtering it down to only the good data, even if that leaves you with less, and do some trial and error. The synthesis part has two models, the Synthesizer (text-to-mel) and the Vocoder (mel-to-wav); you can also check which of the two the hoarseness comes from, for example by plotting the mel-spectrograms produced by the Synthesizer.
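
For example, a quick way to eyeball the Synthesizer output (a sketch assuming you have already saved the predicted mel-spectrogram as a NumPy array; the file path is just a placeholder):

```python
import matplotlib.pyplot as plt
import numpy as np

# Predicted mel-spectrogram from the Synthesizer, shape (n_mels, n_frames)
mel = np.load("predicted_mel.npy")  # placeholder path

plt.figure(figsize=(10, 4))
plt.imshow(mel, aspect="auto", origin="lower", interpolation="none")
plt.xlabel("frames")
plt.ylabel("mel channels")
plt.title("Synthesizer output")
plt.colorbar()
plt.tight_layout()
plt.show()
```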

If improving the data doesn't help, you can try different hyperparameters or pick better TTS models (e.g. Tacotron 2 + ParallelWaveGAN), since in the Voice Cloning repo the author uses Tacotron and WaveNet, and there are better models available now.

P.S.: You could open an issue in the Voice Cloning repo about your problem, since this repo is only for the speaker encoder, I believe.

michaelgfeldman commented 3 years ago

Hi again! :) @tranctan Could you tell me the average length of an utterance in your Vietnamese dataset? Did you freeze some layers while fine-tuning, or did you just run 'python encoder_train.py pretrained'?

tranctan commented 3 years ago

The average length is about 5 s. In my observation, this model works well with utterance lengths ranging from 5 to 7 s and quite badly for much shorter utterances (< 1 s).

I did not freeze any layers; I merely continued training the pretrained model with my new dataset, and it works just fine. My two cents: the pretrained model is based mostly on English, not a big multi-language dataset, so freezing some layers plus fine-tuning may not give good results. If your language of choice is close to English, freezing could give a positive result with shorter training time.
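
If anyone does want to try freezing, a rough sketch of the idea (using resemblyzer's VoiceEncoder, which exposes its recurrent part as `lstm`; the training repo's own model class may name things differently, so adapt as needed):

```python
import torch
from resemblyzer import VoiceEncoder

encoder = VoiceEncoder()  # loads the bundled pretrained weights

# Freeze the recurrent layers, leave only the final projection trainable
for param in encoder.lstm.parameters():
    param.requires_grad = False

trainable = [p for p in encoder.parameters() if p.requires_grad]
optimizer = torch.optim.Adam(trainable, lr=1e-4)
print(sum(p.numel() for p in trainable), "trainable parameters")
```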

michaelgfeldman commented 3 years ago

@tranctan Do you have an idea how many 5-7 s utterances per speaker would be enough? I see that you have 20 utterances per speaker on average, but maybe I can get away with 5 or 10? (I have tried your suggestions on a toy dataset and the results were very promising; thanks a lot for the help you have already provided.)

tranctan commented 3 years ago

@michaelgfeldman Glad that I could help!

As for the question, I actually have no idea. The only practical thing I can think of is to just give it a try and see how it goes.

thanhlong1997 commented 3 years ago

> The average length is about 5 s. In my observation, this model works well with utterance lengths ranging from 5 to 7 s and quite badly for much shorter utterances (< 1 s).
>
> I did not freeze any layers; I merely continued training the pretrained model with my new dataset, and it works just fine. My two cents: the pretrained model is based mostly on English, not a big multi-language dataset, so freezing some layers plus fine-tuning may not give good results. If your language of choice is close to English, freezing could give a positive result with shorter training time.

Hi, I'm also running into failures with voices shorter than 1 s. Have you found a way to solve this yet? Would sampling those short clips and training on them again work, and would it affect the normal cases at all? Thanks!!

tranctan commented 3 years ago

Hi, I would prefer to answer this in English so that other people can reference it (even though we're both Vietnamese lol).

For very short utterances, what I did was duplicate the utterance multiple times to make it longer than 5 s before feeding it to the model for inference (kinda hacky). The result turned out not very positive on my side (the similarity score is still too low for the clips to be considered the same speaker), but you can give it a try with your own data. Please let me know if it works haha.
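
Roughly what I mean by duplicating (a sketch; the file path is a placeholder and 5 s is just the target length I used):

```python
import numpy as np
from resemblyzer import VoiceEncoder, preprocess_wav, sampling_rate

encoder = VoiceEncoder()
target_len = 5 * sampling_rate  # tile the clip up to about 5 seconds

wav = preprocess_wav("very_short_utterance.wav")  # placeholder path
if len(wav) < target_len:
    repeats = int(np.ceil(target_len / len(wav)))
    wav = np.tile(wav, repeats)[:target_len]

embed = encoder.embed_utterance(wav)
```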

The other approach would be to collect those very short utterances and re-train the model, as you mentioned; just make sure the amount of very short utterances is large enough. Remember to re-train rather than just keep training on the newly collected very short utterances, or the model will update its weights based only on the very short utterances and forget the normal cases it was trained on previously. Theoretically this gives the model the ability to perform better on very short utterances, but in practice we need to try it and observe how the model actually behaves.

Hope my answer helps.

thangnkHust commented 3 years ago

@tranctan Hi, this is my first time looking into this problem. Could you share how to re-train with Vietnamese data?

Currently I need the Speaker Diarization part, so that from a single input audio file the output is an array of (speaker_id, time_start, time_end).

Where can I find instructions on exporting a model like the pretrained.pt in this repo? I looked at the Voice Cloning repo, but I don't know what to run to re-train a model like pretrained.pt.

Thank you!
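
Something along these lines is what I'm aiming for; a rough sketch built on Resemblyzer's partial embeddings (the reference clips and the call recording are placeholder files, and the segment merging is deliberately naive):

```python
import numpy as np
from resemblyzer import VoiceEncoder, preprocess_wav, sampling_rate

encoder = VoiceEncoder()

# One reference embedding per known speaker (placeholder files)
speakers = {
    "spk_A": encoder.embed_utterance(preprocess_wav("ref_spk_A.wav")),
    "spk_B": encoder.embed_utterance(preprocess_wav("ref_spk_B.wav")),
}

wav = preprocess_wav("call_recording.wav")
# Partial embeddings over a sliding window (rate = embeddings per second)
_, partial_embeds, wav_splits = encoder.embed_utterance(
    wav, return_partials=True, rate=4)

segments = []  # (speaker_id, time_start, time_end)
for embed, split in zip(partial_embeds, wav_splits):
    # Assign each window to the most similar reference speaker
    name = max(speakers, key=lambda k: np.dot(embed, speakers[k]))
    start, end = split.start / sampling_rate, split.stop / sampling_rate
    if segments and segments[-1][0] == name:
        segments[-1] = (name, segments[-1][1], end)  # extend last segment
    else:
        segments.append((name, start, end))

print(segments)
```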

thanhlong1997 commented 3 years ago

> Hi, I would prefer to answer this in English so that other people can reference it (even though we're both Vietnamese lol).
>
> For very short utterances, what I did was duplicate the utterance multiple times to make it longer than 5 s before feeding it to the model for inference (kinda hacky). The result turned out not very positive on my side (the similarity score is still too low for the clips to be considered the same speaker), but you can give it a try with your own data. Please let me know if it works haha.
>
> The other approach would be to collect those very short utterances and re-train the model, as you mentioned; just make sure the amount of very short utterances is large enough. Remember to re-train rather than just keep training on the newly collected very short utterances, or the model will update its weights based only on the very short utterances and forget the normal cases it was trained on previously. Theoretically this gives the model the ability to perform better on very short utterances, but in practice we need to try it and observe how the model actually behaves.
>
> Hope my answer helps.

Hi, another problem is that the VAD in this library is webrtcvad, and it does not detect some noises in the audio, such as vehicle sounds; when I then use spectral clustering, those noises become a new cluster and make the result wrong. I have tried to ignore this problem, but with phone-call audio I can't handle every possible case. Is there any solution that can handle the VAD problem for Vietnamese phone-call data? Thanks!!!

tranctan commented 3 years ago

Sorry, I don't have any experience with VAD or handling noise, but I think you could consider noise reduction/suppression/cancellation techniques, or any other algorithm that removes background noise, as a preprocessing step.
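
For instance, a sketch with the third-party noisereduce package (not part of this repo; the call-recording path is a placeholder and the call shown is the noisereduce 2.x API), applied before VAD and embedding:

```python
import noisereduce as nr
from resemblyzer import preprocess_wav, sampling_rate

wav = preprocess_wav("phone_call.wav")  # placeholder path

# Spectral-gating noise reduction as a preprocessing step
denoised = nr.reduce_noise(y=wav, sr=sampling_rate)

# ...then run VAD / clustering / embedding on `denoised` as usual
```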

On the other hand, if the number of samples that include background noise is large enough, I think it's worth trying to re-train the model with those samples (let the model itself learn to distinguish the samples with noise).

Just my two cents. Hope it helps.

michaelgfeldman commented 2 years ago

Hi @tranctan, long time no see! I have a new task: now I have to fine-tune the model for Ukrainian, so I have a couple of questions:

  1. How do you know when to stop training? In the provided visdom environment we can see the UMAP projection, loss and EER, but it's all based on the training data (correct me if I'm wrong), so it will just keep getting better forever. (Last time we spoke, I didn't have this question because I had limited GPU time, only about 4 hours, and I just used it all.)

  2. The default utterances_per_speaker is 10, and I have some speakers with 100 utterances. Does that mean that every time such a speaker appears in a batch, 10 random utterances out of the 100 will be selected? Or will it be the same 10 utterances every time? (If the latter is true, I'll just make sure every speaker has exactly 10 utterances and save some disk space.) I'm sorry if this question is stupid, but the source code is a bit tricky for me.

Thank you a lot!