igorsitdikov / lid_kaldi

Apache License 2.0

Training & languages #4

Open Mlallena opened 3 years ago

Mlallena commented 3 years ago

If I wanted to add new languages to this program, or train the ones already present, how would I have to do it?

Also, you should update the link in the first instruction: I had to replace "latest" with "1.0.1" before I could download it.

igorsitdikov commented 2 years ago

Thank you. Unfortunately, you have to train your own model for a new language. Alternatively, you can try https://huggingface.co/TalTechNLP/voxlingua107-epaca-tdnn

Mlallena commented 2 years ago

What did you use to train your own model? I'm asking because (unless I missed something) this repository doesn't have any code that is clearly used for training.

igorsitdikov commented 2 years ago

Have a look at #1.

Mlallena commented 2 years ago

Thanks, I'll have a look.

Mlallena commented 2 years ago

OK, I have been checking, and it could work. The thing is, from what you said in #1, the only modification you made was to the utt2spk file, but where is this file stored? I'm going to go out on a limb and say that it is stored in a data folder within v2, but the main problem is that the run.sh file doesn't refer to that file. I'd also have to change which corpus it targets, since my audio files are in a different folder.

Any help you can give me would be welcome.

asadullah797 commented 2 years ago

Hi Igor, I am training a Kaldi recipe on VoxLingua data for the language identification task, but I could not find a trials file. Could you please share the trials file with me? Many thanks.

igorsitdikov commented 2 years ago

Hello @asadullah797. You can generate the file on your own. It will look something like this:

lang-id-A utt-id-A target
lang-id-A utt-id-B nontarget
lang-id-A utt-id-C nontarget
lang-id-B utt-id-A nontarget
lang-id-B utt-id-B target

For 3 files and 3 languages:

en utt-en target
en utt-ru nontarget
en utt-pl nontarget
ru utt-en nontarget
ru utt-ru target
ru utt-pl nontarget
pl utt-en nontarget
pl utt-ru nontarget
pl utt-pl target

Sorry, I don't remember exactly; columns 1 and 2 may need to be swapped.

asadullah797 commented 2 years ago

For the language ID task, how do you define

lang-id-A utt-id-B nontarget

I mean, how do you decide whether a given utterance is target or nontarget? Thanks.

igorsitdikov commented 2 years ago

Say you have a dataset with 3 languages, and each wav file contains only one language. You should have a mapping from wav file to language; the pair of a file and its own language is a target. The other 2 languages are nontarget for that file.

asadullah797 commented 2 years ago

Just to confirm, with wav1 -> en, wav2 -> es, wav3 -> de:

en wav1 target
es wav1 nontarget
de wav1 nontarget

and so on for the other cases as well.

igorsitdikov commented 2 years ago

I think so. But as I wrote before, if it does not work, try swapping columns 1 and 2 like this (sorry, I really don't remember):

wav1 en target
wav1 es nontarget
wav1 de nontarget
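
For reference, the trials layout described above can be generated from a wav-to-language mapping with a few lines of Python. This is only a minimal sketch: the function name `make_trials`, the `utt_first` flag, and the in-memory `utt2lang` dictionary are hypothetical and not part of this repository or of Kaldi. In practice the mapping would probably be read from a file such as utt2lang or utt2spk, and the correct column order still has to be checked against the scoring script in use.

```python
from typing import Dict, Iterable, List


def make_trials(utt2lang: Dict[str, str],
                languages: Iterable[str],
                utt_first: bool = False) -> List[str]:
    """Build trial lines for every (language, utterance) pair.

    A pair is labelled "target" when the language is the utterance's own
    language and "nontarget" otherwise. Set utt_first=True to emit the
    utterance id in column 1 instead of the language id (hypothetical flag,
    for the "swap columns 1 and 2" case mentioned in the thread).
    """
    langs = sorted(set(languages))
    lines = []
    for utt, true_lang in sorted(utt2lang.items()):
        for lang in langs:
            label = "target" if lang == true_lang else "nontarget"
            first, second = (utt, lang) if utt_first else (lang, utt)
            lines.append(f"{first} {second} {label}")
    return lines


if __name__ == "__main__":
    # Toy mapping taken from the example in the thread.
    utt2lang = {"wav1": "en", "wav2": "es", "wav3": "de"}
    for line in make_trials(utt2lang, utt2lang.values()):
        print(line)
```

With N utterances and K languages this yields N x K lines, exactly N of which are labelled target.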

asadullah73-ce commented 2 years ago

Hi Igor, I have prepared the trials file using https://github.com/kaldi-asr/kaldi/blob/master/egs/aishell/v1/local/produce_trials.py, but at the end of the script I am getting this kind of error:

Key de071xs-uBRZoU__S10---0150.960-0167.120 not present in training iVectors

The key in the message above is the utterance ID. Please note that I created the trials file from the test data utt2spk.
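
An error like this reports that a key from the trials file was looked up among the training iVectors and not found. One possible cause, given the column-order uncertainty mentioned earlier in the thread, is that the utterance ID sits in the column where the scoring step expects a training-side key; this is an assumption, not a confirmed diagnosis. The sketch below checks each column of a trials file against two sets of known keys. The function name `missing_keys` and the way the key sets are obtained are hypothetical; collecting the actual keys from the iVector archives is not shown.

```python
from typing import Iterable, Set, Tuple


def missing_keys(trial_lines: Iterable[str],
                 train_keys: Set[str],
                 test_keys: Set[str]) -> Tuple[Set[str], Set[str]]:
    """Return (column-1 keys absent from train_keys,
               column-2 keys absent from test_keys).

    Lines that do not have exactly three whitespace-separated fields
    are skipped.
    """
    missing_first: Set[str] = set()
    missing_second: Set[str] = set()
    for line in trial_lines:
        parts = line.split()
        if len(parts) != 3:
            continue
        first, second, _label = parts
        if first not in train_keys:
            missing_first.add(first)
        if second not in test_keys:
            missing_second.add(second)
    return missing_first, missing_second


if __name__ == "__main__":
    # Toy data: the training side is keyed by language, the test side by utterance.
    train_keys = {"en", "es", "de"}
    test_keys = {"wav1", "wav2", "wav3"}
    swapped = ["wav1 en target", "wav1 es nontarget"]
    print(missing_keys(swapped, train_keys, test_keys))
```

If every column-1 key is reported missing while the same keys exist on the test side, swapping the two columns is a reasonable first thing to try.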