srvk / eesen

The official repository of the Eesen project
http://arxiv.org/abs/1507.08240
Apache License 2.0

How to use my Chinese corpus to train an ASR model with the asr_egs/hkust/v1 example #107

Open Sundy1219 opened 7 years ago

Sundy1219 commented 7 years ago

Hello, I am studying the Eesen scripts in the asr_egs/hkust/v1 directory, but I cannot access the LDC2005S15 and LDC2005T32 corpora. Question 1: Is there any way to download them? Question 2: If I want to use my own Chinese corpus to train an ASR model, what format should the corpus (wav files and transcripts) be in, and how do I prepare it? Thanks, looking forward to your reply.

riebling commented 7 years ago

Regarding Question 1 (is there any way to download the corpora?):

Unfortunately these need to be purchased from the LDC; they are not open source. You might be permitted to use them if you are part of a university or organization that has an LDC membership.

Regarding Question 2 (what format should the corpus be in, and how to prepare it?):

They should be in a similar format to other LDC corpora. This usually means a folder containing the subfolders dev/, test/, and train/ (split your data at roughly 10%, 10%, and 80% respectively), each of which contains subfolders for audio and transcripts. The audio format is flexible if you use programs such as ffmpeg, sox, avconv, or lame, but LDC corpora are usually in SPHERE format, and the audio folder is named sph/ accordingly. The format for the transcripts is STM, which is defined here: http://www1.icsi.berkeley.edu/Speech/docs/sctk-1.2/infmts.htm
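To make that concrete, here is a rough sketch of such a corpus layout (the directory and file names are only illustrative, not something Eesen requires by name), followed by the general shape of an STM transcript line with made-up values:

my_corpus/
    train/    (roughly 80% of the data)
        sph/    (audio: SPHERE, WAV, ...)
        stm/    (transcripts in STM format)
    dev/      (roughly 10%)
        sph/
        stm/
    test/     (roughly 10%)
        sph/
        stm/

An STM line is: waveform-filename channel speaker begin-time end-time [<attribute-label>] transcript, for example:

recording_001 1 spk01 12.34 17.89 <o,f0,male> 你 好 我 是 ...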

Although it is a large download, you can have a look at an example corpus that uses the open source Tedlium data set. The first steps of the Eesen master script run_ctc_phn.sh call the scripts in https://github.com/srvk/eesen/tree/master/asr_egs/tedlium/v1/local to get this data.

You will also need dictionary and language model data for Chinese, such as the Mandarin models found here: https://sourceforge.net/projects/cmusphinx/files/Acoustic%20and%20Language%20Models/Mandarin/. For Mandarin, a similar set of scripts to the Tedlium experiment exists, e.g. gale_prep_dict.sh and gale_train_lms.sh, here: https://github.com/kaldi-asr/kaldi/tree/master/egs/gale_mandarin/s5/local


Sundy1219 commented 7 years ago

Thank you for your reply. I downloaded the acoustic and language models that you pointed to, but how can I open the .DMP files? I have tried a lot of methods and none of them work. Looking forward to your reply @riebling

riebling commented 7 years ago

I'm not really sure what to do with those files; perhaps they are just a duplicate of the data in a different format? See: http://cmusphinx.sourceforge.net/wiki/tutoriallm They mention that the DMP format is old and not recommended, but also that there is a Sphinx tool that can be used to convert it to more usable formats.
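If I am reading that tutorial right, the tool is sphinx_lm_convert from sphinxbase; something along these lines should turn a .DMP model into an ARPA text language model (the file names are placeholders and the exact flags are from memory, so check them against the tutorial page above):

sphinx_lm_convert -i zh_broadcastnews.lm.DMP -o zh_broadcastnews.lm.arpa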


Sundy1219 commented 7 years ago

Thank you for your reply. I have successfully composed the decoding graph TLG.fst from token.fst, lexicon.fst, and grammar.fst. The CTC labels are Chinese characters. I want to test this graph using only the labels that the RNN outputs. Are there any methods or scripts for this? Looking forward to your reply @riebling

riebling commented 7 years ago

Eesen has all the scripts you will need if you look at a full experiment, for example asr_egs/tedlium/v2-30ms/run_ctc_phn.sh. The decoding and testing happen at the end: the decode_ctc_lat.sh script decodes using your TLG.fst, given as an argument the name of the folder that contains TLG.fst and words.txt, for example data/lang_phn_test.

Let's also assume the audio to decode has been preprocessed, by commands like those at the beginning of run_ctc_phn.sh, into a folder called data/test, so that the folder contents look something like this:

data/dev:
    glm
    reco2file_and_channel
    segments
    spk2utt
    stm
    text
    utt2spk
    wav.scp
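For reference, the per-line formats of those files follow the usual Kaldi-style data directory conventions, which Eesen reuses; the IDs and paths below are made up:

wav.scp:    rec_001 /path/to/audio/rec_001.wav
segments:   utt_001 rec_001 0.00 4.25    (utterance-id recording-id start-time end-time)
text:       utt_001 你 好 吗
utt2spk:    utt_001 spk_01
spk2utt:    spk_01 utt_001 utt_002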

Then, just as in the example, the decoding and testing occur at the end of run_ctc_phn.sh. The command looks like this and, most importantly, calls decode_ctc_lat.sh:

steps/decode_ctc_lat.sh --cmd "$decode_cmd" --nj 11 --beam 17.0 --lattice_beam 8.0 --max-active 5000 --acwt 0.6 \
    data/lang_phn_test data/test $dir/decode_test

So your TLG.fst is found via the "data/lang_phn_test" argument, and decode_ctc_lat.sh assumes the trained RNN can be found in the directory $dir (set to a value such as "exp/train_phn_l5_c320"); it creates results in a new output folder beneath that, for example exp/train_phn_l5_c320/decode_test/.
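Once decoding finishes, the scoring output lands under that decode folder. Assuming the Kaldi-style scoring used by the recipes writes the usual wer_* files there (worth double-checking on your setup), you can compare word error rates across the different weights with something like:

grep WER exp/train_phn_l5_c320/decode_test/wer_* | sort -k2 -n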


Sundy1219 commented 7 years ago

Thank you very much, but maybe you misunderstood my intention. I just want to test the decoding graph with the lexicon and language model; I don't want to test the acoustic model. For example, the CTC labels that the RNN models are the letters [a-z] plus blank. Suppose that, after feeding in the wav features, the labels the RNN outputs are ['h' 'a' 'l' 'l' 'o']. How do I use the decoding graph built from token.fst, lexicon.fst, and the language model to get the correct labels ['h' 'e' 'l' 'l' 'o']? How can I test TLG.fst with the labels that the RNN outputs? Thanks! Looking forward to your reply!

riebling commented 7 years ago

Sorry, my interpretation of "testing" was not what you meant. I'm also sorry to say that I don't know of any way to test only the TLG.fst decoding graph by itself.

I do know there are two different styles of Eesen CTC experiment, one using symbols that are characters, as you describe, and one using phonemes; the only difference is the dictionary contents.
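To illustrate (these entries are made up, not taken from any real lexicon), in the character-style setup the lexicon maps each word to its characters, while in the phoneme-style setup it maps each word to phones, e.g. pinyin-like units:

character-based lexicon.txt:    你好  你 好
phoneme-based lexicon.txt:      你好  n i3 h ao3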

I am starting to understand your question thanks to your examples. It looks like you want a way to test this: given a sequence of phoneme labels (the kind of thing the RNN model produces) and given TLG.fst, produce a sequence of characters, then verify that the characters output by this process are correct according to the dictionary, which contains the correct mapping of words to phonemes.

input:             ['h' 'a' 'l' 'l' 'o']
expected output:   ['h' 'e' 'l' 'l' 'o']
observed output:   ['h' 'a' 'l' 'o']      = test fails
versus
observed output:   ['h' 'e' 'l' 'l' 'o']  = test succeeds

Am I close? This must get interesting for Chinese. Which flavor of CTC experiment best fits Chinese, character (v1/run_ctc_char.sh) or phoneme (v1/run_ctc_phn.sh)?

If nothing else, thanks for helping my understanding of this by making me try to understand it and explain it better to you. I am not an expert in the theory so much as in the mechanics of the systems: simply running the programs and scripts with data.
