srvk / eesen

The official repository of the Eesen project
http://arxiv.org/abs/1507.08240
Apache License 2.0

Missing labels in training / decoding in tf_clean branch #169

Open martiansideofthemoon opened 6 years ago

martiansideofthemoon commented 6 years ago

Hello, I am trying to run the TensorFlow-based EESEN setup for Switchboard. More specifically, I am using the tf_clean branch and trying to run the asr_egs/swbd/v1-tf/run_ctc_char.sh script. I am having some trouble with the training and decoding steps and would appreciate your help! @ramonsanabria , @fmetze

During stage 3 (training), I get a number of error messages of the form:

********************************************************************************
********************************************************************************
Warning: sw02018-B_012508-012721 has not been found in labels file: /scratch/tmp.1hi5uR4EIR/labels.cv
********************************************************************************
********************************************************************************

Here are the training logs that follow. I suspect that creating tr_y from scratch is the problem?

cleaning done: /scratch/tmp.1hi5uR4EIR/cv_local.scp
original scp length: 4000
scp deleted: 270
final scp length: 3730
number of labels not found: 270
TRAINING STARTS [2018-Jan-28 06:02:05]
--------------------------------------------------------------------------------
--------------------------------------------------------------------------------
2.7.13 |Anaconda 4.3.1 (64-bit)| (default, Dec 20 2016, 23:09:15) 
[GCC 4.4.7 20120313 (Red Hat 4.4.7-1)]
('now:', 'Sun 2018-01-28 06:02:08')
('tf:', '1.1.0')
('cwd:', '/share/data/lang/users/kalpesh/eesen/asr_egs/swbd/v1-tf')
('library:', '/share/data/lang/users/kalpesh/eesen')
('git:', 'heads/master-dirty')
--------------------------------------------------------------------------------
--------------------------------------------------------------------------------
reading training set
--------------------------------------------------------------------------------
--------------------------------------------------------------------------------
tr_x:
--------------------------------------------------------------------------------
non augmented (mix) training set found for language: no_name_language ... 

preparing dictionary for no_name_language...

ordering all languages (from scratch) train batches... 

Augmenting data x3 and win 3...

--------------------------------------------------------------------------------
tr_y:
--------------------------------------------------------------------------------
creating tr_y from scratch...
unilanguage setup detected (in labels)... 

--------------------------------------------------------------------------------
cv_x:
--------------------------------------------------------------------------------
unilingual set up detected on test or  set language... 

cv (feats) found for language: no_name_language ... 

preparing dictionary for no_name_language...

ordering all languages (from scratch) cv batches... 

Augmenting data x3 and win 3...

--------------------------------------------------------------------------------
cv_y:
--------------------------------------------------------------------------------
creating cv_y from scratch...
unilanguage setup detected (in labels)... 

languages checked ...
(cv_x vs cv_y vs tr_x vs tr_y)
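For what it's worth, here is a quick check to list which cv utterances have features but no labels (a rough sketch, assuming both files are keyed by utterance id in the first column; it should be run on the scp before the cleaning step removes the offending entries):

cut -d' ' -f1 /scratch/tmp.1hi5uR4EIR/cv_local.scp | sort > /tmp/cv_ids
cut -d' ' -f1 /scratch/tmp.1hi5uR4EIR/labels.cv | sort > /tmp/label_ids
comm -23 /tmp/cv_ids /tmp/label_ids   # ids with features but no entry in labels.cv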

Finally, here are my decoding logs:

(python2.7_tf1.4) kalpesh@kalpesh:v1-tf$ ./run_ctc_char.sh 
=====================================================================
                   Decoding eval200 using AM                      
=====================================================================
./steps/decode_ctc_am_tf.sh --config exp/train_char_l4_c320_mdeepbilstm_w3_nfalse/model/config.pkl --data ./data/eval2000/ --weights exp/train_char_l4_c320_mdeepbilstm_w3_nfalse/model/epoch25.ckpt --results exp/train_char_l4_c320_mdeepbilstm_w3_nfalse/results/epoch25
exp/train_char_l4_c320_mdeepbilstm_w3_nfalse/model/config.pkl: line 135: syntax error: unexpected end of file
copy-feats 'ark,s,cs:apply-cmvn --norm-vars=true --utt2spk=ark:./data/eval2000//utt2spk scp:./data/eval2000//cmvn.scp scp:./data/eval2000//feats.scp ark:- |' ark,scp:/scratch/tmp.GgS1if0Wex/f.ark,/scratch/tmp.GgS1if0Wex/test_local.scp 
apply-cmvn --norm-vars=true --utt2spk=ark:./data/eval2000//utt2spk scp:./data/eval2000//cmvn.scp scp:./data/eval2000//feats.scp ark:- 
LOG (apply-cmvn[5.3.85~1-35950]:main():apply-cmvn.cc:159) Applied cepstral mean and variance normalization to 4458 utterances, errors on 0
LOG (copy-feats[5.3.85~1-35950]:main():copy-feats.cc:143) Copied 4458 feature matrices.
--------------------------------------------------------------------------------
--------------------------------------------------------------------------------
2.7.13 |Anaconda 4.3.1 (64-bit)| (default, Dec 20 2016, 23:09:15) 
[GCC 4.4.7 20120313 (Red Hat 4.4.7-1)]
('now:', 'Sun 2018-01-28 06:05:28')
('tf:', '1.1.0')
('cwd:', '/share/data/lang/users/kalpesh/eesen/asr_egs/swbd/v1-tf')
('library:', '/share/data/lang/users/kalpesh/eesen')
('git:', 'heads/master-dirty')
--------------------------------------------------------------------------------
--------------------------------------------------------------------------------
reading testing set
--------------------------------------------------------------------------------
--------------------------------------------------------------------------------
test_x:
--------------------------------------------------------------------------------
unilingual set up detected on test or  set language... 

test (feats) found for language: no_name_language ... 

preparing dictionary for no_name_language...

ordering all languages (from scratch) test batches... 

Augmenting data x3 and win 3...

--------------------------------------------------------------------------------
--------------------------------------------------------------------------------
test_y (for ter computation):
--------------------------------------------------------------------------------
unilanguage setup detected (in labels)... 

no label files fins in /scratch/tmp.GgS1if0Wex with info_set: test
file: /share/data/lang/users/kalpesh/eesen/tf/ctc-am/reader/labels_reader/labels_reader.py function: __read_one_language line: 171
exiting...

Here are my logs from the first two stages (data preparation, fbank generation):

(python2.7_tf1.4) kalpesh@kalpesh:v1-tf$ ./run_ctc_char.sh 
=====================================================================
                       Data Preparation                            
=====================================================================
Switchboard-1 data preparation succeeded.
utils/fix_data_dir.sh: filtered data/train/segments from 264333 to 264072 lines based on filter /scratch/tmp.V26jBobg4D/recordings.
utils/fix_data_dir.sh: filtered /scratch/tmp.V26jBobg4D/speakers from 4876 to 4870 lines based on filter data/train/cmvn.scp.
utils/fix_data_dir.sh: filtered data/train/spk2utt from 4876 to 4870 lines based on filter /scratch/tmp.V26jBobg4D/speakers.
fix_data_dir.sh: kept 263890 utterances out of 264072
fix_data_dir.sh: old files are kept in data/train/.backup
Can't open data/local/dict_phn/lexicon1.txt: No such file or directory at local/swbd1_map_words.pl line 26.
Character-based dictionary (word spelling) preparation succeeded
Warning: for utterances en_4910-B_013563-013763 and en_4910-B_013594-013790, segments already overlap; leaving these times unchanged.
Warning: for utterances en_4910-B_025539-025791 and en_4910-B_025541-025674, segments already overlap; leaving these times unchanged.
Warning: for utterances en_4910-B_032263-032658 and en_4910-B_032299-032406, segments already overlap; leaving these times unchanged.
Warning: for utterances en_4910-B_035678-035757 and en_4910-B_035715-035865, segments already overlap; leaving these times unchanged.
Data preparation and formatting completed for Eval 2000
(but not MFCC extraction)
fix_data_dir.sh: kept 4458 utterances out of 4466
fix_data_dir.sh: old files are kept in data/eval2000/.backup
=====================================================================
                    FBank Feature Generation                       
=====================================================================
steps/make_fbank.sh --cmd run.pl --nj 32 data/train exp/make_fbank_pitch/train fbank_pitch
steps/make_fbank.sh: moving data/train/feats.scp to data/train/.backup
utils/validate_data_dir.sh: Successfully validated data-directory data/train
steps/make_fbank.sh [info]: segments file exists: using that.
Succeeded creating filterbank features for train
steps/compute_cmvn_stats.sh data/train exp/make_fbank_pitch/train fbank_pitch
Succeeded creating CMVN stats for train
fix_data_dir.sh: kept all 263890 utterances.
fix_data_dir.sh: old files are kept in data/train/.backup
steps/make_fbank.sh --cmd run.pl --nj 10 data/eval2000 exp/make_fbank_pitch/eval2000 fbank_pitch
steps/make_fbank.sh: moving data/eval2000/feats.scp to data/eval2000/.backup
utils/validate_data_dir.sh: Successfully validated data-directory data/eval2000
steps/make_fbank.sh [info]: segments file exists: using that.
Succeeded creating filterbank features for eval2000
steps/compute_cmvn_stats.sh data/eval2000 exp/make_fbank_pitch/eval2000 fbank_pitch
Succeeded creating CMVN stats for eval2000
fix_data_dir.sh: kept all 4458 utterances.
fix_data_dir.sh: old files are kept in data/eval2000/.backup
utils/subset_data_dir.sh: reducing #utt from 263890 to 4000
utils/subset_data_dir.sh: reducing #utt from 263890 to 259890
utils/subset_data_dir.sh: reducing #utt from 259890 to 100000
Reduced number of utterances from 100000 to 76615
Using fix_data_dir.sh to reconcile the other files.
fix_data_dir.sh: kept 76615 utterances out of 100000
fix_data_dir.sh: old files are kept in data/train_100k_nodup/.backup
Reduced number of utterances from 259890 to 192701
Using fix_data_dir.sh to reconcile the other files.
fix_data_dir.sh: kept 192701 utterances out of 259890
fix_data_dir.sh: old files are kept in data/train_nodup/.backup
fmetze commented 6 years ago

Kalpesh,

Ramon would know best about the "v1-tf" recipe, but I can see an error message that says "Can't open data/local/dict_phn/lexicon1.txt: No such file or directory at local/swbd1_map_words.pl line 26.", which shows that you did not run the "phn" recipe before running the "char" recipe. You need to do this so that both of them use the same vocabulary. Next, you can configure the location of the temp folder in path.sh; you will want to change it to "/tmp" or something similar if you don't have "/scratch", which is the default on our cluster. There is also "exp/train_char_l4_c320_mdeepbilstm_w3_nfalse/model/config.pkl: line 135: syntax error: unexpected end of file", which means the training perhaps didn't start correctly, or not at all?
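A minimal sketch of that order of operations (the phn script name and the temp-folder variable below are assumptions; check the v1-tf directory and path.sh for the actual names):

cd asr_egs/swbd/v1-tf
./run_ctc_phn.sh     # assumed name of the phn recipe; creates data/local/dict_phn/
./run_ctc_char.sh    # the char recipe can then reuse the same vocabulary
# in path.sh, point the temp folder at a directory that exists on your machine, e.g.:
# export TMPDIR=/tmp   # variable name is an assumption; check path.sh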

Let me know if you have any other questions!

Florian


martiansideofthemoon commented 6 years ago

Hi @fmetze ,

"Can't open data/local/dict_phn/lexicon1.txt: No such file or directory at local/swbd1_map_words.pl line 26.”

Yes, I hadn't run the phn recipe; the error disappears after doing so. Do I need to run a decoding pass with the phn recipe too?

There is also "exp/train_char_l4_c320_mdeepbilstm_w3_nfalse/model/config.pkl: line 135: syntax error: unexpected end of file”

This is an irrelevant error; it happens because the pickle configuration file is sourced by the utils/parse_options.sh script. It does not affect further execution.
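As an illustration of why it is harmless, a hypothetical guard (not the actual EESEN code) could skip shell-sourcing anything that is a pickle and leave it to the Python side:

# sketch only: $config stands for whatever variable holds the --config argument
case "$config" in
  *.pkl) : ;;            # binary pickle: parsed later by the Python code, not by bash
  *)     . "$config" ;;  # shell-style configs can still be sourced as before
esac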

which means maybe the training didn’t start correctly, or not at all?

The training did complete successfully; here are the training logs. Just to confirm, is it usual for the Kaldi setup to discard 270 dev utterances, 11 eval2000 utterances, and 973 train utterances because of transcripts like [vocalized-noise]?

for language: no_name_language
following variables will be optimized: 
--------------------------------------------------------------------------------
<tf.Variable 'cudnn_lstm/params:0' shape=<unknown> dtype=float32_ref>
<tf.Variable 'output_layers/output_fc_no_name_language_no_target_name/weights:0' shape=(640, 42) dtype=float32_ref>
<tf.Variable 'output_layers/output_fc_no_name_language_no_target_name/biases:0' shape=(42,) dtype=float32_ref>
--------------------------------------------------------------------------------
--------------------------------------------------------------------------------
[2018-02-01 11:31:59] Epoch 1 starting, learning rate: 0.03
[2018-02-01 12:23:40] Epoch 1 finished in 52 minutes
        Train    cost: 86.2, ter: 35.6%, #example: 491721
        Validate cost: 45.4, ter: 24.9%, #example: 11190
('not updating learning rate, parameters', 8, 0.0005)
--------------------------------------------------------------------------------
....
....
[2018-02-02 07:10:09] Epoch 23 starting, learning rate: 0.0005
[2018-02-02 08:05:53] Epoch 23 finished in 56 minutes
        Train    cost: 8.1, ter: 3.4%, #example: 491721
        Validate cost: 37.9, ter: 15.3%, #example: 11190
('not updating learning rate, parameters', 8, 0.0005)
--------------------------------------------------------------------------------

However, the decoding does not get anywhere. Here are the logs. The suspicious lines seem to be "no label files fins in /scratch/tmp.jihiXHPJkp with info_set: test" and "no_name_language". One important point: I am starting the bash script directly from the decoding stage (stage 4). Is it necessary to re-run stage 1 or 2 after I have a trained model?

=====================================================================
                   Decoding eval200 using AM                      
=====================================================================
./steps/decode_ctc_am_tf.sh --config exp/train_char_l4_c320_mdeepbilstm_w3_nfalse/model/config.pkl --data ./data/eval2000/ --weights exp/train_char_l4_c320_mdeepbilstm_w3_nfalse/model/epoch14.ckpt --results exp/train_char_l4_c320_mdeepbilstm_w3_nfalse/results/epoch14
exp/train_char_l4_c320_mdeepbilstm_w3_nfalse/model/config.pkl: line 135: syntax error: unexpected end of file
copy-feats 'ark,s,cs:apply-cmvn --norm-vars=true --utt2spk=ark:./data/eval2000//utt2spk scp:./data/eval2000//cmvn.scp scp:./data/eval2000//feats.scp ark:- |' ark,scp:/scratch/tmp.jihiXHPJkp/f.ark,/scratch/tmp.jihiXHPJkp/test_local.scp 
apply-cmvn --norm-vars=true --utt2spk=ark:./data/eval2000//utt2spk scp:./data/eval2000//cmvn.scp scp:./data/eval2000//feats.scp ark:- 
LOG (apply-cmvn[5.3.85~1-35950]:main():apply-cmvn.cc:159) Applied cepstral mean and variance normalization to 4458 utterances, errors on 0
LOG (copy-feats[5.3.85~1-35950]:main():copy-feats.cc:143) Copied 4458 feature matrices.
--------------------------------------------------------------------------------
--------------------------------------------------------------------------------
2.7.14 |Anaconda, Inc.| (default, Dec  7 2017, 17:05:42) 
[GCC 7.2.0]
('now:', 'Fri 2018-02-02 09:13:09')
('tf:', '1.4.0-rc1')
('cwd:', '/share/data/lang/users/kalpesh/eesen/asr_egs/swbd/v1-tf')
('library:', '/share/data/lang/users/kalpesh/eesen')
('git:', 'heads/master-dirty')
--------------------------------------------------------------------------------
--------------------------------------------------------------------------------
reading testing set
--------------------------------------------------------------------------------
--------------------------------------------------------------------------------
test_x:
--------------------------------------------------------------------------------
unilingual set up detected on test or  set language... 

test (feats) found for language: no_name_language ... 

preparing dictionary for no_name_language...

ordering all languages (from scratch) test batches... 

Augmenting data x3 and win 3...

--------------------------------------------------------------------------------
--------------------------------------------------------------------------------
test_y (for ter computation):
--------------------------------------------------------------------------------
unilanguage setup detected (in labels)... 

no label files fins in /scratch/tmp.jihiXHPJkp with info_set: test
file: /share/data/lang/users/kalpesh/eesen/tf/ctc-am/reader/labels_reader/labels_reader.py function: __read_one_language line: 171
exiting...
fmetze commented 6 years ago

Good. I am not sure about the pickle error, but if you say it does not affect the training, then things should be fine. You should be fine running the test script from stage 4 only for decoding; the data should already be prepared. @ramonsanabria - any ideas about v1-tf here?

ramonsanabria commented 6 years ago

Hi,

The pickle error is irrelevant. The configuration is loaded properly. I will try to remove it as soon as I have time.

@xinjli is cleaning up the swbd recipe. I have some experiments with different char-based units (removing numbers and noises) that for now seem to be improving results a bit.

xinjli commented 6 years ago

I also ran into the issue today that the char recipe cannot run without the phn recipe. The same issue occurs in the swbd v1 recipe on the master branch. I will prepare a fix for it.

martiansideofthemoon commented 6 years ago

Hi @ramonsanabria , @xinjli Any idea about the "no label files fins in /scratch/tmp.jihiXHPJkp with info_set: test" error I am receiving?

ramonsanabria commented 6 years ago

can you do: find /scratch/tmp.jihiXHPJkp ?


martiansideofthemoon commented 6 years ago

@ramonsanabria yes, I can find it.

kalpesh@kalpesh:kalpesh$ ls /scratch/tmp.jihiXHPJkp
f.ark  test_local.scp
kalpesh@kalpesh:kalpesh$

I checked the code: the system searches for a file named labels.test but fails to find it. I tried to use the ./local/swbd1_prepare_phn_dict_tf.py script to generate the test labels (as is done for the training data), but I get an empty labels file. In my previous setup I used the hubscr.pl scoring script to generate detailed output from the raw decoded transcripts.

What is the correct way to integrate this script into EESEN?

xinjli commented 6 years ago

I think we need a stage to generate labels.test for testing, after labels.tr and labels.cv are prepared. It seems that we do not have any script for this now. I think we need something like:

python ./local/swbd1_prepare_char_dict_tf.py --text_file ./data/train_nodup/text --input_units ./data/local/dict_char/units.txt --output_labels $dir_am/labels.tr --lower_case --ignore_noises || exit 1

xinjli commented 6 years ago

Probably we can use the following command to generate labels.test:

python ./local/swbd1_prepare_char_dict_tf.py --text_file ./data/eval2000/text --input_units ./data/local/dict_char/units.txt --output_labels $dir_am/labels.test

eval2000 contains the text we need for evaluation; just replace $dir_am with the appropriate variable in your environment.
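Spelled out with a concrete value for $dir_am (the path below is just the experiment directory from this thread; adjust it to your setup):

dir_am=exp/train_char_l4_c320_mdeepbilstm_w3_nfalse   # placeholder: use your own experiment dir
python ./local/swbd1_prepare_char_dict_tf.py \
    --text_file ./data/eval2000/text \
    --input_units ./data/local/dict_char/units.txt \
    --output_labels $dir_am/labels.test || exit 1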

ramonsanabria commented 6 years ago

You have:

https://github.com/srvk/eesen/blob/tf_clean/asr_egs/swbd/v1-tf/local/swbd1_prepare_char_dict_tf.py

This script can generate the units.txt. If you pass --output_units, it will produce the units that you will use further on (presumably with your training text). Then, the units produced by this script are used as --input_units to generate labels.cv or labels.test.

I am not sure which version is there, but I performed some cleaning of swbd that we should discuss.
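A sketch of that two-step flow (assuming --output_units and --output_labels can be given in the same invocation; the exact flag set may differ between versions of the script, and $dir_am again stands for your experiment directory):

# 1) derive the character unit inventory from the training text
python ./local/swbd1_prepare_char_dict_tf.py \
    --text_file ./data/train_nodup/text \
    --output_units ./data/local/dict_char/units.txt \
    --output_labels $dir_am/labels.tr

# 2) reuse those units to label the cv and test sets
python ./local/swbd1_prepare_char_dict_tf.py \
    --text_file ./data/eval2000/text \
    --input_units ./data/local/dict_char/units.txt \
    --output_labels $dir_am/labels.test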


martiansideofthemoon commented 6 years ago

@ramonsanabria could you describe the process you are using to compute the final WER of a trained model?

I guess this is often called "scoring" in the Kaldi setup. Generally, raw transcripts are fed into hubscr.pl to generate a number of detailed output files, with the final SWBD, CH, and Combined WER reported in a *.lur file.
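For reference, the Kaldi Switchboard recipes call the NIST scorer roughly like this (a sketch from memory; the exact flags and paths are in Kaldi's local/score_sctk.sh, and the paths below are placeholders):

hubscr=$KALDI_ROOT/tools/sctk/bin/hubscr.pl    # NIST SCTK scoring wrapper
hubdir=$(dirname $hubscr)
data=./data/eval2000                           # needs the stm and glm reference files
ctm=path/to/your/eval2000.ctm                  # hypothesis transcripts in CTM format
$hubscr -p $hubdir -V -l english -h hub5 -g $data/glm -r $data/stm $ctm
# per-subset (SWBD/CH) and combined WER end up in the generated *.lur file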

martiansideofthemoon commented 6 years ago

Hi @ramonsanabria any update on the above? Also, how have you treated the space character? I cannot find an entry for the space in data/local/dict_char/units.txt. (Note I'm referring to the <space> character, not the CTC blank symbol.)