MNCTTY opened this issue 5 years ago
Maybe you should check the file "exp/train_in240_hidden256_e3_lstm_drop0.2_dot_emb512_hidden512_d1_epoch20_norm5_bs32_mli800_mlo150_adam_lr1e-3_mmt0_l21e-5_delta/decode_test_beam30_nbest1ml100/data.json"?
Yes, there is no such file, as I said. But as I understand it, it should be generated at some stage like every other file in that directory. It isn't, and I want to find out why: in the log, all previous stages finished with no errors.
Yeah, it should be generated when decoding starts. I am running the training process now, and it will be finished tomorrow; then I'll see whether I get the same problem.
yep thanks
I found the same problem, but the main reason is not that the file doesn't exist.
In my case, I found an encoding error in "exp/train_in240_hidden256_e3_lstm_drop0.2_dot_emb512_hidden512_d1_epoch20_norm5_bs128_mli800_mlo150_adam_lr1e-3_mmt0_l21e-5_delta/decode_test_beam30_nbest1ml100/decode.log".
I used "export PYTHONIOENCODING=UTF-8" to fix it.
Yep, we found an encoding error earlier and solved it in a similar way. So, after fixing the encoding error, did you run all of run.sh without any errors?
Yes, I am waiting for decoding to finish now. The recognition results seem okay.
How did you run recognition without the decoding stage from run.sh?
OK, our news:
We SOMEHOW managed to run the decoding stage. For this we copied data.json from dump/test into the folder where data.json is not found by run.sh, added utf-8 encoding in several new places, and changed rec_token_id to token_id, because we thought it was a typo.
And: stage 4 finally ran successfully, and here is what it said:
karina@karina:~/Listen-Attend-Spell/egs/aishell$ ./run.sh
dictionary: data/lang_1char/train_chars.txt
Stage 4: Decoding
run.pl: job failed, log is in exp/train_in240_hidden256_e3_lstm_drop0.2_dot_emb512_hidden512_d1_epoch1_norm5_bs32_mli800_mlo150_adam_lr1e-3_mmt0_l21e-5_delta/decode_test_beam30_nbest1_ml100/decode.log
2019-09-03 19:05:02,215 (json2trn:24) INFO: reading exp/train_in240_hidden256_e3_lstm_drop0.2_dot_emb512_hidden512_d1_epoch1_norm5_bs32_mli800_mlo150_adam_lr1e-3_mmt0_l21e-5_delta/decode_test_beam30_nbest1_ml100/data.json
2019-09-03 19:05:02,218 (json2trn:28) INFO: reading data/lang_1char/train_chars.txt
2019-09-03 19:05:02,218 (json2trn:37) INFO: writing hyp trn to exp/train_in240_hidden256_e3_lstm_drop0.2_dot_emb512_hidden512_d1_epoch1_norm5_bs32_mli800_mlo150_adam_lr1e-3_mmt0_l21e-5_delta/decode_test_beam30_nbest1_ml100/hyp.trn
2019-09-03 19:05:02,218 (json2trn:38) INFO: writing ref trn to exp/train_in240_hidden256_e3_lstm_drop0.2_dot_emb512_hidden512_d1_epoch1_norm5_bs32_mli800_mlo150_adam_lr1e-3_mmt0_l21e-5_delta/decode_test_beam30_nbest1_ml100/ref.trn
write a CER (or TER) result in exp/train_in240_hidden256_e3_lstm_drop0.2_dot_emb512_hidden512_d1_epoch1_norm5_bs32_mli800_mlo150_adam_lr1e-3_mmt0_l21e-5_delta/decode_test_beam30_nbest1_ml100/result.txt
| SPKR | # Snt # Wrd | Corr Sub Del Ins Err S.Err |
| Sum/Avg | 419 26135 | 100.0 0.0 0.0 0.0 0.0 0.0 |
How should it be interpreted?
How do I run recognition on a random new wav, and where does it write the recognised text?
Do you have ANY idea why data.json isn't being generated in the decoding folder by itself?
I would be very grateful if you could answer any of these questions.
PS: do you have any spaces in your language? :D
I guess:
1. It is not correct to copy test/data.json to exp/{...}/data.json. If you do so, the scoring script compares test/data.json with exp/{...}/data.json, which are the same in your case, so the result would be 100% correct (see the sketch after this list).
2. "How to run recognition for a random new wav and where does it write the recognised text?" You should prepare dump/test/deltatrue/data.json, which can be generated from your data dir. Look into the data preparation script.
3. "Do you have ANY idea why data.json isn't being generated in the decoding folder by itself?" Maybe it is still the encoding problem.
4. And BTW, what do you mean by "do you have any spaces in your language?" : )
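For context, here is roughly what one utterance looks like in an ESPnet-style data.json (just a sketch: the field names such as tokenid/rec_tokenid come from this thread and from similar recipes, and the ids and paths below are made up). Decoding adds the rec_* fields, and scoring compares them against the reference token/tokenid fields, so if the two files are copies of each other the comparison is trivially 100% correct:

```python
# Sketch of one utterance entry in data.json (all values are illustrative only).
utt_entry = {
    "input": [
        {"name": "input1",
         "feat": "dump/test/deltatrue/feats.1.ark:12",   # placeholder path
         "shape": [498, 240]}                            # frames x feature dim
    ],
    "output": [
        {"name": "target1",
         "text": "过 去 的 就 不 要 想 了",               # reference transcript
         "token": "过 去 的 就 不 要 想 了",
         "tokenid": "101 35 336 63 12 258 156 21",        # made-up ids
         # the fields below only appear after the decoding stage:
         "rec_text": "过 去 的 就 不 要 想 了",
         "rec_token": "过 去 的 就 不 要 想 了 <eos>",
         "rec_tokenid": "101 35 336 63 12 258 156 21 0",
         "score": -1.23}
    ],
    "utt2spk": "T0055G2375",
}
```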
3. By the way, what do you mean by 'encoding'? What stage in run.sh represents encoding? I thought there is only decoding, from wav to text. No?
An utterance for nnet training usually contains only one sentence, so there won't be any full stops or commas in it; if there are, they should be deleted before training.
"Encoding" means the character encoding of the text, like utf-8. When you work with Chinese, e.g. opening a file which contains Chinese, you should always be careful with the encoding.
I managed to run decoding from start to end; the problem really was in the encoding (it needed to be added to some other files). But! For some reason the results are still 100% correct. I don't understand why that is.
By the way, do you know how to load a pretrained model for further training?
> for some reason the results are still 100% correct

Can you show me an example of your train/...../data.json?
> how to load a pretrained model for further training

I didn't find anything like train_stage in the Kaldi process, so it might not support pre-training.
> Can you show me an example of your train/...../data.json?
You mean dump/train/deltatrue/data.json? Here it is.
I also attach dump/test/deltatrue/data.json and the data.json from the decoding folder that was generated during decoding. I renamed them train_data.json, test_data.json and decode_data.json to distinguish them easily in the attachment.
Looks like you are using the script directly on your own data. One of your utterances is "рон не отрываясь смотрел на письмо которое уже начало с углов дымиться".
In Chinese, one syllable can be one word, like "我" (one token), which means "me"; but in your language, "рон" would be split into "р о н" (three tokens). Maybe you should modify the script to better fit your data, e.g. one word per token ("рон" as one token).
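A rough illustration of the difference in plain Python (just to show the idea; the actual token lists are built by the data preparation script):

```python
utt = "рон не отрываясь смотрел на письмо"

# Character-level tokenization (what a char-based recipe does): every letter
# becomes its own token, and spaces need a dedicated symbol.
char_tokens = ["<space>" if c == " " else c for c in utt]
print(char_tokens[:5])   # ['р', 'о', 'н', '<space>', 'н']

# Word-level tokenization (one word = one token), which fits Russian better
# but needs a much larger output vocabulary.
word_tokens = utt.split()
print(word_tokens[:3])   # ['рон', 'не', 'отрываясь']
```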
So, did I understand you correctly? You are saying that it's better to construct a vocab with lots of tokens that are meaningful pieces of the language, i.e. not predict letters, but predict those pieces?
Maybe that makes sense, since I can take such a vocab from BERT for Russian.
Can you tell me, please, which files I should look at to make the changes? I mean, if I just put the new vocab in place of the old one, nothing changes, right?
I am doing similar work for code-switch recognition, where for English I am going to use subword 'BPE' units, not letters. For example, catch --> ca tch, not 'c a t c h'.
> Can you tell me, please, which files I should look at to make the changes?

The script in data preparation, specifically the part that generates data.json.
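A rough sketch of the kind of change, assuming the data preparation step builds the 'output' part of each data.json entry from a transcript and a dictionary (the helper and field names here are hypothetical, not the repo's exact code):

```python
# Build word-level token/tokenid fields for one utterance, instead of
# splitting the transcript into characters.
def make_output_entry(transcript, word2id, unk_id=1):
    tokens = transcript.split()                        # one word = one token
    token_ids = [word2id.get(w, unk_id) for w in tokens]
    return {
        "text": transcript,
        "token": " ".join(tokens),
        "tokenid": " ".join(str(i) for i in token_ids),
        "shape": [len(tokens), len(word2id)],          # target length x vocab size
    }

word2id = {"<unk>": 1, "рон": 2, "не": 3, "смотрел": 4}    # toy dictionary
print(make_output_entry("рон не смотрел на письмо", word2id))
```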
> I am doing similar work for code-switch recognition, where for English I am going to use subword 'BPE' units, not letters. For example, catch --> ca tch, not 'c a t c h'.
Yeah, the BERT vocab uses BPE exactly to construct the vocabulary. Plus, there are huge complete vocabs for English, maybe you can use them, since Google had much more data to construct them. For Russian they are much smaller, but still complete enough.
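If it helps, here is a minimal sketch of producing BPE subword units with the sentencepiece library (assuming it is installed; the file names are placeholders, and this is not necessarily how this repo or BERT builds its vocab):

```python
import sentencepiece as spm

# Train a small BPE model on the training transcripts (one sentence per line).
# vocab_size should be chosen to match the size of the corpus.
spm.SentencePieceTrainer.train(
    input="train_text.txt",       # placeholder: plain-text transcripts
    model_prefix="bpe_rus",
    vocab_size=2000,
    model_type="bpe",
)

sp = spm.SentencePieceProcessor(model_file="bpe_rus.model")
print(sp.encode("рон не отрываясь смотрел на письмо", out_type=str))
# e.g. ['▁рон', '▁не', '▁отрыва', 'ясь', '▁смотрел', '▁на', '▁письмо']
```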
OK, I found the real cause of the 100% correct results in result.txt.
The problem was in json2trn.py: the source code created two absolutely identical files, ref and hyp, from the decode data.json. But we know they must be different: hyp should contain the predictions of the model, ref the references from the test data.json. I fixed it in my local copy of the code, and result.txt is now correct (no longer 100% correct).
Maybe it should be fixed in the source code as well.
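In case it is useful, a rough sketch of the kind of fix I mean (the field names "token" / "rec_token" and the utterance-id format are assumptions based on the data.json layout discussed above; the real script also goes through the character dictionary):

```python
import json

# Write ref.trn from the reference tokens and hyp.trn from the recognized
# tokens, instead of writing the same field into both files.
with open("data.json", encoding="utf-8") as f:
    utts = json.load(f)["utts"]

with open("ref.trn", "w", encoding="utf-8") as ref, \
     open("hyp.trn", "w", encoding="utf-8") as hyp:
    for utt_id, info in utts.items():
        out = info["output"][0]
        ref.write("%s (%s)\n" % (out["token"], utt_id))       # reference
        hyp.write("%s (%s)\n" % (out["rec_token"], utt_id))   # model prediction
```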
Okay, I've done something wrong: now hyp.trn is being created empty. Can somebody tell me which files besides json2trn.py are responsible for its creation, please? Maybe I will figure this out tomorrow, but if someone already knows and answers before then, that would be great.
It should create something like this, from exp/***/decode/data.json:
hyp.trn
过 去 的 就 不 要 想 了 (T0055G2375-T0055G2375S0447)
天 气 下 降 注 意 身 体 (T0055G2286-T0055G2286S0457)
浦 中 市 剧 中 人 街 最 儿 我 独 醒 事 已 见 放 (T0055G0915-T0055G0915S0468)

ref.trn
过 去 的 就 不 要 想 了 (T0055G2375-T0055G2375S0447)
天 气 下 降 注 意 身 体 (T0055G2286-T0055G2286S0457)
补 充 诗 句 众 人 皆 醉 而 我 独 醒 是 以 见 放 (T0055G0915-T0055G0915S0468)
It's strange that although my exp/***/decode/data.json is not empty and looks pretty correct, I still get an empty hyp.trn, while ref.trn is not empty at all and also looks correct.
@MNCTTY did you solve your problem? I'm hesitant about whether to use this tool.
Hi! I managed to train LAS on aishell data without errors. This is the end of the log:
But the decoding stage gave an error:
I don't understand why some file is missing in that directory. I thought everything that run.pl needs is generated there by itself.