Great, yes, thanks @efosler. We never really got a full recipe for tf + wfst. I have some spare scripts, but nothing clean and official...
This is correct yes:
It looks to me like run_ctc_phn.sh creates a set of training/cv labels that ignores noises (and overwrites units.txt, removing spn and npn). However, utils/ctc_compile_dict_token.sh assumes that units.txt and lexicon.txt are synchronized, resulting in the lovely error:
I was wondering whether there is an improved (or simpler) way to do the data preparation for characters and phonemes separately. Do you have any thoughts? I can try to help. Otherwise we can reuse the preparation from the master branch.
Also, another random thing that I observed with swbd: I tried to prepare the char setup, substituting numbers with written-out words and removing noises, but in the end it did not work out...
I am working on integrating the CharRNN decoding recipe that we have (it doesn't perform better than WFST, but it allows an open vocabulary): https://arxiv.org/abs/1708.04469
Please let me know if I can help you somehow; I will be very happy to!
Thanks again!
Let me think about it as I play with the scripts. I just created a _tf version of swbd1_decode_graph.sh which gets rid of the -unk option, but that feels like there could be a better factorization.
So, an update: the good news is that I was able to get a decode to run all the way through. There does seem to be a bit of underperformance w.r.t. Yajie's runs on the non-tf version. Currently, I'm seeing 24.7% WER on eval2000 (vs. 21.0) using SWB + Fisher LM. I think there are a few differences:
- 4 layer BiLSTM vs 5 layer BiLSTM
- I'm not sure that the default tf recipe currently checked in has speaker adaptation
I'm sure that there is some other possible set of differences in parameters as well.
Just to check: what I did was just work with the output of ./steps/decode_ctc_am_tf.sh and feed the logprobs through latgen-faster. NB this just runs test.py in ctc-am rather than nnet.py or anything else (not sure if this is the right thing to do, but it's what's checked in).
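For reference, a minimal sketch of that glue (paths and beam settings here are illustrative assumptions, not the recipe's defaults):

```bash
# feed dumped CTC log-probs straight into latgen-faster against the TLG graph;
# $graphdir and $dir/logprobs.ark are assumed outputs of earlier steps
latgen-faster --max-active=7000 --beam=17.0 --lattice-beam=8.0 \
  --acoustic-scale=0.9 --allow-partial=true \
  --word-symbol-table=$graphdir/words.txt \
  $graphdir/TLG.fst ark:$dir/logprobs.ark \
  "ark:|gzip -c > $dir/lat.1.gz"
```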
Any thoughts on diffs between the tf and old versions that might be causing the discrepancy?
Hi Eric,
Thank you very much for that. Can you let me know what token error rate you were getting with the acoustic model? I have some experiments that achieved well below this WER. Did you use the prior probabilities of each character during WFST decoding? Can you share the complete log of the acoustic model training?
Thanks!
Best, Ramon
Let me re-run it - I just started up the non-tf version and I think it blew away the tf version (hmpf). I am pretty sure that it isn't using the prior probabilities of each phone (not character) but I'm not sure. (I don't see where that would have been injected into the system).
@efosler I would also like to integrate the tf acoustic model into a WFST for decoding. As I understand this thread, you have managed to do that. Is any of your code in the repo?
I pulled tf_clean and asr_egs/swbd/v1-tf/run_ctc_phn.sh only does acoustic decoding.
Would be great if I could avoid starting from scratch :)
Sorry for the delay - I had a few other things pop up. The non-tf run didn't finish before we had a massive server shutdown because of a planned power outage (sigh). So @ericbolo let me try to run the v1-tf branch again and I can check in against my copy of the repository. I think that @ramonsanabria has had a better outcome than I have.
Basically, the things I had to do were slight modifications to building the TLG graph, followed by calling latgen-faster and score_sclite.sh. I'm sure that the decoding parameters aren't right, and I have to investigate whether I have the priors involved or not before decoding.
@efosler, thank you !
Hi all,
Yesterday I was able to re-run our complete EESEN + WFST pipeline using BPE units (https://arxiv.org/pdf/1712.06855.pdf) on SWBD. I hit 16.5% WER without fine-tuning, and I feel I can maybe get some extra points just by playing a bit with the WFST parameters.
The BPE pipeline made me think that, in case you need a little more speed during decoding, you can also use BPE (bigger units mean fewer steps in the decoding process), and the accuracy is not that bad.
PS: We also have a recipe to train the acoustic model on the whole Fisher + SWBD corpus, in case you need it for the real-time implementation that you have in mind.
Thanks!
Best, Ramon
Correction to the email copy of the message above, which linked the wrong paper: the BPE paper is this one: https://arxiv.org/pdf/1712.06855.pdf
@ramonsanabria thanks! We've been having some NFS issues here so I haven't gotten my full pass to run. It would be great to have this recipe in the mix. Does this want to go into v1-tf or should there be a v2-tf?
We finally got the NFS issues resolved, so I should have the training done by tomorrowish. @ramonsanabria, two questions: 1) I noticed that the default forward pass used epoch 14 (rather than the full model) - was there a reason for that, or is that something I should clean up? (This would be part of the reason for substandard results, but more likely item 2 below...) 2) I do not believe that the decoding is using priors. I see a start on that in ctc-am/tf/tf_test.py, but it doesn't seem to do anything with the priors, nor is there a model.priors file built during training (as far as I can tell). Am I missing something?
Hi Eric, sorry for not responding to the last message. We have also been a little busy with JSALT.
Regarding tf-v2: yes, we can do that, cool idea. @fmetze, what do you think? I am still fine-tuning everything (WFST and AM) and I should include some code (mostly for the BPE generation), but it would be a good idea. We are preparing a second publication for SLT; after acceptance we can release the whole recipe.
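As a placeholder until that code lands, here is one way to generate BPE units with the subword-nmt package (my assumption for illustration; the recipe's own BPE scripts may differ):

```bash
pip install subword-nmt

# strip Kaldi-style utterance IDs, learn 300 merge operations, apply them
cut -d' ' -f2- data/train/text > train_words.txt
subword-nmt learn-bpe -s 300 < train_words.txt > bpe300.codes
subword-nmt apply-bpe -c bpe300.codes < train_words.txt > train_bpe300.txt
```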
OK, let me send all the parameters that I am using. Can you share your TER results with your configuration? You might find some parameters that are currently not implemented in the master branch (dropout, etc.), but with the shared subset of parameters you should be fine. With this configuration on swbd, I remember that @fmetze achieved something close to 11% TER.
target_scheme {'no_name_language': {'no_target_name': 47}} drop_out 0.0 sat_conf {'num_sat_layers': 2, 'continue_ckpt_sat': False, 'sat_stage': 'fine_tune', 'sat_type': 'non_adapted'} init_nproj 80 clip 0.1 nlayer 5 nhidden 320 data_dir /tmp/tmp.EKt4xyU6eX min_lr_rate 0.0005 half_rate 0.5 do_shuf True nepoch 30 grad_opt grad random_seed 15213 model_dir exp/fmetze_test_43j26e/model input_feats_dim 129 batch_size 16 kl_weight 0.0 lstm_type cudnn lr_rate 0.05 model deepbilstm nproj 340 final_nproj 0 half_after 8 train_dir exp/fmetze_test_43j26e/ online_augment_conf {'window': 3, 'subsampling': 3, 'roll': True} clip_norm False l2 0.001 store_model True debug False continue_ckpt half_period 4 force_lr_epoch False batch_norm True
Let me take a look at bullet point 2 later in the day. For testing, you should use https://github.com/srvk/eesen/blob/tf_clean/tf/ctc-am/test.py
It will generate various versions of the forward pass (log_probs, logits, probs, etc.) with blank in position 0. I will need to clean this up so that the script only outputs what is really needed.
Once you have the log_probs, you can apply the normal EESEN C++ recipe (i.e., apply the WFST to the log_probs). I am not sure why my character-based WFST is not working; I could make it work with bpe300 and other units, but not with characters. I will try to get back to you later on this.
Thanks!
No worries on lag - I think this is going to be an "over several weeks" thing as this isn't first priority for any of us (although high priority overall).
The TER I'm seeing is more around 15% (still training, but I don't see it likely to get much under 15%) - I will see if there are any diffs.
Meanwhile once I get the pipeline to finish, I'll check in a local copy for @ericbolo so that he can play around, since it is a working pipeline even if it isn't efficient or as high accuracy.
Thanks!
Just for the record, here are diffs on config:
nlayer: 4 (vs 5)
input_feats_dim: 120 (vs 129)
batch_size: 32 (vs 16)
lr_rate: 0.005 (vs 0.05)
nproj: 60 (vs 340)
online_augment_conf.roll = False (vs True)
l2: 0.0 (vs 0.001)
batch_norm: False (vs True)
So it's pretty clear that there are some significant differences, and I'd believe the sum total of them could result in a 4% difference in TER (particularly layers, l2, batch norm, lr_rate, and maybe nproj). The really interesting question is what the extra 9 features are - it looks like one additional base feature which has deltas/double-deltas and windowing applied.
{'continue_ckpt': '', 'diff_num_target_ckpt': False, 'force_lr_epoch': False, 'random_seed': 15213, 'debug': False, 'store_model': True, 'data_dir': '/scratch/tmp/fosler/tmp.xJUz4scH4T', 'train_dir': 'exp/train_phn_l4_c320_mdeepbilstm_w3_ntrue_p60_ip80_fp80', 'batch_size': 32, 'do_shuf': True, 'nepoch': 30, 'lr_rate': 0.005, 'min_lr_rate': 0.0005, 'half_period': 4, 'half_rate': 0.5, 'half_after': 8, 'drop_out': 0.0, 'clip_norm': False, 'kl_weight': 0.0, 'model': 'deepbilstm', 'lstm_type': 'cudnn', 'nproj': 60, 'final_nproj': 100, 'init_nproj': 80, 'l2': 0.0, 'nlayer': 4, 'nhidden': 320, 'clip': 0.1, 'batch_norm': False, 'grad_opt': 'grad', 'sat_conf': {'sat_type': 'non_adapted', 'sat_stage': 'fine_tune', 'num_sat_layers': 2, 'continue_ckpt_sat': False}, 'online_augment_conf': {'window': 3, 'subsampling': 3, 'roll': False}, 'input_feats_dim': 120, 'target_scheme': {'no_name_language': {'no_target_name': 43}}, 'model_dir': 'exp/train_phn_l4_c320_mdeepbilstm_w3_ntrue_p60_ip80_fp80/model'}
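For what it's worth, the dimension gap is consistent with pitch features. A quick arithmetic check (the 40-dim fbank / 3-dim pitch split is my assumption, based on common Kaldi feature setups; see the fbank_pitch suggestion below):

```python
window = 3   # online_augment_conf['window']: frames stacked per step
fbank = 40   # assumed base fbank dimension
pitch = 3    # assumed Kaldi pitch features (pov, pitch, delta-pitch)

assert fbank * window == 120            # the 120-dim config
assert (fbank + pitch) * window == 129  # the 129-dim config
```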
Perfect, yes. With these parameters you should see an improvement, in my experience. For input_feats_dim: 120 (vs 129), just extract fbank_pitch features (we should also change this in the v1-tf recipe). I assume 'subsampling': 3 is correct, right? (This is also important.)
Also, those are the ones I think are implemented: 'nproj': 60, 'final_nproj': 100, 'init_nproj': 80. Otherwise I will push code to add them.
Thanks!
Hello all,
@efosler: yes, all I need is a running pipeline, not the best performing one, so that I can look at all the pieces of an online decoding system with tensorflow + wfst.
@ericbolo I've uploaded my changes to efosler/eesen so you can grab the newest copy. This should work - there are a few diffs with the graph prep scripts. Here's a list of the files that I changed so that you can just grab them if you want:
- asr_egs/swbd/v1-tf/local/swbd1_data_prep.sh
- asr_egs/swbd/v1-tf/local/swbd1_decode_graph_tf.sh
- asr_egs/swbd/v1-tf/run_ctc_phn.sh
- asr_egs/swbd/v1/local/swbd1_data_prep.sh [cosmetic changes only]
- asr_egs/wsj/steps/decode_ctc_lat_tf.sh
- asr_egs/wsj/steps/train_ctc_tf.sh [python 3 compatibility]
- asr_egs/wsj/utils/ctc_token_fst.py
- asr_egs/wsj/utils/model_topo.py
My next step will be to try to rework the recipe so that it matches the parameters sent by @ramonsanabria . Once I've got that done and confirmed I'll send a pull request.
NB: the decode script is woefully non-parallel (needs to be fixed), but for the online stuff this won't matter.
@efosler: wonderful, thanks !
I don't have the swbd db but I can adapt it for, say, tedlium.
Hey @ramonsanabria, quick question: you said...
"Also those ones that I think are implemented: 'nproj': 60, 'final_nproj': 100, 'init_nproj': 80. Otherwise I will push code to have them."
Looking through the code base, it seems like these are passed as parameters - will it not do the right thing if those parameters are set?
About to go offline for a bit, so I won't be able to report on the full run, but training with the parameters above (same as @fmetze's run but with nproj=60, final_nproj=100, init_nproj=80) does get down to 11.7% TER, so I will make those the defaults in the script going forward. Decoding hasn't happened yet.
@efosler, a quick update: I was able to run the full pipeline with tensorflow + language model decoding on a dummy dataset. Thanks again !
Next steps re: online decoding (#141): implementing a forward-only LSTM, and the loss function for student-teacher learning.
re: priors. As @efosler noted, it seems the priors are not used in the current decoding code.
in tf_test.py:
if config[constants.CONFIG_TAGS_TEST.USE_PRIORS]:
    # TODO: we need to add priors also:
    # feed_priors = {i: y for i, y in zip(model.priors, config["prior"])}
    print(config[constants.CONFIG_TAGS_TEST.PRIORS_SCHEME])
model.priors doesn't seem to be generated anywhere, but we can use label.counts to generate it. @fmetze, are priors used in the original (C++) implementation?
Hi all,
Priors are generated by:
labels=$dir_am/label.counts
gunzip -c $dir_am/labels.tr.gz | \
  awk '{line=$0; gsub(" "," 0 ",line); print line " 0";}' | \
  /data/ASR5/fmetze/eesen-block-copy/src/decoderbin/analyze-counts \
    --verbose=1 --binary=false ark:- $labels
Then you can use nnet.py as:
$decode_cmd JOB=1:$nj $mdl/log/decode.JOB.log \
  cat $PWD/$mdl/split$nj/JOB/feats.scp \| sort -k 1 \| \
  python utils/nnet.py --label-counts $labels \
    --temperature $temperature --blank-scale $bkscale \| \
  latgen-faster --max-active=$max_active --max-mem=$max_mem \
    --beam=$beam --lattice-beam=$lattice_beam \
    --acoustic-scale=$acwt --allow-partial=true \
    --word-symbol-table=$graphdir/words.txt \
    $graphdir/TLG.fst ark:- "ark:|gzip -c > $mdl/lat/lat.JOB.gz" || exit 1;
This nnet.py currently uses tensorflow; I have a version that doesn't rely on it. I will push it now. Will keep you posted.
Thanks!
Here is the commit of the new asr_egs/wsj/utils/nnet_notf.py (which does not use tf): 543c9edfe4e601b7f3e1f22feb7c9f64f5430908
Here are the code snippets from the previous message, properly formatted (email replies do not support markdown code blocks):
labels=$dir_am/label.counts
gunzip -c $dir_am/labels.tr.gz | \
  awk '{line=$0; gsub(" "," 0 ",line); print line " 0";}' | \
  /data/ASR5/fmetze/eesen-block-copy/src/decoderbin/analyze-counts \
    --verbose=1 --binary=false ark:- $labels
$decode_cmd JOB=1:$nj $mdl/log/decode.JOB.log \
  cat $PWD/$mdl/split$nj/JOB/feats.scp \| sort -k 1 \| \
  python utils/nnet.py --label-counts $labels \
    --temperature $temperature --blank-scale $bkscale \| \
  latgen-faster --max-active=$max_active --max-mem=$max_mem \
    --beam=$beam --lattice-beam=$lattice_beam \
    --acoustic-scale=$acwt --allow-partial=true \
    --word-symbol-table=$graphdir/words.txt \
    $graphdir/TLG.fst ark:- "ark:|gzip -c > $mdl/lat/lat.JOB.gz" || exit 1;
So, an update on my progress with SWB (now that I'm getting back to this). I haven't tried out @ramonsanabria 's code above yet.
I'm able to train a SWB system getting 11.8% TER on the CV set (much better than before). However, decoding with this (again not with priors) gives me a 40+% WER - much worse than the previous setup. I'm trying to debug this to understand where things are going wrong.
One thing I tried to do was turn on TER calculation during the forward pass. Had to do some modifications to steps/decode_ctc_am_tf.sh to make it pass the right flags to the test module. However, that was a non-starter it seems - the forward pass just hangs with no errors.
Seems like the next best step would be to just try to switch to @ramonsanabria 's decode strategy and abandon steps/decode_ctc_am_tf.sh?
@ramonsanabria what's a good (rough) value for blank_scale?
@ramonsanabria Now looking through nnet.py (and non-tf version) - this actually takes the output of the net and does the smoothing and priors as a filter, right? The code snippet you have above doesn't actually run the net forward, it seems to me, but would do something funky on the features in feats.scp.
Hi all,
How is it going? A good value for blank scale should be between 0.9 and 1.1, but it is something we should play with. Exactly: the nnet.py script only takes the posteriors from EESEN, modifies them slightly (applies blank scaling, puts blank at index zero so the WFST can read it, applies temperature to the whole distribution, and applies priors, which will certainly boost WER scores), and finally pipes them to the next script, which I believe is the WFST decoding.
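To make that transformation order concrete, here is a minimal numpy sketch of such a filter (the names, exact operation order, and prior handling are assumptions for illustration, not the actual nnet.py code):

```python
import numpy as np

def massage_posteriors(post, log_priors, blank_idx,
                       blank_scale=1.0, temperature=1.0):
    """Toy nnet.py-style filter: scale the blank, move it to index 0 for
    the WFST, apply temperature, and subtract log-priors.
    `log_priors` is assumed to be given already in the reordered
    (blank-first) label order."""
    logp = np.log(post + 1e-20)
    logp[:, blank_idx] += np.log(blank_scale)   # blank scaling
    order = [blank_idx] + [i for i in range(post.shape[1]) if i != blank_idx]
    logp = logp[:, order]                       # put blank at index 0
    logp = logp / temperature                   # temperature smoothing
    return logp - log_priors                    # divide by priors (log domain)
```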
Will you guys be in India for Interspeech? Would be great to meet :)
India, I wish! But no...
Alas, I won't be in India either. (Maybe I might be able to stop by CMU sometime this semester.)
Update on progress: I wrote a small script to do greedy decoding on a logit/posterior stream and calculate the TER. (Will post this to my repo soonish and then send a pull request.) Found that on the SWB eval2000 test set I was getting 30% TER (this after priors; without priors it is worse). I was slightly puzzled by that, so I decided to calculate the TER on the train_dev set for SWB - I'm getting roughly 21-22% TER. This was a system that was reporting 11.8% TER on the same set during training. So something is rather hinky. Still digging, but if anyone has ideas, let me know.
I think I've enabled tf_train to dump out the forward pass on cv to see what's going on - whether there is a difference in the output. Took me a good chunk of the evening. One thing I did run across is that the forward pass on subsampled data gets averaged in tf_test - it's not clear to me whether the TER reported in tf_train is over the averaged stream or (as I suspect) over all variants. I don't think this could account for a factor of two in TER, though.
FWIW, I think the code would be cleaner if tf_train and tf_test were factorized some - I had to copy a lot of code over and I worry about inconsistencies between them (although they are hooked together through the model class).
Update from yesterday (now that the swbd system has had some time to train): the dumped cv ark files do not show the same CTC error rate as the system claims. I suspect the averaging might be doing something weird. Writing down my assumptions here so someone can pick them apart - greedy decoding is:

import itertools

def greedy_decode(logits):
    # best path: take the argmax per frame, collapse repeats, drop blanks (index 0)
    return [i for i, _ in itertools.groupby(logits.argmax(1)) if i > 0]
(Now this is making me wonder if the test set was augmented... hmmm...)
Anyway, just to give a sample of the difference in TER:
Reported by tf during training:
Validate cost: 40.4, ter: 27.6%, #example: 11190
Validate cost: 32.9, ter: 22.2%, #example: 11190
Validate cost: 30.2, ter: 21.2%, #example: 11190
Validate cost: 27.8, ter: 19.2%, #example: 11190
Validate cost: 26.8, ter: 18.3%, #example: 11190
Validate cost: 35.0, ter: 23.7%, #example: 11190
Validate cost: 28.4, ter: 19.4%, #example: 11190
Validate cost: 24.8, ter: 17.1%, #example: 11190
Decoding on the averaged stream:
TER = 76690 / 152641 = 50.2
TER = 69108 / 152380 = 45.4
TER = 68611 / 152380 = 45.0
TER = 62259 / 152380 = 40.9
TER = 59838 / 152380 = 39.3
TER = 72821 / 152380 = 47.8
TER = 61498 / 152380 = 40.4
TER = 59800 / 152380 = 39.2
@ramonsanabria and @fmetze can you confirm what the online feature augmentation is doing? I think I misunderstood it in my comments above. (I had visions of other types of augmentation going on but reading the code I think it's simpler than I thought.)
Looking through the code, it seems like when you have the subsample and window set to 3, it stacks three frames on the input and makes the input sequence three times shorter. Is it also creating three variants with different shifts? I'm trying to figure out where the averaging would come in later.
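For concreteness, here is that reading of the augmentation as a numpy sketch (an assumption about what the repo's code does, not the code itself):

```python
import numpy as np

def stack_and_subsample(feats, window=3, subsample=3):
    """Stack `window` consecutive frames, then keep every `subsample`-th
    stacked frame; one output stream per shift."""
    T, D = feats.shape
    pad = np.pad(feats, ((0, window - 1), (0, 0)), mode="edge")
    stacked = np.concatenate([pad[i:i + T] for i in range(window)], axis=1)
    return [stacked[shift::subsample] for shift in range(subsample)]

streams = stack_and_subsample(np.random.randn(100, 40))
# -> three shifted variants, each ~100/3 frames of 120-dim stacked features
```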
OK, I have figured out the discrepancy in output between forward passes and what is recorded by the training regime. tl;dr - the augmentation and averaging code in tf_test.py is at fault and should not be currently trusted. I'm working on a fix.
When training is done with augmentation (in this example, with window 3) 3 different shifted copies are created for training with stacked features. The TER is calculated for each copy (stream) by taking a forward pass and greedy decoding over the logits, then getting edit distance to the labels. The reported TER is over all copies.
At test time, it is not really clear what to do with 3 copies of the same logit stream. The test code (which I've replicated in the forward pass during training) assumes that the correct thing to do is to average the logit streams. This would be appropriate for a traditional frame-based NN system. However, in a CTC-based system there is no guarantee of synchronization of outputs, so averaging the streams means that sometimes the blank label will dominate where it should not (for example: if one stream labels greedily "A blank blank", the second "blank A blank" and the third "blank blank A", then the average stream might label "blank blank blank" - causing a deletion).
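A toy numpy illustration of that deletion effect (the numbers are made up):

```python
import numpy as np

# three shifted streams over labels {blank=0, "A"=1}; each stream fires
# "A" at a different frame, as unsynchronized CTC streams are free to do
s1 = np.array([[0.1, 0.9], [0.9, 0.1], [0.9, 0.1]])  # greedy: A - -
s2 = np.array([[0.9, 0.1], [0.1, 0.9], [0.9, 0.1]])  # greedy: - A -
s3 = np.array([[0.9, 0.1], [0.9, 0.1], [0.1, 0.9]])  # greedy: - - A

avg = (s1 + s2 + s3) / 3
print(avg.argmax(1))  # [0 0 0]: blank wins every frame, so "A" is deleted
```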
I verified this by only dumping out the first stream rather than the average, and found that the CV TER was identical to that reported by the trainer. (That's not to say that the decoding was identical, but the end number was the same.)
Upshot: it's probably best to arbitrarily take one of the streams and use it at test time - although is there a more appropriate combination scheme?
Created a new issue for this particular bug: #194.
Latest update: decoding with the SWB + Fisher LM, incorporating priors, and fixing the averaging bug leads to 19.2% WER on eval2000, with the swbd subset getting 13.4% WER (the Kaldi triphone-based system gets 13.3% WER on the same set, although it may be a more involved model). I think that this is close enough for a baseline to declare victory. I'll clean stuff up and then make a pull request.
Successful full train and decode; I also tested out a run with a slightly larger net (with a bit of improvement). Adding these baselines to the README file.
# CTC Phonemes on the Complete set (with 5 BiLSTM layers) with WFST decode
%WER 12.5 | 1831 21395 | 88.9 7.7 3.4 1.5 12.5 49.6 | exp/train_phn_fbank_pitch_l5_c320_mdeepbilstm_w3_ntrue_p60_ip80_fp80/results_epoch22_bs5.0_sw1_fsh_tgpr/score_8/eval2000.ctm.swbd.filt.sys
%WER 18.3 | 4459 42989 | 83.9 11.7 4.4 2.2 18.3 57.3 | exp/train_phn_fbank_pitch_l5_c320_mdeepbilstm_w3_ntrue_p60_ip80_fp80/results_epoch22_bs4.0_sw1_fsh_tgpr/score_8/eval2000.ctm.filt.sys
%WER 23.9 | 2628 21594 | 79.0 15.5 5.6 2.8 23.9 62.5 | exp/train_phn_fbank_pitch_l5_c320_mdeepbilstm_w3_ntrue_p60_ip80_fp80/results_epoch22_bs4.0_sw1_fsh_tgpr/score_8/eval2000.ctm.callhm.filt.sys
# Slightly larger model (400 units, 80 internal projections) with WFST decode
%WER 12.2 | 1831 21395 | 89.2 7.7 3.1 1.4 12.2 49.7 | exp/train_phn_fbank_pitch_l5_c400_mdeepbilstm_w3_ntrue_p80_ip80_fp80/results_epoch23_bs7.0_sw1_fsh_tgpr/score_10/eval2000.ctm.swbd.filt.sys
%WER 17.8 | 4459 42989 | 84.1 11.1 4.8 1.9 17.8 57.1 | exp/train_phn_fbank_pitch_l5_c400_mdeepbilstm_w3_ntrue_p80_ip80_fp80/results_epoch23_bs7.0_sw1_fsh_tgpr/score_9/eval2000.ctm.filt.sys
%WER 23.4 | 2628 21594 | 79.3 14.8 5.9 2.7 23.4 62.1 | exp/train_phn_fbank_pitch_l5_c400_mdeepbilstm_w3_ntrue_p80_ip80_fp80/results_epoch23_bs7.0_sw1_fsh_tgpr/score_10/eval2000.ctm.callhm.filt.sys
Awesome, thank you so much Eric! The numbers look great. Can you share the full training configuration?
Thank you again!
Just submitted the pull request (#196).
Once we decide that #196 is all good, I think we can close this particular thread!!!
OK, closing this particular thread. Whew!
Decided a new thread would be good for this issue.
Right now the SWB tf code as checked in seems to have a discrepancy, and I'm writing down some of my assumptions as I work through cleaning up the WFST decode.
It looks to me like run_ctc_phn.sh creates a set of training/cv labels that ignores noises (and overwrites units.txt, removing spn and npn). However, utils/ctc_compile_dict_token.sh assumes that units.txt and lexicon.txt are synchronized, resulting in the lovely error:
The fix is pretty simple (synchronizing the lexicon), but I'm trying to figure out how much to modify the utils/ctc_compile_dict_token.sh script vs. correcting the prep script. I'm thinking that I'll correct the prep script, but if anyone has thoughts on that, let me know.
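For reference, a minimal sketch of the lexicon-side fix (hypothetical file paths; the real change would live inside the data prep scripts):

```python
# keep only lexicon entries whose units all appear in units.txt, so that
# units.txt and lexicon.txt stay synchronized
units = {line.split()[0] for line in open("data/lang_phn/units.txt")}

with open("data/local/dict_phn/lexicon.txt") as fin, \
     open("data/local/dict_phn/lexicon_synced.txt", "w") as fout:
    for line in fin:
        word, *phones = line.split()
        if all(p in units for p in phones):
            fout.write(line)
```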