srvk / eesen

The official repository of the Eesen project
http://arxiv.org/abs/1507.08240
Apache License 2.0

[tf_clean] swb1/v1-tf WFST decoding - checking on assumptions #193

Closed efosler closed 5 years ago

efosler commented 6 years ago

Decided a new thread would be good for this issue.

Right now the SWB tf code as checked in seems to have a discrepancy, and I'm writing down some of my assumptions as I work through cleaning up the WFST decode.

It looks like to me that run_ctc_phn.sh creates a set of training/cv labels that ignores noises (and overwrites units.txt, removing spn and npn). However, utils/ctc_compile_dict_token.sh assumes that units.txt and lexicon.txt are synchronized, resulting in the lovely error:

FATAL: FstCompiler: Symbol "spn" is not mapped to any integer arc ilabel, symbol table = data/lang_phn/tokens.txt, source = standard input, line = 1171

The fix is pretty simple (synchronizing the lexicon) but I'm trying to figure out how much to modify the utils/ctc_compile_dict_token.sh script vs. correcting the prep script to do the appropriate correction. I'm thinking that I'll correct the prep script, but if anyone has any thoughts on that let me know.
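For reference, a minimal sketch of what "synchronizing the lexicon" could look like (hypothetical file handling, not the actual prep-script code): drop any lexicon.txt entry containing a unit that is no longer in units.txt, so ctc_compile_dict_token.sh never sees an unmapped symbol such as spn:

```python
# Hypothetical sketch: filter lexicon entries whose units are absent from
# units.txt, so every symbol the FST compiler sees is mapped in tokens.txt.
def filter_lexicon(lexicon_lines, units):
    """Keep only pronunciations whose units all appear in `units`."""
    kept = []
    for line in lexicon_lines:
        word, *prons = line.split()
        if all(u in units for u in prons):
            kept.append(line)
    return kept

units = {"ah", "k", "t"}                 # e.g. units.txt after spn/npn were removed
lexicon = ["cat k ah t", "uh-huh spn"]   # second entry uses the removed unit
print(filter_lexicon(lexicon, units))    # the "spn" entry is dropped
```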

ramonsanabria commented 6 years ago

Great, yes, thanks @efosler - we really couldn't get a full recipe for tf + wfst. I have some spare scripts but nothing clean and official...

This is correct yes:

It looks like to me that run_ctc_phn.sh creates a set of training/cv labels that ignores noises (and overwrites units.txt, removing spn and npn). However, utils/ctc_compile_dict_token.sh assumes that units.txt and lexicon.txt are synchronized, resulting in the lovely error:

I was wondering whether there is an improved (or simpler) way to do the data preparation for characters and phonemes separately. Do you have any thoughts? I can try to help. Otherwise we can reuse the preparation from the master branch.

Also, another random thing that I observed with swbd: I tried to prepare the char setup by substituting numbers with written words and removing noises, but in the end it did not work out...

I am working on integrating the CharRNN decoding recipe that we have (it doesn't perform better than WFST, but it allows an open vocabulary): https://arxiv.org/abs/1708.04469.

Please let me know if I can help you somehow; I will be very happy to!

Thanks again!

efosler commented 6 years ago

Let me think about it as I play with the scripts. I just created a _tf version of swbd1_decode_graph.sh which gets rid of the -unk option, but it feels like there could be a better factorization.

efosler commented 6 years ago

So, an update: the good news is that I was able to get a decode to run all the way through. There does seem to be a bit of underperformance w.r.t. Yajie's runs on the non-tf version. Currently, I'm seeing 24.7% WER on eval2000 (vs. 21.0%) using the SWB + Fisher LM. I think there are a few differences:

  • 4-layer BiLSTM vs. 5-layer BiLSTM
  • I'm not sure that the default tf recipe currently checked in has speaker adaptation

I'm sure that there is some other possible set of parameter differences as well.

Just to check: what I did was just work with the output of ./steps/decode_ctc_am_tf.sh and feed the logprobs through latgen-faster. NB this just runs test.py in ctc-am rather than nnet.py or anything else (not sure if this is the right thing to do, but it's what's checked in).

Any thoughts on diffs between the tf and old versions that might be causing the discrepancy?

ramonsanabria commented 6 years ago

Hi Eric,

Thank you very much for that. Can you let me know what token error rate you were getting from the acoustic model? I have some experiments that achieved a WER well below this. Did you use the prior probabilities of each character during WFST decoding? Can you share the complete log of the acoustic model training?

Thanks!

Best, Ramon


efosler commented 6 years ago

Let me re-run it - I just started up the non-tf version and I think it blew away the tf version (hmpf). I am pretty sure that it isn't using the prior probabilities of each phone (not character) but I'm not sure. (I don't see where that would have been injected into the system).

ericbolo commented 6 years ago

@efosler I would also like to integrate the tf acoustic model into a WFST for decoding. As I understand this thread, you have managed to do that. Is any of your code in the repo?

I pulled tf_clean and asr_egs/swbd/v1-tf/run_ctc_phn.sh only does acoustic decoding.

Would be great if I could avoid starting from scratch :)

efosler commented 6 years ago

Sorry for the delay - I had a few other things pop up. The non-tf run didn't finish before we had a massive server shutdown because of a planned power outage (sigh). So @ericbolo let me try to run the v1-tf branch again and I can check in against my copy of the repository. I think that @ramonsanabria has had a better outcome than I have.

Basically, the things I had to do were slight modifications to building the TLG graph, followed by calling latgen-faster and score_sclite.sh. I'm sure that the decoding parameters aren't right, and I have to investigate whether I have the priors involved before decoding.

ericbolo commented 6 years ago

@efosler, thank you!


ramonsanabria commented 6 years ago

Hi all,

Yesterday I was able to re-run one of our complete EESEN + WFST pipelines using BPE units (https://arxiv.org/pdf/1712.06855.pdf) on SWBD. I hit 16.5% without fine-tuning. I feel I can maybe get some extra points just by playing a bit with the WFST parameters.

The BPE pipeline made me think that, in case you need a little more speed during decoding, you can also use BPE (bigger units mean fewer steps in the decoding process), and the accuracy is not that bad.
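To illustrate the speed point with a toy example (a hypothetical greedy longest-match segmenter with a made-up subword inventory, not the actual trained BPE model): merging characters into larger subword units shortens the label sequence the decoder has to step through.

```python
# Toy sketch: greedy longest-match segmentation against a tiny, hypothetical
# subword inventory. A real BPE model learns the inventory from data.
def segment(word, inventory):
    out, i = [], 0
    while i < len(word):
        # take the longest inventory unit matching at position i
        for j in range(len(word), i, -1):
            if word[i:j] in inventory:
                out.append(word[i:j])
                i = j
                break
        else:
            out.append(word[i])  # fall back to a single character
            i += 1
    return out

chars = list("switchboard")                        # 11 decoding steps
bpe = segment("switchboard", {"switch", "board"})  # 2 decoding steps
print(len(chars), len(bpe))
```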

PS: We also have a recipe to train the acoustic model on the whole Fisher + SWBD corpus, in case you need it for the real-time implementation that you have in mind.

Thanks!

Best, Ramon


ramonsanabria commented 6 years ago

Sorry, wrong paper. Correction - the BPE paper is this one: https://arxiv.org/pdf/1712.06855.pdf


efosler commented 6 years ago

@ramonsanabria thanks! We've been having some NFS issues here so I haven't gotten my full pass to run. It would be great to have this recipe in the mix. Does this want to go into v1-tf or should there be a v2-tf?

efosler commented 6 years ago

We finally got the NFS issues resolved, so I should have the training done by tomorrow-ish. @ramonsanabria, two questions:

1) I noticed that the default forward pass used epoch 14 (rather than the full model) - was there a reason for that, or is that something I should clean up? (This would be part of the reason for substandard results, but more likely item 2 below...)
2) I do not believe that the decoding is using priors. I see a start on that in ctc-am/tf/tf_test.py, but it doesn't seem to do anything with the priors, nor is there a model.priors file built during training (as far as I can tell). Am I missing something?

ramonsanabria commented 6 years ago

Hi Eric, sorry for not responding to the last message. We were also a little busy at JSALT.

Regarding v2-tf: yes, we can do that - cool idea. @fmetze, what do you think? I am still fine-tuning everything (WFST and AM) and I should include some code (mostly for the BPE generation), but it would be a good idea. We are preparing a second publication for SLT; after acceptance we can release the whole recipe.

Ok, let me send all the parameters that I am using. Can you share your TER results with your configuration? You might find some parameters that are currently not implemented in the master branch (dropout, etc.), but with the intersecting parameters you should be fine. With this configuration on swbd, I remember that @fmetze achieved something close to 11% TER.

    target_scheme {'no_name_language': {'no_target_name': 47}}
    drop_out 0.0
    sat_conf {'num_sat_layers': 2, 'continue_ckpt_sat': False, 'sat_stage': 'fine_tune', 'sat_type': 'non_adapted'}
    init_nproj 80
    clip 0.1
    nlayer 5
    nhidden 320
    data_dir /tmp/tmp.EKt4xyU6eX
    min_lr_rate 0.0005
    half_rate 0.5
    do_shuf True
    nepoch 30
    grad_opt grad
    random_seed 15213
    model_dir exp/fmetze_test_43j26e/model
    input_feats_dim 129
    batch_size 16
    kl_weight 0.0
    lstm_type cudnn
    lr_rate 0.05
    model deepbilstm
    nproj 340
    final_nproj 0
    half_after 8
    train_dir exp/fmetze_test_43j26e/
    online_augment_conf {'window': 3, 'subsampling': 3, 'roll': True}
    clip_norm False
    l2 0.001
    store_model True
    debug False
    continue_ckpt
    half_period 4
    force_lr_epoch False
    batch_norm True

Let me take a look at bullet point 2 later in the day. You should use https://github.com/srvk/eesen/blob/tf_clean/tf/ctc-am/test.py to perform testing. It will generate various versions of the forward pass (log_probs, logits, probs, etc.) with blank in position 0. I will need to clean this up so that the script only outputs what is really needed.

Once you have the log_probs, you can just apply the normal EESEN C++ recipe (i.e., apply the WFST to the log_probs). I am not sure why my character-based WFST is not working. I could make it work with bpe300 and other units, but not with characters. I will try to get back to you later on this.

Thanks!

efosler commented 6 years ago

No worries on lag - I think this is going to be an "over several weeks" thing as this isn't first priority for any of us (although high priority overall).

The TER I'm seeing is more around 15% (still training, but I don't see it likely to get much under 15%) - I will see if there are any diffs.

Meanwhile once I get the pipeline to finish, I'll check in a local copy for @ericbolo so that he can play around, since it is a working pipeline even if it isn't efficient or as high accuracy.

Thanks!

efosler commented 6 years ago

Just for the record, here are the diffs on config:

  • nlayer: 4 (vs 5)
  • input_feats_dim: 120 (vs 129)
  • batch_size: 32 (vs 16)
  • lr_rate: 0.005 (vs 0.05)
  • nproj: 60 (vs 340)
  • online_augment_conf.roll: False (vs True)
  • l2: 0.0 (vs 0.001)
  • batch_norm: False (vs True)

So it's pretty clear that there are some significant differences, and I'd believe the sum total of them could result in a 4% difference in TER (particularly layers, l2, batch norm, lr_rate, and maybe nproj). The really interesting question is what the extra 9 features are - it looks like one additional base feature with deltas/double-deltas and windowing applied.

    {'continue_ckpt': '', 'diff_num_target_ckpt': False, 'force_lr_epoch': False,
     'random_seed': 15213, 'debug': False, 'store_model': True,
     'data_dir': '/scratch/tmp/fosler/tmp.xJUz4scH4T',
     'train_dir': 'exp/train_phn_l4_c320_mdeepbilstm_w3_ntrue_p60_ip80_fp80',
     'batch_size': 32, 'do_shuf': True, 'nepoch': 30, 'lr_rate': 0.005,
     'min_lr_rate': 0.0005, 'half_period': 4, 'half_rate': 0.5, 'half_after': 8,
     'drop_out': 0.0, 'clip_norm': False, 'kl_weight': 0.0, 'model': 'deepbilstm',
     'lstm_type': 'cudnn', 'nproj': 60, 'final_nproj': 100, 'init_nproj': 80,
     'l2': 0.0, 'nlayer': 4, 'nhidden': 320, 'clip': 0.1, 'batch_norm': False,
     'grad_opt': 'grad',
     'sat_conf': {'sat_type': 'non_adapted', 'sat_stage': 'fine_tune',
                  'num_sat_layers': 2, 'continue_ckpt_sat': False},
     'online_augment_conf': {'window': 3, 'subsampling': 3, 'roll': False},
     'input_feats_dim': 120,
     'target_scheme': {'no_name_language': {'no_target_name': 43}},
     'model_dir': 'exp/train_phn_l4_c320_mdeepbilstm_w3_ntrue_p60_ip80_fp80/model'}
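On the feature-dimension question above, a quick arithmetic check (assuming the usual static + delta + double-delta layout, so dimensionality = base × 3, which is an assumption about how these features were extracted): three extra base features, e.g. pitch, would account for exactly the 120 vs. 129 gap.

```python
# Assumed layout: static + delta + double-delta = 3 copies of the base features.
def feat_dim(base):
    return base * 3

print(feat_dim(40))      # 120: fbank only
print(feat_dim(40 + 3))  # 129: fbank plus 3 pitch features
```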

ramonsanabria commented 6 years ago

Perfect, yes - with these parameters you should see an improvement, according to my experience. For input_feats_dim: 120 (vs 129), just extract fbank_pitch features (we should also change this in the v1-tf recipe). I assume that 'subsampling': 3 is correct, right? (This is also important.)

Also, I think these are implemented: 'nproj': 60, 'final_nproj': 100, 'init_nproj': 80. Otherwise I will push code to add them.

Thanks!


ericbolo commented 6 years ago

Hello all,

@efosler: yes, all I need is a running pipeline, not the best performing one, so that I can look at all the pieces of an online decoding system with tensorflow + wfst.


efosler commented 6 years ago

@ericbolo I've uploaded my changes to efosler/eesen so you can grab the newest copy. This should work - there are a few diffs with the graph prep scripts. Here's a list of the files that I changed, so that you can just grab them if you want:

  • asr_egs/swbd/v1-tf/local/swbd1_data_prep.sh
  • asr_egs/swbd/v1-tf/local/swbd1_decode_graph_tf.sh
  • asr_egs/swbd/v1-tf/run_ctc_phn.sh
  • asr_egs/swbd/v1/local/swbd1_data_prep.sh [cosmetic changes only]
  • asr_egs/wsj/steps/decode_ctc_lat_tf.sh
  • asr_egs/wsj/steps/train_ctc_tf.sh [python 3 compatibility]
  • asr_egs/wsj/utils/ctc_token_fst.py
  • asr_egs/wsj/utils/model_topo.py

My next step will be to try to rework the recipe so that it matches the parameters sent by @ramonsanabria. Once I've got that done and confirmed, I'll send a pull request.

efosler commented 6 years ago

NB: the decode script is woefully non-parallel (needs to be fixed), but for the online stuff this won't matter.

ericbolo commented 6 years ago

@efosler: wonderful, thanks!

I don't have the swbd db but I can adapt it for, say, tedlium.


efosler commented 6 years ago

Hey @ramonsanabria , quick question: you said...

Also those ones that I think are implemented: 'nproj': 60, 'final_nproj': 100, 'init_nproj': 80. Otherwise I will push code to have them.

Looking through the code base, it seems like these are passed as parameters - will it not do the right thing if those parameters are set?

efosler commented 6 years ago

About to go offline for a bit, so I won't be able to report on the full run, but training with the parameters above (same as @fmetze's run, but with nproj=60, final_nproj=100, init_nproj=80) does get down to 11.7% TER, so I will make those the defaults in the script going forward. Decoding hasn't happened yet.

ericbolo commented 5 years ago

@efosler, a quick update: I was able to run the full pipeline with tensorflow + language model decoding on a dummy dataset. Thanks again!

Next steps re:online decoding (#141): implementing a forward-only LSTM, and the loss function for student-teacher learning.


ericbolo commented 5 years ago

re: priors. As @efosler noted, it seems the priors are not used in the current decoding code.

in tf_test.py:

if config[constants.CONFIG_TAGS_TEST.USE_PRIORS]:
    # TODO we need to add priors also:
    # feed_priors = {i: y for i, y in zip(model.priors, config["prior"])}
    print(config[constants.CONFIG_TAGS_TEST.PRIORS_SCHEME])

model.priors doesn't seem to be generated anywhere, but we can use label.counts to generate it. @fmetze, are priors used in the original (C++) implementation?
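A minimal sketch of what generating priors from label counts and applying them could look like (hypothetical names and data layout - the actual label.counts format and nnet.py behavior may differ): normalize the counts to priors, then subtract the scaled log-priors from the log-posteriors before WFST decoding.

```python
import math

# Hypothetical sketch: derive log-priors from raw label counts and apply the
# standard prior normalization (log-posterior minus scaled log-prior).
def log_priors(counts):
    total = sum(counts)
    return [math.log(c / total) for c in counts]

def apply_priors(log_posteriors, counts, prior_scale=1.0):
    lp = log_priors(counts)
    # subtract the scaled log-prior for each label, frame by frame
    return [[p - prior_scale * pr for p, pr in zip(frame, lp)]
            for frame in log_posteriors]

counts = [500, 300, 200]        # e.g. parsed from label.counts
frames = [[-0.1, -2.0, -3.0]]   # one frame of log-posteriors
print(apply_priors(frames, counts))
```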

ramonsanabria commented 5 years ago

Hi all,

Priors are generated by:

labels=$dir_am/label.counts

gunzip -c $dir_am/labels.tr.gz | \
  awk '{line=$0; gsub(" "," 0 ",line); print line " 0";}' | \
  /data/ASR5/fmetze/eesen-block-copy/src/decoderbin/analyze-counts \
    --verbose=1 --binary=false ark:- $labels

Then you can use nnet.py as:

$decode_cmd JOB=1:$nj $mdl/log/decode.JOB.log \
  cat $PWD/$mdl/split$nj/JOB/feats.scp \| sort -k 1 \| \
  python utils/nnet.py --label-counts $labels --temperature $temperature --blank-scale $bkscale \| \
  latgen-faster --max-active=$max_active --max-mem=$max_mem --beam=$beam --lattice-beam=$lattice_beam \
    --acoustic-scale=$acwt --allow-partial=true --word-symbol-table=$graphdir/words.txt \
    $graphdir/TLG.fst ark:- "ark:|gzip -c > $mdl/lat/lat.JOB.gz" || exit 1;

This nnet.py currently uses TensorFlow; I have a version that doesn't rely on it. I will push it now and keep you posted.

Thanks!

ramonsanabria commented 5 years ago

Here is the commit of the new asr_egs/wsj/utils/nnet_notf.py (which does not use tf): 543c9edfe4e601b7f3e1f22feb7c9f64f5430908

Here are the parts of the code that I posted in the previous message (email replies do not render Markdown code formatting):

labels=$dir_am/label.counts

gunzip -c $dir_am/labels.tr.gz | \
  awk '{line=$0; gsub(" "," 0 ",line); print line " 0";}' | \
  /data/ASR5/fmetze/eesen-block-copy/src/decoderbin/analyze-counts \
    --verbose=1 --binary=false ark:- $labels

$decode_cmd JOB=1:$nj $mdl/log/decode.JOB.log \
  cat $PWD/$mdl/split$nj/JOB/feats.scp \| sort -k 1 \| \
  python utils/nnet.py --label-counts $labels --temperature $temperature --blank-scale $bkscale \| \
  latgen-faster --max-active=$max_active --max-mem=$max_mem --beam=$beam --lattice-beam=$lattice_beam \
    --acoustic-scale=$acwt --allow-partial=true --word-symbol-table=$graphdir/words.txt \
    $graphdir/TLG.fst ark:- "ark:|gzip -c > $mdl/lat/lat.JOB.gz" || exit 1;

efosler commented 5 years ago

So, an update on my progress with SWB (now that I'm getting back to this). I haven't tried out @ramonsanabria 's code above yet.

I'm able to train a SWB system getting 11.8% TER on the CV set (much better than before). However, decoding with this (again not with priors) gives me a 40+% WER - much worse than the previous setup. I'm trying to debug this to understand where things are going wrong.

One thing I tried to do was turn on TER calculation during the forward pass. I had to make some modifications to steps/decode_ctc_am_tf.sh so it passes the right flags to the test module. However, that seems to be a non-starter - the forward pass just hangs with no errors.

Seems like the next best step would be to just try to switch to @ramonsanabria 's decode strategy and abandon steps/decode_ctc_am_tf.sh?

efosler commented 5 years ago

@ramonsanabria what's a good (rough) value for blank_scale?

efosler commented 5 years ago

@ramonsanabria Now looking through nnet.py (and non-tf version) - this actually takes the output of the net and does the smoothing and priors as a filter, right? The code snippet you have above doesn't actually run the net forward, it seems to me, but would do something funky on the features in feats.scp.

ramonsanabria commented 5 years ago

Hi all,

How is it going? A good value for blank scale should be between 0.9 and 1.1, but it is something we should play with. Exactly: the nnet.py script only takes the posteriors from eesen, modifies them slightly (applies blank scaling, puts blank at index zero so the WFST can read it, applies temperature to the whole distribution, and applies priors, which will certainly improve the WER), and finally pipes them to the next script, which I believe is the WFST decoding.
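Those post-processing steps can be sketched roughly as follows (a hypothetical reconstruction inferred from the description above - nnet.py's actual flags, ordering, and prior handling may differ):

```python
# Hypothetical sketch of nnet.py-style post-processing on one frame of
# posteriors: scale the blank, apply temperature, renormalize, and move
# blank to index 0 so the WFST decoder can read it.
def postprocess(frame, blank_index, blank_scale=0.9, temperature=1.0):
    probs = list(frame)
    probs[blank_index] *= blank_scale                   # blank scaling
    probs = [p ** (1.0 / temperature) for p in probs]   # temperature
    z = sum(probs)
    probs = [p / z for p in probs]                      # renormalize
    # put blank first, keeping the other labels in order
    return [probs[blank_index]] + probs[:blank_index] + probs[blank_index + 1:]

frame = [0.2, 0.1, 0.7]            # blank is the last label here
out = postprocess(frame, blank_index=2)
print(out)                         # blank is now at index 0
```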

Will you guys be in India for Interspeech? Would be great to meet :)


ericbolo commented 5 years ago

India, I wish! But no...


efosler commented 5 years ago

Alas, I won't be in India either. (I might be able to stop by CMU sometime this semester, though.)

Update on progress: I wrote a small script to do greedy decoding on a logit/posterior stream and calculate the TER. (Will post this to my repo soonish and then send a pull request.) Found that on the SWB eval2000 test set I was getting 30% TER (this after priors; without priors it is worse). I was slightly puzzled by that, so I decided to calculate the TER on the train_dev set for SWB - I'm getting roughly 21-22% TER. This was a system that was reporting 11.8% TER on the same set during training. So something is rather hinky. Still digging, but if anyone has ideas, let me know.
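For reference, the TER half of such a script is just Levenshtein distance over the decoded token sequences. A minimal sketch (my own function names, not the ones in the repo):

```python
def edit_distance(ref, hyp):
    # standard Levenshtein DP over token sequences, O(len(ref)*len(hyp))
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,               # deletion
                           cur[j - 1] + 1,            # insertion
                           prev[j - 1] + (r != h)))   # substitution
        prev = cur
    return prev[-1]

def ter(ref, hyp):
    # token error rate in percent, relative to the reference length
    return 100.0 * edit_distance(ref, hyp) / len(ref)
```

Summing edit distances and reference lengths over all utterances before dividing gives the corpus-level TER reported below.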

efosler commented 5 years ago

I think I've enabled tf_train to dump out the forward pass on cv to see what's going on - whether there is a difference in the output. Took me a good chunk of the evening. One thing I did run across is that the forward pass on subsampled data gets averaged in tf_test - it's not clear to me whether the TER reported in tf_train is computed over the averaged output or (as I suspect) over all variants. I don't think this could account for a factor of two in TER, though.

FWIW, I think the code would be cleaner if common code in tf_train and tf_test were factored out - I had to copy a lot of code over, and I worry about inconsistencies between them (although they are hooked together through the model class).

efosler commented 5 years ago

Update from yesterday (now that the swbd system has some time to train): the dumped cv ark files do not show the same CTC error rate as the system claims. I am suspecting that the averaging might be doing something weird. Writing down assumptions here and someone can pick this apart:

```python
import itertools

def greedy_decode(logits):
    # best path: per-frame argmax, collapse repeats, drop blank (index 0)
    return [i for i, _ in itertools.groupby(logits.argmax(1)) if i > 0]
```

(Now this is making me wonder if the test set was augmented... hmmm...)

Anyway, just to give a sample of the difference in TER:

Reported by tf during training:

            Validate cost: 40.4, ter: 27.6%, #example: 11190
            Validate cost: 32.9, ter: 22.2%, #example: 11190
            Validate cost: 30.2, ter: 21.2%, #example: 11190
            Validate cost: 27.8, ter: 19.2%, #example: 11190
            Validate cost: 26.8, ter: 18.3%, #example: 11190
            Validate cost: 35.0, ter: 23.7%, #example: 11190
            Validate cost: 28.4, ter: 19.4%, #example: 11190
            Validate cost: 24.8, ter: 17.1%, #example: 11190

Decoding on the averaged stream:

            TER = 76690 / 152641 = 50.2
            TER = 69108 / 152380 = 45.4
            TER = 68611 / 152380 = 45.0
            TER = 62259 / 152380 = 40.9
            TER = 59838 / 152380 = 39.3
            TER = 72821 / 152380 = 47.8
            TER = 61498 / 152380 = 40.4
            TER = 59800 / 152380 = 39.2

efosler commented 5 years ago

@ramonsanabria and @fmetze can you confirm what the online feature augmentation is doing? I think I misunderstood it in my comments above. (I had visions of other types of augmentation going on but reading the code I think it's simpler than I thought.)

Looking through the code, it seems like when you have the subsample and window set to 3, it stacks three frames on the input and makes the input sequence a third as long. Is it also creating three variants with different shifts? I'm trying to figure out where the averaging would come in later.
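My reading of the stacking/subsampling, as a sketch (function name and details are mine, not the repo's code): with window 3, each starting offset yields one shifted variant, each roughly a third the original length with three times the feature dimension.

```python
import numpy as np

def stack_subsample(feats, window=3):
    # Stack `window` consecutive frames, keep every `window`-th stack,
    # once per starting offset -> `window` shifted variants.
    T, _ = feats.shape
    variants = []
    for shift in range(window):
        idx = np.arange(shift, T - window + 1, window)
        variants.append(np.concatenate(
            [feats[idx + k] for k in range(window)], axis=1))
    return variants

feats = np.random.randn(300, 40)      # 300 frames of 40-dim fbank
for v in stack_subsample(feats):
    print(v.shape)                    # (100, 120), (99, 120), (99, 120)
```

If this is right, the three variants are near-duplicates of the same utterance, which is where the question of averaging their outputs comes from.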

efosler commented 5 years ago

OK, I have figured out the discrepancy in output between forward passes and what is recorded by the training regime. tl;dr - the augmentation and averaging code in tf_test.py is at fault and should not be currently trusted. I'm working on a fix.

When training is done with augmentation (in this example, with window 3) 3 different shifted copies are created for training with stacked features. The TER is calculated for each copy (stream) by taking a forward pass and greedy decoding over the logits, then getting edit distance to the labels. The reported TER is over all copies.

At test time, it is not really clear what to do with 3 copies of the same logit stream. The test code (which I've replicated in the forward pass during training) assumes that the correct thing to do is to average the logit streams. This would be appropriate for a traditional frame-based NN system. However, in a CTC-based system there is no guarantee that outputs are synchronized across streams, so averaging them means the blank label can dominate where it should not (for example: if one stream greedily labels "A blank blank", the second "blank A blank", and the third "blank blank A", then the average stream might label "blank blank blank" - causing a deletion).
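A tiny numeric illustration of this failure mode (my own toy example, not code from the repo): each shifted stream decodes the label correctly on its own, but their average is blank everywhere.

```python
import itertools
import numpy as np

def greedy_decode(logits):
    # per-frame argmax, collapse repeats, drop blank (index 0)
    return [i for i, _ in itertools.groupby(logits.argmax(1)) if i > 0]

BLANK = np.array([0.9, 0.1])   # blank (index 0) dominant
A     = np.array([0.2, 0.8])   # label 1 ("A") dominant

# Three shifted streams that each emit "A" at a different frame.
s1 = np.stack([A, BLANK, BLANK])
s2 = np.stack([BLANK, A, BLANK])
s3 = np.stack([BLANK, BLANK, A])

print(greedy_decode(s1))                  # [1] -- stream alone decodes "A"
print(greedy_decode((s1 + s2 + s3) / 3))  # []  -- average is blank everywhere
```

At every frame the averaged blank mass is (0.9 + 0.9 + 0.2) / 3 ≈ 0.67, so blank wins all three frames and "A" is deleted.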

I verified this by only dumping out the first stream in the averaging rather than the average, and found that the CV TER was identical to that reported by the trainer. (That's not to say that the decoding was identical, but that the end number was the same.)

Upshot: it's probably best to arbitrarily take one of the streams and use it at test time - although perhaps there is a more appropriate combination scheme?

efosler commented 5 years ago

Created new issue for this particular bug. #194

efosler commented 5 years ago

Latest update: decoding with the sw+fish LM, incorporating priors, and fixing the averaging bug leads to 19.2% WER on eval2000, with the swbd subset getting 13.4% WER (the Kaldi triphone-based system gets 13.3% WER on the same set, although that may be a more involved model). I think this is close enough to declare victory on a baseline. I'll clean things up and then make a pull request.

efosler commented 5 years ago

Successful full train and decode; I also tested out a run with a slightly larger net (with a bit of improvement). Adding these baselines to the README file.

# CTC Phonemes on the Complete set (with 5 BiLSTM layers) with WFST decode                                                                                                                                                                                     
%WER 12.5 | 1831 21395 | 88.9 7.7 3.4 1.5 12.5 49.6 | exp/train_phn_fbank_pitch_l5_c320_mdeepbilstm_w3_ntrue_p60_ip80_fp80/results_epoch22_bs5.0_sw1_fsh_tgpr/score_8/eval2000.ctm.swbd.filt.sys
%WER 18.3 | 4459 42989 | 83.9 11.7 4.4 2.2 18.3 57.3 | exp/train_phn_fbank_pitch_l5_c320_mdeepbilstm_w3_ntrue_p60_ip80_fp80/results_epoch22_bs4.0_sw1_fsh_tgpr/score_8/eval2000.ctm.filt.sys
%WER 23.9 | 2628 21594 | 79.0 15.5 5.6 2.8 23.9 62.5 | exp/train_phn_fbank_pitch_l5_c320_mdeepbilstm_w3_ntrue_p60_ip80_fp80/results_epoch22_bs4.0_sw1_fsh_tgpr/score_8/eval2000.ctm.callhm.filt.sys

# Slightly larger model (400 units, 80 internal projections) with WFST decode                                                                                                                                                                                   
%WER 12.2 | 1831 21395 | 89.2 7.7 3.1 1.4 12.2 49.7 | exp/train_phn_fbank_pitch_l5_c400_mdeepbilstm_w3_ntrue_p80_ip80_fp80/results_epoch23_bs7.0_sw1_fsh_tgpr/score_10/eval2000.ctm.swbd.filt.sys
%WER 17.8 | 4459 42989 | 84.1 11.1 4.8 1.9 17.8 57.1 | exp/train_phn_fbank_pitch_l5_c400_mdeepbilstm_w3_ntrue_p80_ip80_fp80/results_epoch23_bs7.0_sw1_fsh_tgpr/score_9/eval2000.ctm.filt.sys
%WER 23.4 | 2628 21594 | 79.3 14.8 5.9 2.7 23.4 62.1 | exp/train_phn_fbank_pitch_l5_c400_mdeepbilstm_w3_ntrue_p80_ip80_fp80/results_epoch23_bs7.0_sw1_fsh_tgpr/score_10/eval2000.ctm.callhm.filt.sys
ramonsanabria commented 5 years ago

Awesome, thank you so much Eric! The numbers look great. Can you share the full training configuration?

Thank you again!

efosler commented 5 years ago

Just submitted the pull request (#196).

efosler commented 5 years ago

Once we decide that #196 is all good, I think we can close this particular thread!!!

efosler commented 5 years ago

OK, closing this particular thread. Whew!