k2-fsa / icefall

https://k2-fsa.github.io/icefall/
Apache License 2.0

Gigaspeech experiment #341

Open · zzzacwork opened this issue 2 years ago

zzzacwork commented 2 years ago

Hi guys,

I am trying to replicate the results from the GigaSpeech recipe, for comparison with other models we trained before.

The pretrained model was obtained from the GigaSpeech repo.

The command I ran is as follows:

nohup ./conformer_ctc/decode.py   --epoch 0  --method whole-lattice-rescoring   --num-paths 1000   --exp-dir icefall-asr-gigaspeech-conformer-ctc/exp   --lang-dir icefall-asr-gigaspeech-conformer-ctc/data/lang_bpe_500 --lm-dir icefall-asr-gigaspeech-conformer-ctc/data/lm   --max-duration 20   --num-workers 1 

The first thing is that I got surprisingly (or suspiciously) good results on the test data: lm_scale_0.1 gives 5.22, the best for test, compared to the reported result. Did I do something wrong? I also got a similarly good result on the dev set.

The second thing is that I tried to decode on CPU on a 64-core machine, but the decoding program only makes use of 2 cores (according to top), even after I changed --num-workers to 60. Is there any other parameter that controls parallelization on CPU?

Thank you!

csukuangfj commented 2 years ago

For the WER, you can look at the decoding results: there should be two files, errs-xxxxx and recogs-xxx, in the decoding directory.

As for CPU usage, only the neural-network computation can run in parallel on CPU; you can look up how to do parallel computation with PyTorch on CPU. The search part in k2 runs on a single CPU core when only CPUs are available.
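For illustration, a minimal sketch of the PyTorch-side threading knobs (in the recipes I've seen, --num-workers only controls the DataLoader worker processes, not compute threads); the thread counts below are just placeholders:

```python
import torch

# Intra-op parallelism: threads used inside a single operator,
# e.g. a matrix multiplication in the conformer encoder.
torch.set_num_threads(32)

# Inter-op parallelism: threads used to run independent operators
# concurrently. Should be set before any parallel work starts.
torch.set_num_interop_threads(8)

print(torch.get_num_threads(), torch.get_num_interop_threads())
```

Exporting OMP_NUM_THREADS / MKL_NUM_THREADS before launching decode.py has a similar effect; the k2 lattice search itself still runs on a single CPU thread, as noted above.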

zzzacwork commented 2 years ago

Thank you for the response. I ran a separate WER calculation on the output recogs-xxx file and got 10.25% on the GigaSpeech test set, which is very close to the reported score.

I am also wondering whether there is an endpoint detector (VAD) inside icefall that can be used to preprocess input audio streams; I cannot find any mention of one in the codebase.
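Regarding the separate WER check above, here is a minimal standalone sketch of the computation over parallel lists of reference and hypothesis word lists; parsing of the recogs-xxx file is omitted because its exact layout may differ between recipes:

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two lists of words."""
    m, n = len(ref), len(hyp)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i
    for j in range(n + 1):
        dp[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(
                dp[i - 1][j] + 1,         # deletion
                dp[i][j - 1] + 1,         # insertion
                dp[i - 1][j - 1] + cost,  # substitution
            )
    return dp[m][n]


def wer(refs, hyps):
    """WER over parallel lists of reference/hypothesis word lists."""
    errors = sum(edit_distance(r, h) for r, h in zip(refs, hyps))
    total = sum(len(r) for r in refs)
    return errors / total


print(wer([["hello", "world"]], [["hello", "word"]]))  # 0.5
```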

danpovey commented 2 years ago

We haven't implemented endpoint detection yet.

zzzacwork commented 2 years ago

Thanks for the response!

We are trying to adapt the general model to a specific subdomain (medical) using only text data.

I generated new G_3_gram.pt and G_4_gram.pt files with the pretrained BPE model and regenerated HLG.pt from them, with a modified lexicon.txt and words.txt as well.
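For illustration, a rough sketch of how an in-domain G in OpenFst text format can be turned into the G_*.pt file that decode.py loads; this assumes k2's from_openfst API and the file names used in this recipe, and skips any extra attributes (e.g. lm_scores) the decoding script may add:

```python
import k2
import torch

# Assumes data/lm/G_3_gram.fst.txt was built from the in-domain ARPA LM
# against the same words.txt as the lang directory, with '#0' as the
# disambiguation symbol.
with open("data/lm/G_3_gram.fst.txt") as f:
    G = k2.Fsa.from_openfst(f.read(), acceptor=False)

torch.save(G.as_dict(), "data/lm/G_3_gram.pt")

# decode.py can later restore it with:
#   G = k2.Fsa.from_dict(torch.load("data/lm/G_3_gram.pt"))
```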

The resulting WER is OK, but the model doesn't pick up long medical words, compared to Kaldi's TDNN model (we are using the pretrained conformer+CTC model). One thing I noticed is that Kaldi has a word insertion penalty, while I cannot find such a parameter in icefall. Is there an equivalent parameter for the same purpose?

Do you have some advice on what we might do to improve on this?

Thank you!

danpovey commented 2 years ago

What scale are you using on the LM scores? I think an issue with the transformer-decoder type of model (and actually, most "end-to-end" models) is that the model itself contains an implicit LM, which means that if you want to combine it with an external LM you have to use a scaling factor, and that scaling means the external LM has less influence than it would in a hybrid system. I think we've played around with word penalties in icefall in the past, but I don't recall seeing much benefit from them.
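As a purely illustrative worked example of that scaling (the numbers are made up), the combined score of a lattice path during rescoring is roughly the model score plus lm_scale times the external LM score, so a small lm_scale limits how much the external G can change the ranking:

```python
# Illustrative only: how lm_scale weights the external LM.
am_score = -12.0   # score from the end-to-end model (implicit LM included)
lm_score = -30.0   # score from the external G for the same word sequence

for lm_scale in (0.1, 0.3, 0.7, 1.0):
    total = am_score + lm_scale * lm_score
    print(f"lm_scale={lm_scale}: combined score = {total:.1f}")
```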

zzzacwork commented 2 years ago

I tried LM scales from 0.2 to 1. I don't quite understand the LM inside the end-to-end model itself; is that the G that was composed with H and L in the HLG model? I also swapped that LM (G_3_gram.pt) with the 3-gram LM we built from our data.

I am wondering whether there is a way to look into the search space resulting from the HLG FST composition, before the whole-lattice rescoring. I would like to double-check whether the reference was in the lattice, so that I can narrow the issue down to pruning or to the model scores.

danpovey commented 2 years ago

The external LM would be the G. The end-to-end model implicitly has an LM, but it can't really be separated from the rest of the model. Search in icefall for 'oracle'; there is now a way to measure oracle WERs of lattices, I think. I don't know the best way to print out or inspect the lattices, though.
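For example (assuming the GigaSpeech conformer_ctc recipe kept the nbest-oracle decoding method from the LibriSpeech one), something along these lines reports the best WER achievable from the paths sampled from the lattice, which tells you whether the reference survived pruning:

./conformer_ctc/decode.py --epoch 0 --method nbest-oracle --num-paths 1000 --exp-dir icefall-asr-gigaspeech-conformer-ctc/exp --lang-dir icefall-asr-gigaspeech-conformer-ctc/data/lang_bpe_500 --max-duration 20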

zzzacwork commented 2 years ago

I read a blog post (https://lorenlugosch.github.io/posts/2020/11/transducer/) about training (pretraining) the predictor on text data alone. Does icefall currently support that in the RNN-T recipe? I hope it would help with the adaptation problem.

wgb14 commented 2 years ago

> The first thing is that I got surprisingly (or suspiciously) good results on the test data: lm_scale_0.1 gives 5.22, the best for test, compared to the reported result. Did I do something wrong?

Sorry, this is a bug in the GigaSpeech decoding script; it should be fixed by https://github.com/k2-fsa/icefall/pull/352.

danpovey commented 2 years ago

> I read a blog post (https://lorenlugosch.github.io/posts/2020/11/transducer/) about training (pretraining) the predictor on text data alone. Does icefall currently support that in the RNN-T recipe?

We don't support that. Anyway, our predictor network (we call it the "decoder") only sees 2 symbols of context (a "stateless transducer"), which probably makes that irrelevant.
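For context, a rough sketch of what such a stateless predictor looks like: an embedding of the last couple of symbols followed by a small convolution instead of an LSTM, so there is no recurrent state that text-only pretraining could usefully shape. This is a simplification of the actual icefall module, with made-up dimensions:

```python
import torch
import torch.nn as nn


class StatelessDecoder(nn.Module):
    """Predictor that only looks at the last `context_size` symbols."""

    def __init__(self, vocab_size: int, embed_dim: int, context_size: int = 2):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        # A 1-D depthwise conv over the last `context_size` embeddings
        # replaces any recurrent state.
        self.conv = nn.Conv1d(
            embed_dim, embed_dim, kernel_size=context_size,
            padding=context_size - 1, groups=embed_dim,
        )

    def forward(self, y: torch.Tensor) -> torch.Tensor:
        # y: (batch, num_symbols) of token ids
        emb = self.embedding(y).permute(0, 2, 1)   # (B, E, U)
        out = self.conv(emb)[:, :, : y.size(1)]    # keep it causal
        return out.permute(0, 2, 1)                # (B, U, E)


dec = StatelessDecoder(vocab_size=500, embed_dim=256)
print(dec(torch.randint(0, 500, (4, 10))).shape)  # torch.Size([4, 10, 256])
```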