k2-fsa / snowfall

Moved to https://github.com/k2-fsa/icefall

WIP: Add BPE training with LF-MMI. #215

Open csukuangfj opened 3 years ago

csukuangfj commented 3 years ago

A small vocab_size, e.g., 200, is used to avoid OOM if the bigram P is used. After removing P, it is possible to use a large vocab size, e.g., 5000.
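For reference, this is roughly how such a small-vocab BPE model can be trained with sentencepiece (a sketch only; the file names are placeholders, not the exact paths used in this PR):

```python
# Illustrative sketch: train a BPE model with a small vocabulary (e.g. 200
# pieces) so that the bigram P over word pieces stays small enough to avoid
# OOM. Paths below are placeholders.
import sentencepiece as spm

spm.SentencePieceTrainer.train(
    input="data/lang_bpe/transcript_words.txt",  # one transcript per line
    model_prefix="data/lang_bpe/bpe",            # writes bpe.model / bpe.vocab
    vocab_size=200,    # kept small on purpose; ~5000 becomes feasible once P is removed
    model_type="bpe",
)

sp = spm.SentencePieceProcessor()
sp.load("data/lang_bpe/bpe.model")
print(sp.encode("HELLO WORLD", out_type=str))  # list of word pieces
```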

@glynpu is doing BPE CTC training. We can use his implementation once it's ready. This pull request is for experimental purposes.


Will add decoding code later.

--

Training is still ongoing. The TensorBoard training log is available at https://tensorboard.dev/experiment/CN5yTQNmTLODdyLZA6K8rQ/#scalars&runSelectionState=eyIuIjp0cnVlfQ%3D%3D

danpovey commented 3 years ago

BTW, the way I think we can solve the memory-blowup issue is:

(i) use the new, more compact CTC topo;
(ii) train a bigram ARPA LM to make a compact LM, e.g. with kaldi's make_kn_lm.py; load it into k2 as P (no disambig symbols!) and remove epsilons. k2 uses a rm-epsilon algorithm that should keep the epsilon-free LM compact, unlike OpenFst, which would cause it to blow up.

BTW, I am asking some others to add a pruning option to make_kn_lm.py.
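A rough sketch of that pipeline in k2 (the file name and the way the pieces are wired together are assumptions for illustration, not code from this PR):

```python
# Sketch only: compact ("modified") CTC topo plus an epsilon-free bigram P.
# Assumes P.fst.txt is a bigram LM already converted from ARPA to OpenFst
# text format, without disambiguation symbols.
import k2

max_token_id = 199  # vocab_size - 1 for a 200-piece BPE model
ctc_topo = k2.ctc_topo(max_token_id, modified=True)  # the more compact topology

with open("data/lang_bpe/P.fst.txt") as f:
    P = k2.Fsa.from_openfst(f.read(), acceptor=True)

# k2's remove_epsilon should keep the epsilon-free LM compact, unlike
# OpenFst's epsilon removal, which can blow up on this kind of LM.
P = k2.remove_epsilon(P)
P = k2.arc_sort(P)
```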

csukuangfj commented 3 years ago

> (i) use the new, more compact CTC topo

Yes, I am using the new CTC topo.

> (ii) train a bigram ARPA LM to make a compact LM, e.g. with kaldi's make_kn_lm.py;

Will update the code to train a word piece bigram ARPA LM with make_kn_lm.py.
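Roughly, that could look like the following (a sketch, assuming a trained BPE model and kaldi's make_kn_lm.py with its usual -ngram-order/-text/-lm options; paths are placeholders):

```python
# Sketch only: turn word transcripts into word-piece transcripts so that a
# bigram ARPA LM over BPE pieces can then be trained with make_kn_lm.py.
import sentencepiece as spm

sp = spm.SentencePieceProcessor()
sp.load("data/lang_bpe/bpe.model")

with open("data/lang_bpe/transcript_words.txt") as fin, \
        open("data/lang_bpe/transcript_tokens.txt", "w") as fout:
    for line in fin:
        pieces = sp.encode(line.strip(), out_type=str)
        fout.write(" ".join(pieces) + "\n")

# Then, for example:
#   make_kn_lm.py -ngram-order 2 \
#     -text data/lang_bpe/transcript_tokens.txt \
#     -lm data/lang_bpe/P.arpa
```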