csukuangfj opened this pull request 3 years ago
BTW, the way I think we can solve the memory-blowup issue is: (i) use the new, more compact CTC topo (ii) train a bigram ARPA LM to make a compact LM, e.g. with kaldi's make_kn_lm.py; load it into k2 as P (no disambig symbols!), and remove epsilons. k2 uses a rm-epsilon algorithm that should keep the epsilon-free LM compact, unlike OpenFst which would cause it to blow up. BTW, I am asking some others to add a pruning option to make_kn_lm.py.
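To make the rm-epsilon point concrete, here is a toy epsilon-removal pass over a weighted FSA in the tropical semiring. This is a sketch of the idea only — k2's actual algorithm is more sophisticated (and GPU-friendly); the arc tuple format and the `EPS` convention here are illustrative assumptions, not k2's internal representation.

```python
# Toy epsilon removal on a weighted FSA in the tropical semiring (min, +).
# Sketch only: the arc format (src, label, dst, weight) and EPS convention
# are illustrative assumptions, not k2's internal representation.
EPS = 0  # epsilon label, following the OpenFst convention

def eps_closure(arcs, state):
    """Map each state reachable from `state` via epsilon-only paths
    to the best (minimum) accumulated weight."""
    best = {state: 0.0}
    stack = [state]
    while stack:
        s = stack.pop()
        for src, label, dst, w in arcs:
            if src == s and label == EPS:
                cand = best[s] + w
                if dst not in best or cand < best[dst]:
                    best[dst] = cand
                    stack.append(dst)
    return best

def remove_epsilon(arcs, num_states):
    """Return an equivalent arc list containing no epsilon arcs:
    each epsilon path followed by a real arc becomes one direct arc."""
    new_arcs = set()
    for q in range(num_states):
        for p, w in eps_closure(arcs, q).items():
            for src, label, dst, aw in arcs:
                if src == p and label != EPS:
                    new_arcs.add((q, label, dst, w + aw))
    return sorted(new_arcs)

# An epsilon arc 0->1 followed by a real arc 1->2 collapses into one
# direct arc 0->2 carrying the summed weight:
arcs = [(0, EPS, 1, 0.5), (1, 3, 2, 1.0)]
print(remove_epsilon(arcs, 3))  # [(0, 3, 2, 1.5), (1, 3, 2, 1.0)]
```

Whether the epsilon-free result stays compact depends on how many new arcs this cross-product step creates, which is exactly where implementations can differ.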
> (i) use the new, more compact CTC topo
Yes, I am using the new CTC topo.
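For context on why the topology choice matters, here is an arc-counting sketch contrasting a standard CTC topology (quadratic in vocabulary size) with one possible compact variant (linear). This illustrates the idea only and is not necessarily k2's exact `ctc_topo` construction; the arc format and the eps/blank conventions are assumptions.

```python
# Arc-count sketch: standard CTC topology vs. a compact variant.
# Illustrative only -- not necessarily k2's exact ctc_topo construction.
# Arc format: (src_state, ilabel, olabel, dst_state).
EPS, BLANK = -1, 0

def standard_ctc_topo(num_tokens):
    """State j means 'the last frame carried label j' (state 0 covers
    blank).  Every state connects to every state, so the arc count
    grows as (N + 1)^2."""
    arcs = []
    for i in range(num_tokens + 1):
        for j in range(num_tokens + 1):
            # Repeats (j == i) and blank frames emit nothing.
            olabel = EPS if j == i or j == BLANK else j
            arcs.append((i, j, olabel, j))
    return arcs

def compact_ctc_topo(num_tokens):
    """One possible compact variant: route every token through a hub
    state 0, giving 3N + 1 arcs.  (This particular sketch forces a
    blank between different tokens -- one of the trade-offs such
    modified topologies play with.)"""
    arcs = [(0, BLANK, EPS, 0)]
    for t in range(1, num_tokens + 1):
        arcs += [(0, t, t, t),        # enter token t and emit it once
                 (t, t, EPS, t),      # repeated frames collapse
                 (t, BLANK, EPS, 0)]  # return to the hub via blank
    return arcs

print(len(standard_ctc_topo(200)))  # 40401 arcs
print(len(compact_ctc_topo(200)))   # 601 arcs
```

At a word-piece vocabulary of 5000 the quadratic version is already at ~25M arcs before any composition, which is where the memory pressure comes from.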
> (ii) train a bigram ARPA LM to make a compact LM, e.g. with kaldi's make_kn_lm.py;
I will update the code to train a word-piece bigram ARPA LM with make_kn_lm.py.
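As a rough picture of what a Kneser-Ney estimator like make_kn_lm.py computes, here is a minimal interpolated KN bigram in plain Python. This is a deliberate simplification — one fixed discount, no sentence-boundary symbols, no ARPA output — and the function names are mine, not Kaldi's.

```python
# Minimal interpolated Kneser-Ney bigram (simplified: one fixed discount,
# no <s>/</s> handling, no ARPA file output).  Illustration only.
from collections import Counter

def kn_bigram(tokens, discount=0.75):
    bigrams = Counter(zip(tokens, tokens[1:]))     # c(v, w)
    unigram = Counter(tokens[:-1])                 # context counts c(v)
    cont = Counter(w for (_, w) in bigrams)        # N1+(., w): distinct left contexts
    total_types = len(bigrams)                     # N1+(., .)

    def prob(w, v):
        """P(w | v) = discounted bigram ML estimate, interpolated with the
        Kneser-Ney continuation probability N1+(., w) / N1+(., .)."""
        c_v = unigram[v]
        if c_v == 0:
            return cont[w] / total_types           # unseen context: pure backoff
        n1plus_v = sum(1 for (a, _) in bigrams if a == v)
        lam = discount * n1plus_v / c_v            # interpolation weight
        return (max(bigrams[(v, w)] - discount, 0) / c_v
                + lam * cont[w] / total_types)

    return prob

tokens = "a b a b c a b b c a".split()
prob = kn_bigram(tokens)
# Interpolation keeps the distribution proper: probabilities over the
# vocabulary sum to 1 for an observed context.
print(round(sum(prob(w, "a") for w in ("a", "b", "c")), 6))  # 1.0
```

The real script additionally handles sentence boundaries, per-order discounts, and writes the standard ARPA format that can then be loaded into k2 as P.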
A small vocab_size (e.g., 200) is used to avoid OOM when the bigram P is included. After removing P, a larger vocab size (e.g., 5000) becomes feasible.
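A back-of-the-envelope for the vocab-size trade-off: an unpruned bigram P over V word pieces can have up to V² bigram arcs. The 16 bytes/arc figure below assumes an arc stored as three int32 fields plus a float32 score; that per-arc size is my assumption, not a measured number.

```python
# Worst-case size of an unpruned bigram P over V word pieces: up to V^2
# bigram arcs, so the memory bound grows quadratically with vocab_size.
# bytes_per_arc=16 assumes 3 x int32 (src, dst, label) + 1 x float32 score.
def bigram_arc_bound(vocab_size, bytes_per_arc=16):
    arcs = vocab_size ** 2
    return arcs, arcs * bytes_per_arc

print(bigram_arc_bound(200))   # (40000, 640000)       -- ~0.6 MiB
print(bigram_arc_bound(5000))  # (25000000, 400000000) -- ~381 MiB
```

And this bounds only P itself; intersecting P with dense lattices during training multiplies these figures further, which is why removing P (or pruning the LM) is what unlocks the larger vocabulary.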
@glynpu is working on BPE CTC training; we can switch to his implementation once it is ready. This pull request is for experimental purposes.
Will add decoding code later.
--
The training is still ongoing. The TensorBoard training log is available at https://tensorboard.dev/experiment/CN5yTQNmTLODdyLZA6K8rQ/#scalars&runSelectionState=eyIuIjp0cnVlfQ%3D%3D