Switch to a new implementation of CorpusDataset packing & loading for better robustness. After updating, you have to remove the previously packed data (commonly under exp/xxx/pkl) and re-run stage 2 (data packing);
[important] Data loading for large-corpora training now supports bucket-like loading. After updating, the options in hyper-p.json:train:option should be updated accordingly; please take a look at the new template egs/TEMPLATE/exp/asr-ctc-large-corpora;
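The shape of the updated training options might look roughly like the sketch below. The key names here (batch_mode, bucket_size) are hypothetical placeholders; consult the egs/TEMPLATE/exp/asr-ctc-large-corpora template for the actual option names.

```json
{
  "train": {
    "option": {
      "batch_mode": "bucket",
      "bucket_size": 512
    }
  }
}
```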
LM data for the template experiments is switched to the LibriSpeech corpus, since the PTB data is not freely available;
Remove path_weight-related code in the ctc_crf source files;
Module renames:
'am' -> 'encoder' in CTC trainer;
'trans' -> 'clm' in CausalTransformer.
These two changes break loading of previous checkpoints. Use the scripts in utils/compat/ to migrate checkpoints to the latest naming.
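Conceptually, the migration amounts to renaming parameter-key prefixes in the checkpoint's state dict. A minimal sketch of that idea follows; the function name and rename logic here are illustrative only, not the actual utils/compat/ scripts, which should be preferred.

```python
def migrate_state_dict(state_dict,
                       renames=(("am.", "encoder."), ("trans.", "clm."))):
    """Return a copy of state_dict with old module-name prefixes renamed.

    Maps 'am.*' -> 'encoder.*' (CTC trainer) and
    'trans.*' -> 'clm.*' (CausalTransformer), per the renames above.
    """
    migrated = {}
    for key, value in state_dict.items():
        for old, new in renames:
            if key.startswith(old):
                key = new + key[len(old):]
                break
        migrated[key] = value
    return migrated


# Illustrative usage with dummy tensors replaced by plain numbers:
old_ckpt = {"am.conv.weight": 1, "trans.layer0.bias": 2, "scale": 3}
print(migrate_state_dict(old_ckpt))
```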
Add a PretrainedTokenizer implementation to support loading pretrained tokenizers from HuggingFace.