thu-spmi / CAT

A CRF-based ASR Toolkit

Apache License 2.0

Code formatting and updates #77

Closed maxwellzh closed 1 year ago

maxwellzh commented 1 year ago

Changes in these commits:

Format .py files with the black formatter;
Switch to a new implementation of CorpusDataset packing & loading for better robustness. After updating, you must remove previously packed data (commonly in exp/xxx/pkl) and re-run stage 2 (data packing);
[important] Data loading for large-corpora training now supports bucket-like loading. After updating, the options in hyper-p.json:train:option must be changed accordingly. Please take a look at the new template egs/TEMPLATE/exp/asr-ctc-large-corpora;
LM data for the template experiments is switched to the libri corpus, because PTB data is not freely available;
Remove path_weight-related code from the ctc_crf source files;
Module rename:
- 'am' -> 'encoder' in CTC trainer;
- 'trans' -> 'clm' in CausalTransformer.
These two renames break loading of earlier checkpoints. Use the scripts in utils/compat/ to migrate checkpoints to the new naming.
Add a PretrainedTokenizer implementation to support loading pretrained tokenizers from Hugging Face.
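
For the module renames listed above, the kind of key rewrite involved in migrating a checkpoint can be sketched roughly as follows. This is only an illustration, not the scripts in utils/compat/: it assumes the checkpoint is a flat mapping whose keys start with the module name (e.g. 'am.'), and the helper name rename_prefix is hypothetical. Real checkpoints may nest these names deeper (for example under a wrapper prefix), which this sketch does not handle.

```python
from collections import OrderedDict


def rename_prefix(state_dict, old, new):
    """Return a copy of state_dict with the leading module name `old` replaced by `new`.

    Assumption: keys look like "<module>.<rest>", e.g. "am.linear.weight".
    Keys that merely start with the same letters (e.g. "amp.scale") are kept as-is.
    """
    renamed = OrderedDict()
    for key, value in state_dict.items():
        if key == old or key.startswith(old + "."):
            key = new + key[len(old):]
        renamed[key] = value
    return renamed


# Toy stand-in for an old CTC trainer checkpoint; real ones hold tensors.
old_ckpt = OrderedDict(
    [("am.linear.weight", 1), ("am.linear.bias", 2), ("amp.scale", 3)]
)
new_ckpt = rename_prefix(old_ckpt, "am", "encoder")
print(list(new_ckpt))
```

The same helper would cover the CausalTransformer case with rename_prefix(ckpt, "trans", "clm"), under the same assumption about key layout.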