thu-spmi / CAT

A CRF-based ASR Toolkit
Apache License 2.0

Code formatting and updates #77

Closed maxwellzh closed 1 year ago

maxwellzh commented 1 year ago

Changes in these commits:

  1. Format .py files with the Black formatter;
  2. Switch to a new implementation of CorpusDataset packing & loading for better robustness. After updating, one has to remove previously packed data (commonly in exp/xxx/pkl) and re-run stage 2 (data packing);
  3. [important] Data loading for large-corpora training now supports bucket-like loading. After updating, the options in hyper-p.json:train:option must be updated accordingly. Please take a look at the new template egs/TEMPLATE/exp/asr-ctc-large-corpora;
  4. LM data for the template experiments is switched to the libri corpus, because PTB data is not freely available;
  5. Remove path_weight-related code in the ctc_crf source files;
  6. Module rename:

    • 'am' -> 'encoder' in CTC trainer;
    • 'trans' -> 'clm' in CausalTransformer.

    These two renames break loading of previous checkpoints. Use the scripts in utils/compat/ to migrate checkpoints to the latest naming.

  7. Add a PretrainedTokenizer implementation to support loading pretrained tokenizers from Hugging Face.
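
To illustrate why the module renames in item 6 affect old checkpoints, here is a minimal sketch of what a key migration could look like. It is not the code in utils/compat/; the function name `rename_module` and the example keys (such as `module.am.linear.weight`) are hypothetical, and the real checkpoint layout may differ. For actual migration, use the scripts in utils/compat/.

```python
from collections import OrderedDict


def rename_module(state_dict, old, new):
    """Return a copy of `state_dict` in which every dotted-key
    component equal to `old` is replaced by `new`.

    Hypothetical helper for illustration only; values are left untouched.
    """
    renamed = OrderedDict()
    for key, value in state_dict.items():
        parts = [new if part == old else part for part in key.split(".")]
        renamed[".".join(parts)] = value
    return renamed


# Hypothetical parameter names from a checkpoint saved before the rename.
# Plain integers stand in for tensors to keep the sketch dependency-free.
old_ckpt = OrderedDict(
    [
        ("module.am.linear.weight", 1),
        ("module.am.linear.bias", 2),
        ("module.classifier.weight", 3),
    ]
)

# 'am' -> 'encoder', as in the CTC trainer rename described above.
new_ckpt = rename_module(old_ckpt, "am", "encoder")
print(list(new_ckpt))
```

The same helper would cover the 'trans' -> 'clm' rename by passing those two names instead. Matching on whole dotted components, rather than substrings, avoids accidentally rewriting unrelated names that merely contain "am" or "trans".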