k2-fsa / icefall

https://k2-fsa.github.io/icefall/
Apache License 2.0

Questions about modifying prepare.sh for training ASR model on custom data #1636

Closed: daocunyang closed this issue 1 week ago

daocunyang commented 3 months ago

Hi, opening a new issue since the old one has been closed.

We are currently writing our own prepare.sh to train an ASR model on our own Chinese audio data, following the example of aishell's prepare.sh. Given our lack of experience, we are unsure about some of its contents. Here are our questions:

  1. What role does vocab_sizes play, and how do we decide what number to assign to it? Do we need it at all? (A sketch of the usual pattern follows this list.)

  2. Looking at stages 5 to 8 of aishell's prepare.sh, from what I can tell we need to replace aishell_transcript_v0.8.txt (line 151) with our own text file, correct? Other than that, is there anything else we need to modify in these stages to prepare our own data?

  3. We currently have only a few hundred audio files for training (not that many). How do you suggest we divide the data into training and test sets? We are thinking of using most (or perhaps all) of them for training and few (or even none) for the test set. (See the split sketch after this list.)

  4. Just to confirm: since we are not using Whisper, we can remove the part related to Whisper large-v3 at the end of prepare.sh, right?

  5. We plan to use the lexicon.txt file from aishell, but certain words that are important to us are missing from it. For example, we want to add the word "对的". Is it actually necessary to add it, though? I noticed that aishell's lexicon.txt already contains the following entries for the characters that make up "对的" (see the lexicon sketch after this list):

    对 d ui4
    的 d e5
    的 d i2
    的 d i4
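
For context on question 1, here is the pattern we have seen for vocab_sizes in other icefall recipes: each entry produces one lang directory containing a BPE model with that many tokens. The sketch below follows the librispeech-style recipes; the paths are illustrative, and a purely character-based Chinese setup may not need BPE at all.

    # Illustrative: one BPE model is trained per entry in vocab_sizes.
    vocab_sizes=(500)
    for vocab_size in "${vocab_sizes[@]}"; do
      lang_dir=data/lang_bpe_${vocab_size}
      mkdir -p $lang_dir
      # Train a BPE model of the given size on the training transcripts.
      ./local/train_bpe_model.py \
        --lang-dir $lang_dir \
        --vocab-size $vocab_size \
        --transcript $lang_dir/transcript_words.txt
    done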
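
Regarding question 3, this is the kind of split we have in mind, done directly on a Lhotse cuts manifest (one cut per line, so shuffling lines is safe; the file names are hypothetical):

    # Illustrative 90/10 split of a cuts manifest into train and test sets.
    gunzip -c data/fbank/cuts_all.jsonl.gz | shuf > cuts_shuffled.jsonl
    total=$(wc -l < cuts_shuffled.jsonl)
    n_test=$(( total / 10 ))    # hold out roughly 10% as a test set
    head -n "$n_test" cuts_shuffled.jsonl | gzip -c > data/fbank/cuts_test.jsonl.gz
    tail -n +"$(( n_test + 1 ))" cuts_shuffled.jsonl | gzip -c > data/fbank/cuts_train.jsonl.gz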
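
Regarding question 5, if the word does need to be added, we assume the entry would simply concatenate the per-character pinyin from the entries above (the tones would need checking against our audio, since 的 has several pronunciations, and the lexicon path here is illustrative):

    # Hypothetical entry: 对 (d ui4) + 的 (d e5) combined into one word.
    echo "对的 d ui4 d e5" >> data/lang_phone/lexicon.txt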

Thanks in advance.

daocunyang commented 3 months ago

Sorry for so many (possibly silly) questions, but here is another one, regarding training:

We are trying to fine-tune the model sherpa-onnx-streaming-zipformer-bilingual-zh-en-2023-02-20, which we found was converted from here. The latter mentions the training command ./pruned_transducer_stateless7_streaming/train.py, which I assume is equivalent to this file.

We hope to continue training from the existing pretrained.pt file or epoch-99.pt in here. How can we do that? From this section of the doc, it seems we can specify --start-epoch 100 to resume training from epoch-99.pt. Is that correct?
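
For reference, the resume invocation we are considering looks like this; as we read the doc, --start-epoch 100 makes train.py load epoch-99.pt from --exp-dir and continue from it (the exp dir and the remaining flag values below are illustrative):

    # Sketch: resume training from exp/epoch-99.pt.
    ./pruned_transducer_stateless7_streaming/train.py \
      --world-size 1 \
      --num-epochs 110 \
      --start-epoch 100 \
      --exp-dir pruned_transducer_stateless7_streaming/exp \
      --max-duration 300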

@JinZr Could you take another look when you get a chance? Thanks a lot.

marcoyang1998 commented 3 months ago

If you want to continue training your model on your own data, I would recommend using finetune.py.
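
A minimal sketch of what that could look like (the script path and the flag names are assumptions based on icefall's fine-tuning recipes; run finetune.py --help for your model to confirm):

    # Sketch only: initialize from the released checkpoint and fine-tune.
    ./pruned_transducer_stateless7_streaming/finetune.py \
      --world-size 1 \
      --num-epochs 10 \
      --exp-dir pruned_transducer_stateless7_streaming/exp_finetune \
      --finetune-ckpt path/to/pretrained.pt \
      --max-duration 300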