Closed (daocunyang closed this 1 week ago)
Sorry for asking so many (silly) questions, but here is another one regarding training.
We are trying to fine-tune the model sherpa-onnx-streaming-zipformer-bilingual-zh-en-2023-02-20, which we found was converted from here. The latter mentions the following training command: ./pruned_transducer_stateless7_streaming/train.py, which I assume is equivalent to this file.
We hope to continue training from the existing pretrained.pt file or epoch-99.pt in here. How can we do that? From this section of the doc, it seems we can specify --start-epoch 100 to resume training from epoch-99.pt. Is that correct?
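To make the question concrete, here is a minimal sketch of the naming convention it relies on: a resume at epoch N loads the checkpoint saved at the end of epoch N-1. The function name and the "exp" directory are hypothetical and exist only for illustration; this is not the recipe's actual code, and the real train.py should be checked before relying on it.

```python
from pathlib import Path
from typing import Optional


def checkpoint_to_resume_from(exp_dir: str, start_epoch: int) -> Optional[Path]:
    """Return the checkpoint a resume at `start_epoch` would load.

    Assumption (not verified against the recipe): checkpoints are saved as
    epoch-<N>.pt in the experiment directory, and starting at epoch N means
    loading epoch-<N-1>.pt. A start epoch of 1 means training from scratch.
    """
    if start_epoch <= 1:
        return None  # nothing to load, fresh run
    return Path(exp_dir) / f"epoch-{start_epoch - 1}.pt"


# Hypothetical experiment directory, for illustration only.
ckpt = checkpoint_to_resume_from("exp", 100)
print(ckpt.name)
```

Under this assumption, epoch-99.pt would have to sit in the experiment directory passed to the training script for a start epoch of 100 to pick it up.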
@JinZr Could you take another look when you get a chance? Thanks a lot.
If you want to continue training your model on your own data, I would recommend using finetune.py.
Hi, I'm opening a new issue since the old one has been closed.
We are currently writing our own prepare.sh to train an ASR model on our own Chinese audio data, following the example of aishell's prepare.sh. Given our lack of experience, we are unsure about some of its contents. Our questions are below:
1. What role does vocab_sizes play, and how do we decide what value to assign to it? Do we need it?
2. Looking at stage 5 to stage 8 of Aishell's prepare.sh, from what I can tell, we need to replace aishell_transcript_v0.8.txt (line 151) with our own text file, correct? Other than that, is there anything else we need to modify to prepare our own data during these stages?
3. We currently have only a few hundred audio files for training. How do you suggest we divide the data into training and test sets? I'm thinking of using most or possibly all of them for training, and few or even none of them for the test set.
4. Just to confirm, we can remove the part related to Whisper large-v3 at the end of prepare.sh, since we are not using Whisper?
5. We plan to use the lexicon.txt file from Aishell, but we notice that certain words that are important to us are missing from the current lexicon.txt. For example, we want to add the word "对的" to lexicon.txt. Is it necessary to add it, though? I noticed that the lexicon.txt from Aishell already contains the following, which are the parts that make up the word "对的":
Thanks in advance.
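For the lexicon question, a small check like the one below can show whether a word is itself an entry and whether each of its characters is. It is only a sketch: it assumes each lexicon line starts with the word followed by whitespace-separated pronunciation tokens, the function names are hypothetical, and the sample entries use placeholder tokens rather than real Aishell pronunciations.

```python
def load_lexicon_words(lines):
    """Collect the first whitespace-separated token of every non-empty line.

    Assumption: each lexicon line has the form "<word> <token> <token> ...".
    """
    words = set()
    for line in lines:
        fields = line.split()
        if fields:
            words.add(fields[0])
    return words


def lexicon_coverage(word, lexicon_words):
    """Return (word is an entry, list of its characters that are not entries)."""
    missing_chars = [ch for ch in word if ch not in lexicon_words]
    return word in lexicon_words, missing_chars


# Placeholder entries for illustration only; p1..p4 are not real pronunciations.
sample_lines = ["对 p1 p2", "的 p3 p4"]
words = load_lexicon_words(sample_lines)

in_lexicon, missing = lexicon_coverage("对的", words)
print(in_lexicon, missing)
```

To run it against a real file, pass an open file handle instead of the sample list, for example `load_lexicon_words(open("lexicon.txt", encoding="utf-8"))`. The check only reports coverage; whether character-level entries are enough depends on how the recipe builds its modeling units, which this sketch does not decide.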