Closed tskim9439 closed 1 year ago
Hi,
Thanks for your attention. For preparing the text data, you can follow these steps:
- Get the raw text, such as the LibriSpeech-LM corpus, and convert it to lowercase.
- Encode the text with the SPM model using spm_encode (with output_format set to piece).
- Use fairseq-preprocess to generate the bin and idx files. For example, the script can be:
TEXT=
DEST_DIR=
DICT=
fairseq-preprocess \
    --only-source \
    --trainpref $TEXT/train.txt \
    --validpref $TEXT/valid.txt \
    --destdir ${DEST_DIR} \
    --srcdict ${DICT} \
    --workers 20
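The first two steps in the list above (lowercasing and SPM encoding) could be sketched roughly as follows. This is only an illustration, not the authors' exact pipeline: the function names and the RAW_TEXT, SPM_MODEL, and OUT_TEXT variables are placeholders introduced here, and the sketch assumes the spm_encode binary from SentencePiece is on your PATH.

```shell
# Sketch of steps 1 and 2: lowercase the raw text, then encode it into SPM pieces.
# RAW_TEXT, SPM_MODEL and OUT_TEXT are placeholder names, not taken from the thread.

# Step 1: lowercase everything read from stdin and write it to stdout.
lowercase_corpus() {
    tr '[:upper:]' '[:lower:]'
}

# Step 2: encode stdin into space-separated SentencePiece pieces.
# Assumes spm_encode is installed; $1 is the path to the SPM model file.
encode_pieces() {
    local spm_model="$1"
    spm_encode --model="$spm_model" --output_format=piece
}

# Example wiring (uncomment and fill in the paths):
# lowercase_corpus < "$RAW_TEXT" | encode_pieces "$SPM_MODEL" > "$OUT_TEXT"
```

The encoded output file would then serve as train.txt or valid.txt for fairseq-preprocess.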
You can use the transcriptions of LibriSpeech dev-clean and dev-other as the valid set for the text data. Please change the arguments to match your setup.
Finally, you will get the bin and idx files for the train and valid sets. Please note that there is no manifest file for text.
You can save the speech manifest and the text idx/bin files in one directory. That should complete the data preparation part.
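A quick sanity check on the output directory might look like the sketch below. It rests on an assumption not stated in this thread: that with --only-source and no language flags, fairseq-preprocess names its outputs train.bin, train.idx, valid.bin, and valid.idx. The check_text_data function is a hypothetical helper, not part of fairseq.

```shell
# Sketch: verify that a directory holds the binarized text files.
# Assumption: the outputs are named train.bin, train.idx, valid.bin, valid.idx.
check_text_data() {
    local dest_dir="$1"
    local missing=0
    local name
    for name in train.bin train.idx valid.bin valid.idx; do
        # -s is true only if the file exists and is non-empty.
        if [ ! -s "$dest_dir/$name" ]; then
            echo "missing or empty: $dest_dir/$name" >&2
            missing=1
        fi
    done
    return "$missing"
}

# Example: check_text_data "$DEST_DIR" && echo "text data looks complete"
```

If your fairseq version uses different file names, adjust the list in the loop accordingly.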
Hello, I wonder whether there are any restrictions on how the idx/bin files are generated, because when I pass those files to fairseq-train I get this:
Traceback (most recent call last):
File "/data/zqr/anaconda3/envs/python3.10.0/bin/fairseq-train", line 8, in
Hi,
I had the same problem. I changed my NumPy version to 1.23.5 and it worked.
Hello. First of all, thank you for your great work.
Unfortunately, I have some issues with preparing the text data for pre-training and ASR fine-tuning. I have followed the instructions provided, but I cannot figure out how I should preprocess the text data using SPM and fairseq. How can I create text_train.tsv/text_valid.tsv? I also have difficulty creating the label data for the text data. What format should I use? Can you provide more details or examples of the manifest for the text?