microsoft / SpeechT5

Unified-Modal Speech-Text Pre-Training for Spoken Language Processing
MIT License
1.1k stars 113 forks

Text data preparation #9

Closed tskim9439 closed 1 year ago

tskim9439 commented 1 year ago

Hello. First of all, thank you for your great work.

Unfortunately, I am having trouble preparing the text data for pre-training and ASR fine-tuning. I have followed the instructions provided, but I cannot figure out how to preprocess the text data using SPM and fairseq. How can I create text_train.tsv/text_valid.tsv? I am also having difficulty creating the label data for the text: what format should it use? Could you provide more details or an example of the text manifest?

Ajyy commented 1 year ago

Hi,

Thanks for your attention. For preparing the text data, you can follow these steps:

  1. Get the raw text, e.g. the LibriSpeech-LM corpus, and convert it to lowercase.
  2. Use the SPM model to process the text with spm_encode (the output_format is piece).
  3. Use fairseq-preprocess to generate the bin and idx files. For example, the script can be:

     TEXT=
     DEST_DIR=
     DICT=
     fairseq-preprocess \
         --only-source \
         --trainpref $TEXT/train.txt \
         --validpref $TEXT/valid.txt \
         --destdir ${DEST_DIR} \
         --srcdict ${DICT} \
         --workers 20

    You can use the transcriptions of LibriSpeech dev-clean and dev-other as the valid set for the text data, and please change the arguments accordingly.

Finally, you will get the bin and idx files for the train and valid sets. Please note that there is no manifest file for the text.

You can save the manifest of the speech and the idx/bin files of the text in one directory. That should complete the data preparation part.
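The three steps above can be sketched end to end as a shell script. All paths and filenames here are placeholders (assumptions for illustration), not files shipped with the repo; substitute your own corpus, SPM model, and dictionary.

```shell
#!/usr/bin/env bash
# Sketch of the text-preparation pipeline described above (assumed paths).
TEXT=/path/to/raw_text        # directory containing the raw corpus files
SPM_MODEL=/path/to/spm.model  # SentencePiece model matching the dictionary
DEST_DIR=/path/to/text_bin    # output directory for the idx/bin files
DICT=/path/to/dict.txt        # fairseq dictionary (srcdict)

# Step 1: lowercase the raw text.
tr '[:upper:]' '[:lower:]' < "$TEXT/train.raw" > "$TEXT/train.lc"
tr '[:upper:]' '[:lower:]' < "$TEXT/valid.raw" > "$TEXT/valid.lc"

# Step 2: tokenize with SentencePiece, emitting pieces (not ids).
spm_encode --model="$SPM_MODEL" --output_format=piece \
    < "$TEXT/train.lc" > "$TEXT/train.txt"
spm_encode --model="$SPM_MODEL" --output_format=piece \
    < "$TEXT/valid.lc" > "$TEXT/valid.txt"

# Step 3: binarize into idx/bin files with the fixed dictionary.
fairseq-preprocess \
    --only-source \
    --trainpref "$TEXT/train.txt" \
    --validpref "$TEXT/valid.txt" \
    --destdir "$DEST_DIR" \
    --srcdict "$DICT" \
    --workers 20
```

After step 3, `$DEST_DIR` should contain train.bin/train.idx and valid.bin/valid.idx, which sit alongside the speech manifest as described above.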

Lemonaddeee commented 1 year ago


Hello, I wonder whether there is any restriction on the process of generating the idx/bin files. When I use those files with fairseq-train, I get this:

    Traceback (most recent call last):
      File "/data/zqr/anaconda3/envs/python3.10.0/bin/fairseq-train", line 8, in <module>
        sys.exit(cli_main())
      File "/data/zqr/anaconda3/envs/python3.10.0/lib/python3.10/site-packages/fairseq_cli/train.py", line 557, in cli_main
        distributed_utils.call_main(cfg, main)
      File "/data/zqr/anaconda3/envs/python3.10.0/lib/python3.10/site-packages/fairseq/distributed/utils.py", line 369, in call_main
        main(cfg, **kwargs)
      File "/data/zqr/anaconda3/envs/python3.10.0/lib/python3.10/site-packages/fairseq_cli/train.py", line 164, in main
        extra_state, epoch_itr = checkpoint_utils.load_checkpoint(
      File "/data/zqr/anaconda3/envs/python3.10.0/lib/python3.10/site-packages/fairseq/checkpoint_utils.py", line 272, in load_checkpoint
        epoch_itr = trainer.get_train_iterator(
      File "/data/zqr/anaconda3/envs/python3.10.0/lib/python3.10/site-packages/fairseq/trainer.py", line 695, in get_train_iterator
        batch_iterator = self.task.get_batch_iterator(
      File "/data/zqr/anaconda3/envs/python3.10.0/lib/python3.10/site-packages/fairseq/tasks/fairseq_task.py", line 295, in get_batch_iterator
        batch_sampler = dataset.batch_by_size(
      File "/data/zqr/tmp/pycharm_project_2/speecht5_raw/SpeechT5/SpeechT5/speecht5/data/multitask_dataset.py", line 207, in batch_by_size
        batch_sampler = np.array(batch_sampler)
    ValueError: setting an array element with a sequence. The requested array has an inhomogeneous shape after 1 dimensions. The detected shape was (371,) + inhomogeneous part.

djanibekov commented 1 year ago

Hi

I had the same problem. I changed my numpy version to 1.23.5 and it worked.
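For context: the error occurs because `batch_by_size` produces batches of unequal length, and NumPy 1.24+ refuses to build an array from such ragged lists without an explicit `dtype=object`. Pinning numpy to 1.23.5 works, as noted above; a minimal sketch of the alternative (an assumed one-line change at multitask_dataset.py line 207, not an official patch):

```python
import numpy as np

# batch_by_size can yield batches of different sizes (ragged lists).
batch_sampler = [[0, 1, 2], [3, 4], [5]]

# On NumPy >= 1.24, np.array(batch_sampler) raises the ValueError shown
# in the traceback above. Passing dtype=object builds a 1-D object array
# of lists instead, and behaves the same on older NumPy versions.
batch_sampler = np.array(batch_sampler, dtype=object)

print(len(batch_sampler))   # number of batches is preserved: 3
```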