openspeech-team / openspeech

Open-Source Toolkit for End-to-End Speech Recognition leveraging PyTorch-Lightning and Hydra.
https://openspeech-team.github.io/openspeech/
MIT License
670 stars 112 forks

First tasks using Openspeech #214

Closed filbattaglia closed 6 months ago

filbattaglia commented 7 months ago

Good morning, I've installed Openspeech for the first time. I've tried to run my first training on the LibriSpeech dataset, but so far without success. I used the following command line:

```shell
python ../../openspeech_cli/hydra_train.py dataset=librispeech \
    dataset.dataset_download=True \
    dataset.dataset_path="/home/filippo/LibriSpeech_Dw/" \
    dataset.manifest_file_path="/home/filippo/LibriSpeech_Manifest/" \
    tokenizer=libri_character \
    model=contextnet \
    audio=fbank \
    lr_scheduler=warmup_reduce_lr_on_plateau \
    trainer=gpu \
    criterion=ctc
```

Openspeech downloads the LibriSpeech .flac dataset into the /LibriSpeech_Dw subfolder, but after this it fails with the following message:

FileNotFoundError: [Errno 2] No such file or directory: '../../../LibriSpeech/libri_labels.csv'

I don't know where to find the missing libri_labels.csv file. It does not seem to be included in the standard Openspeech distribution, and I don't know where it can be downloaded.

Can you help me with some information? Thanks in advance for your help :)

upskyy commented 7 months ago

@filbattaglia Thank you for your interest in Openspeech. The LibriSpeech data is downloaded and preprocessed, and the libri_labels.csv file is saved to the specified path.

filbattaglia commented 7 months ago

> @filbattaglia Thank you for your interest in openspeech. LibriSpeech data is downloaded and preprocessed, and the libri_labels.csv file is saved in the specified path.

Thanks for your answer. Unfortunately, I am unable to find a file named libri_labels.csv anywhere in my file system. Can you tell me in which folder it should be located on disk?

Could the issue be caused by Openspeech skipping the preprocessing step for the LibriSpeech data?

upskyy commented 7 months ago

@filbattaglia

When executing the command, add the path where the vocab file was saved, as follows:

tokenizer.vocab_path=$VOCAB_FILE_PATH
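For reference, a minimal sketch of the original training command with the suggested `tokenizer.vocab_path` override added. The paths are the ones from the original report; the exact location and name of the vocab file produced by preprocessing is an assumption here and should be adjusted to wherever it was actually written:

```shell
# Assumed location of the vocab/labels file generated during preprocessing;
# adjust this to the actual path on your system.
VOCAB_FILE_PATH="/home/filippo/LibriSpeech_Manifest/libri_labels.csv"

python ../../openspeech_cli/hydra_train.py dataset=librispeech \
    dataset.dataset_download=True \
    dataset.dataset_path="/home/filippo/LibriSpeech_Dw/" \
    dataset.manifest_file_path="/home/filippo/LibriSpeech_Manifest/" \
    tokenizer=libri_character \
    tokenizer.vocab_path=$VOCAB_FILE_PATH \
    model=contextnet \
    audio=fbank \
    lr_scheduler=warmup_reduce_lr_on_plateau \
    trainer=gpu \
    criterion=ctc
```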