Support for partial data usage for LibriSpeech

kushal-g commented 2 years ago

It would be useful to be able to download only part of the dataset and train on that, instead of having to download the entire dataset. If that isn't possible, the documentation should clearly describe what the dataset directory structure should look like, so that it's easier for us to use our own partial dataset. I'm currently trying to train an RNN-T model and I keep running into issues with the directory structure.

Command that I'm using:

python ./openspeech_cli/hydra_train.py dataset=librispeech dataset.dataset_download=False dataset.dataset_path=/home/guest/flsp/SpeechToText/RNN-T/openspeech/LIBRISPEECH_AUTO_DOWNLOAD/LibriSpeech dataset.manifest_file_path=/home/guest/flsp/SpeechToText/RNN-T/openspeech/LIBRISPEECH_AUTO_MANIFEST tokenizer=libri_subword model=rnn_transducer audio=melspectrogram lr_scheduler=warmup_reduce_lr_on_plateau trainer=gpu
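For reference, the upstream LibriSpeech archives unpack to `LibriSpeech/<subset>/<speaker>/<chapter>/`, where each chapter directory holds `<speaker>-<chapter>-<utterance>.flac` files and one `<speaker>-<chapter>.trans.txt` transcript. The sketch below assumes only that upstream layout; whether openspeech's preprocessing expects exactly this, or requires particular subsets to be present, is not confirmed in this thread. The function name `summarize_librispeech` is hypothetical and not part of openspeech. It may help to see what a partial copy actually contains before pointing `dataset.dataset_path` at it.

```python
from pathlib import Path


def summarize_librispeech(root):
    """Count .flac files per subset and list chapter dirs lacking a transcript.

    Assumes the upstream LibriSpeech layout (not verified against openspeech):
    <root>/<subset>/<speaker>/<chapter>/*.flac, plus one
    <speaker>-<chapter>.trans.txt file in each chapter directory.
    """
    root = Path(root)
    report = {}
    # Each immediate subdirectory is treated as a subset (e.g. train-clean-100).
    for subset in sorted(p for p in root.iterdir() if p.is_dir()):
        flac_count = 0
        missing = []
        # Two levels below the subset: <speaker>/<chapter>.
        for chapter in sorted(subset.glob("*/*")):
            if not chapter.is_dir():
                continue
            flacs = list(chapter.glob("*.flac"))
            flac_count += len(flacs)
            expected = chapter / f"{chapter.parent.name}-{chapter.name}.trans.txt"
            # Only flag chapters that have audio but no transcript file.
            if flacs and not expected.is_file():
                missing.append(str(chapter.relative_to(root)))
        report[subset.name] = {
            "utterances": flac_count,
            "missing_transcripts": missing,
        }
    return report


# Example call (the path is a placeholder):
# print(summarize_librispeech("/path/to/LibriSpeech"))
```

An empty `missing_transcripts` list for every subset suggests the partial copy at least matches the upstream layout; any remaining errors would then more likely come from which subsets the training script looks for.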

sooftware commented 2 years ago

There have been many questions about the directory structure, so I think I should document it.
Please bear with me for a while.

kushal-g commented 2 years ago

What is the status of this?

openspeech-team / openspeech

Support for partial data usage for LibriSpeech #105