openspeech-team / openspeech

Open-Source Toolkit for End-to-End Speech Recognition leveraging PyTorch-Lightning and Hydra.
https://openspeech-team.github.io/openspeech/
MIT License

What is the correct format of the path for datasets and manifest files (LibriSpeech) #177

Closed AnneCCyy closed 1 year ago

AnneCCyy commented 1 year ago

❓ Questions & Help

I tried to run the example (attached at the end) and found that the downloaded dataset ends up in a newly created folder, "outputs/ #date&time". Looking into the downloaded dataset, I cannot tell what format is expected for the parameters "$DATASET_PATH" and "$MANIFEST_FILE_PATH".

I got stuck at the tokenizer step with "can't find the file - sp.model" (similar to issue #144 but not the same), and the correct folder structure and paths for the downloaded dataset are not clear to me. In particular, where is the manifest file located? It should be created by the code during preprocessing, but I did not see it.

Any specific instructions about the correct folder structure and paths for the dataset and manifest file, along with sample values for "$DATASET_PATH" and "$MANIFEST_FILE_PATH", would be greatly appreciated.

Details

---- Error message ----

Traceback (most recent call last):
  File "./openspeech_cli/hydra_train.py", line 46, in hydra_main
    tokenizer = TOKENIZER_REGISTRY[configs.tokenizer.unit](configs)
  File "/home/ai-labs-leiachen/Projects_Leian/openspeech/openspeech/tokenizers/librispeech/subword.py", line 69, in __init__
    self.sp.Load(os.path.join(configs.tokenizer.vocab_path, f"{SENTENCEPIECE_MODEL_NAME}.model"))
  File "/home/ai-labs-leiachen/.conda/envs/ASR_3.7/lib/python3.7/site-packages/sentencepiece/__init__.py", line 905, in Load
    return self.LoadFromFile(model_file)
  File "/home/ai-labs-leiachen/.conda/envs/ASR_3.7/lib/python3.7/site-packages/sentencepiece/__init__.py", line 310, in LoadFromFile
    return _sentencepiece.SentencePieceProcessor_LoadFromFile(self, arg)
OSError: Not found: "../../../LibriSpeech/sp.model": No such file or directory Error #2
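The traceback shows the tokenizer joining configs.tokenizer.vocab_path with the model file name and failing because the result does not exist. A small pre-flight check along these lines can confirm where a given vocab path actually points before training starts. This is only a sketch: find_sp_model is a hypothetical helper, not part of openspeech, and the file name "sp.model" is taken from the error message above.

```python
import os

# Assumption: the file name comes from the error message, not from the openspeech source.
SP_MODEL_FILENAME = "sp.model"


def find_sp_model(vocab_path):
    """Return the absolute path of sp.model under vocab_path, or None if it is missing.

    Hypothetical helper for debugging; it only inspects the file system.
    """
    # abspath resolves relative paths against the current working directory,
    # which makes it visible where a path such as "../../../LibriSpeech" ends up.
    candidate = os.path.abspath(os.path.join(vocab_path, SP_MODEL_FILENAME))
    return candidate if os.path.isfile(candidate) else None
```

Calling find_sp_model("../../../LibriSpeech") from the directory in which training runs returns None when the file is absent. Since relative paths are resolved against the current working directory, it may also be worth checking whether that directory is the "outputs/ #date&time" folder mentioned above.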

---- Example by author ----

$ python3 ./openspeech_cli/hydra_train.py \
    dataset=librispeech \
    dataset.dataset_download=True \
    dataset.dataset_path=$DATASET_PATH \
    dataset.manifest_file_path=$MANIFEST_FILE_PATH \
    tokenizer=libri_subword \
    model=conformer_lstm \
    audio=fbank \
    lr_scheduler=warmup_reduce_lr_on_plateau \
    trainer=gpu \
    criterion=cross_entropy

upskyy commented 1 year ago

Here is the code that creates the tokenizer's sp.model.

$DATASET_PATH should have the following structure:

$DATASET_PATH 
├── BOOKS.TXT
├── CHAPTERS.TXT
├── LICENSE.TXT
├── LibriSpeech
├── README.TXT
├── SPEAKERS.TXT
├── dev-clean
├── dev-other
├── libri_character_labels.csv
├── libri_character_manifest.txt
├── libri_test_character_manifest.txt
├── results_libri_character.txt
├── test-clean
├── test-other
└── train-960
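To compare a local folder against the listing above, a check like the following could be used. It is a minimal sketch: missing_splits is a hypothetical helper, and treating these five directories as the required audio splits is an assumption based only on the listing, not on the openspeech code.

```python
import os

# Directory names copied from the listing above (assumed to be the audio splits).
EXPECTED_SPLITS = ("dev-clean", "dev-other", "test-clean", "test-other", "train-960")


def missing_splits(dataset_path):
    """Return the expected split directories that are absent from dataset_path."""
    return [
        name
        for name in EXPECTED_SPLITS
        if not os.path.isdir(os.path.join(dataset_path, name))
    ]
```

An empty list means every split directory from the listing is present directly under the given path.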

$MANIFEST_FILE_PATH can be set to any path where you want the manifest file to be saved.

As in #144, when librispeech is downloaded and preprocessed, the directory structure ends up one level lower than expected, which is why the error appears. I'll fix it when I have time.
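Until that is fixed, one possible workaround is to look for the data both at the given path and one level below it. The sketch below assumes the extra level is the "LibriSpeech" folder seen in the listing and uses "dev-clean" as a marker directory; both choices are assumptions, and resolve_dataset_root is a hypothetical helper rather than an openspeech function.

```python
import os


def resolve_dataset_root(dataset_path, marker="dev-clean", nested_name="LibriSpeech"):
    """Return the directory that directly contains `marker`, or None if not found.

    Checks dataset_path first, then dataset_path/nested_name (one level lower).
    """
    if os.path.isdir(os.path.join(dataset_path, marker)):
        return dataset_path
    nested = os.path.join(dataset_path, nested_name)
    if os.path.isdir(os.path.join(nested, marker)):
        return nested
    return None
```

The returned directory, if any, is a candidate value for dataset.dataset_path. A result of None suggests the download or extraction went somewhere else entirely.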

upskyy commented 1 year ago

If the problem persists, please reopen this issue.