openspeech-team / openspeech

Open-Source Toolkit for End-to-End Speech Recognition leveraging PyTorch-Lightning and Hydra.
https://openspeech-team.github.io/openspeech/
MIT License
677 stars, 114 forks

How to preprocess datasets? #131

Closed: wblwty closed this issue 2 years ago

wblwty commented 2 years ago

❓ Questions & Help

I'm sorry to bother you, but I'm having some problems.

How does OpenSpeech preprocess the dataset? I can see the preprocessing code, but I can't find an entry point that runs it. How should I use it to generate the manifest?

I sent an email, but I didn't get a reply.


upskyy commented 2 years ago

I have responded to all emails. Which dataset do you want to use?

wblwty commented 2 years ago

I need the manifest for KsponSpeech. If possible, please send it to wb724483933@163.com

wblwty commented 2 years ago

Can I generate the manifest myself? I can't find the code for this part. Can you tell me about it? Thank you.

wblwty commented 2 years ago

I tried to run lit_data_module.py to get the manifest, but it failed with this error:

/home/iip/anaconda3/envs/openspeech/bin/python3.7 /home/iip/wb/openspeech/openspeech/datasets/ksponspeech/lit_data_module.py
Traceback (most recent call last):
  File "/home/iip/wangbo/openspeech/openspeech/datasets/ksponspeech/lit_data_module.py", line 41, in <module>
    class LightningKsponSpeechDataModule(pl.LightningDataModule):
  File "/home/iip/wangbo/openspeech/openspeech/datasets/__init__.py", line 46, in register_data_module_cls
    raise ValueError(f"Cannot register duplicate data module ({name})")
ValueError: Cannot register duplicate data module (ksponspeech)

Process finished with exit code 1
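
For readers hitting the same error: a likely cause is that the class definition in lit_data_module.py runs twice, once when the openspeech package imports it and registers it, and once more when the same file is executed directly as a script. The sketch below reproduces that pattern with a simplified, hypothetical registry. It is not the actual OpenSpeech source; only the error message and the names register_data_module_cls and LightningKsponSpeechDataModule are taken from the traceback, and everything else (DATA_MODULE_REGISTRY, register_data_module, define_module) is made up for illustration.

```python
# Simplified, hypothetical registry; NOT the actual OpenSpeech implementation.
DATA_MODULE_REGISTRY = {}


def register_data_module(name):
    """Return a class decorator that stores the decorated class under `name`."""

    def register_data_module_cls(cls):
        # Refuse to overwrite an existing entry, as the traceback suggests.
        if name in DATA_MODULE_REGISTRY:
            raise ValueError(f"Cannot register duplicate data module ({name})")
        DATA_MODULE_REGISTRY[name] = cls
        return cls

    return register_data_module_cls


def define_module():
    """Stand-in for executing the body of lit_data_module.py once."""

    @register_data_module("ksponspeech")
    class LightningKsponSpeechDataModule:
        pass

    return LightningKsponSpeechDataModule


# First execution: e.g. the package importing the module and registering it.
define_module()

# Second execution: e.g. the same file being run directly as a script.
try:
    define_module()
except ValueError as err:
    duplicate_error = str(err)
    print(duplicate_error)
```

If this is indeed the cause, the module is not meant to be run as a script; the data module is expected to be used through the training entry point instead.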

upskyy commented 2 years ago

When you run hydra_train.py, a manifest file is created at the path you specified. [LINK]
KsponSpeech also requires permission from AI Hub, so please send an e-mail that includes a screenshot of the approval to openspeech.team@gmail.com. [LINK]
This is the code that preprocesses the data to generate a manifest file.

I recommend that you read the README.md carefully. Thanks.

wblwty commented 2 years ago

OK, thank you.

afterrealism commented 2 years ago

I also got stuck on this point and ended up creating the manifest file myself :D Maybe removing the dataset.manifest_file_path option from the CLI script and creating the manifest in dataset.dataset_path would be a way to go.
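
For anyone who also wants to create a manifest by hand, here is a minimal sketch of the idea. It rests on several assumptions that are not confirmed anywhere in this thread: that each audio file has a transcript with the same name and a .txt extension next to it, that transcripts are UTF-8, and that a manifest line is simply the relative audio path and the transcript separated by a tab. The manifest that OpenSpeech itself generates may contain different or additional columns, so treat build_manifest and its parameters as hypothetical and check the preprocessing code in the repository for the real format.

```python
import tempfile
from pathlib import Path


def build_manifest(dataset_path, manifest_file_path, audio_ext=".pcm"):
    """Write one 'relative_audio_path<TAB>transcript' line per audio file.

    Hypothetical helper: the layout and the line format are assumptions,
    not the format produced by OpenSpeech.
    """
    dataset_path = Path(dataset_path)
    rows = []
    for audio_file in sorted(dataset_path.rglob(f"*{audio_ext}")):
        transcript_file = audio_file.with_suffix(".txt")
        if not transcript_file.exists():
            continue  # skip audio files that have no transcript
        text = transcript_file.read_text(encoding="utf-8").strip()
        rows.append((audio_file.relative_to(dataset_path).as_posix(), text))

    with open(manifest_file_path, "w", encoding="utf-8", newline="") as f:
        for rel_path, text in rows:
            f.write(f"{rel_path}\t{text}\n")
    return len(rows)


# Tiny self-contained demo on a throwaway directory.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "a.pcm").write_bytes(b"")
    (root / "a.txt").write_text("hello world\n", encoding="utf-8")
    (root / "b.pcm").write_bytes(b"")  # no transcript, so it is skipped
    manifest = root / "manifest.txt"
    demo_count = build_manifest(root, manifest)
    demo_manifest = manifest.read_text(encoding="utf-8")
```

Writing the manifest inside the dataset directory by default, as suggested above, would only need manifest_file_path to fall back to a file under dataset_path.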