Simple Dataset generator?

metavoiceio / metavoice-src

Foundational model for human-like, expressive TTS

https://themetavoice.xyz/

Apache License 2.0

3.88k stars, 658 forks

Simple Dataset generator? #122

Open MethanJess opened 7 months ago

MethanJess commented 7 months ago

Hi, I already know about Speech Dataset Generator. However, it's bloated with features I don't need, and I couldn't get it to work on my system.

So, does anyone have a simple script that splits an audio file into segments, converts the audio to the right sample rate, and then uses WhisperX large-v3 to transcribe the segments to make "sample_dataset.csv" and "sample_val_dataset.csv" (plus anything else that's needed)?

I tried making my own, but I have no idea how to make the validation file...
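For reference, the splitting step described above can be sketched with Python's standard `wave` module alone. This is a minimal sketch, not the project's tooling: it cuts a WAV file into fixed-length segments and deliberately leaves out resampling and WhisperX transcription. The function name `split_wav`, the 10-second default, and the output naming scheme are all illustrative assumptions.

```python
import wave
from pathlib import Path


def split_wav(src_path, out_dir, segment_seconds=10.0):
    """Cut a WAV file into fixed-length segments (hypothetical helper).

    Returns the list of segment paths. The last segment may be shorter.
    No resampling is done: each segment keeps the source sample rate.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    stem = Path(src_path).stem
    written = []
    with wave.open(str(src_path), "rb") as src:
        params = src.getparams()
        frames_per_segment = int(src.getframerate() * segment_seconds)
        if frames_per_segment <= 0:
            raise ValueError("segment_seconds is too small")
        index = 0
        while True:
            frames = src.readframes(frames_per_segment)
            if not frames:
                break
            out_path = out_dir / f"{stem}_{index:04d}.wav"
            with wave.open(str(out_path), "wb") as dst:
                # Copy channels, sample width, and rate from the source;
                # the frame count in the header is corrected on write.
                dst.setparams(params)
                dst.writeframes(frames)
            written.append(out_path)
            index += 1
    return written
```

Fixed-length cuts can land mid-word, so a real pipeline would more likely split on silence or on the transcriber's own segment timestamps.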

vatsalaggarwal commented 7 months ago

The validation file should have the same format as sample_dataset.csv. Once you've generated a whole dataset and manually split it into a large training set and a small validation set, you can place the respective file IDs into the two CSVs.
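As a rough illustration of that split, the sketch below shuffles a list of rows and writes a large training file and a small validation file in the same format. It assumes the CSVs are "|"-delimited with an `audio_files|captions` header, as in the repository's sample_dataset.csv; check that file before relying on this. The helper names `split_dataset` and `write_csv` and the 10% validation fraction are illustrative, not part of MetaVoice.

```python
import csv
import random


def split_dataset(rows, val_fraction=0.1, seed=0):
    """Shuffle rows reproducibly and return (train_rows, val_rows)."""
    if not rows:
        raise ValueError("no rows to split")
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    # Keep at least one row for validation, even for tiny datasets.
    n_val = max(1, int(len(shuffled) * val_fraction))
    return shuffled[n_val:], shuffled[:n_val]


def write_csv(path, rows, header=("audio_files", "captions")):
    """Write rows as a '|'-delimited CSV (assumed MetaVoice layout)."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f, delimiter="|")
        writer.writerow(header)
        writer.writerows(rows)
```

A typical call would be `train, val = split_dataset(rows)`, followed by `write_csv("sample_dataset.csv", train)` and `write_csv("sample_val_dataset.csv", val)`. Because each row ends up in exactly one of the two lists, the training and validation sets stay disjoint.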

MethanJess commented 7 months ago

@vatsalaggarwal I'm really not sure what that means... but I've heard that some contributors to this project (@lucapericlp and @danablend) have their own dataset generators. Would they be willing to share theirs? (Or anyone else?)

Vijayvk9092 commented 6 months ago

Hello friends, how is everyone doing?

lucapericlp commented 5 months ago

Hey @MethanJess, sorry for the late reply. I've just followed a process similar to the one @vatsalaggarwal pointed out for putting together the datasets; I don't have any special generators of my own. If you're running into any issues in putting together a useful data pipeline, let us know & we'll see if we can help!

MethanJess commented 5 months ago

Hey @lucapericlp, I found this repository: https://github.com/daswer123/xtts-webui. It has a dataset generator that splits audio and transcribes each segment, producing a transcription file and a validation file. It was made for Coqui, but the format it creates is very similar to MetaVoice's, so with just a little editing it should work, right?
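The "little bit of editing" might look something like the sketch below. It rests on two unverified assumptions: that the Coqui-style metadata is a "|"-delimited file with `audio_file|text|speaker_name` columns, and that MetaVoice expects `audio_files|captions` rows where each caption is a path to a text file. The function name `convert_metadata` and the captions directory are hypothetical.

```python
import csv
from pathlib import Path


def convert_metadata(src_csv, dst_csv, captions_dir):
    """Convert assumed Coqui-style metadata to the assumed MetaVoice layout.

    Each transcript is written to its own .txt file, and the output CSV
    pairs every audio path with the path of its caption file.
    Returns the number of rows converted.
    """
    captions_dir = Path(captions_dir)
    captions_dir.mkdir(parents=True, exist_ok=True)
    count = 0
    with open(src_csv, newline="", encoding="utf-8") as src, \
            open(dst_csv, "w", newline="", encoding="utf-8") as dst:
        # QUOTE_NONE keeps literal quotation marks inside transcripts.
        reader = csv.DictReader(src, delimiter="|", quoting=csv.QUOTE_NONE)
        writer = csv.writer(dst, delimiter="|")
        writer.writerow(["audio_files", "captions"])
        for row in reader:
            audio = row["audio_file"]
            # Caption files are named after the audio file stem, so audio
            # files sharing a stem would overwrite each other's captions.
            caption_path = captions_dir / (Path(audio).stem + ".txt")
            caption_path.write_text(row["text"].strip(), encoding="utf-8")
            writer.writerow([audio, str(caption_path)])
            count += 1
    return count
```

Running it once on the training metadata and once on the evaluation metadata would give the two CSVs. The audio paths are copied unchanged, so they still need to resolve from wherever finetuning is run, and the sample rate still needs checking separately.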