Open MethanJess opened 7 months ago
the validation file should have the same format as sample_dataset.csv ... once you generate a whole dataset, and have split it into a large training set and small validation set manually, you can then place respective file ids into the csvs
@vatsalaggarwal Really not sure what that means... but I've heard that some contributors of this project (@lucapericlp and @danablend) have their own dataset generator, would it be fine if they could share theirs? (or anyone else?)
Hello friends, how is everyone doing?
Hey @MethanJess, sorry for the late reply. I just followed a process similar to the one @vatsalaggarwal pointed out for putting together the datasets, but I don't have any special generators of my own. If you're running into any issues in putting together a useful data pipeline, let us know & we'll see if we can help!
Hey @lucapericlp, I found this repository: https://github.com/daswer123/xtts-webui It has a dataset generator that splits audio into segments, transcribes each segment, and produces a validation file. It was made for Coqui, but the format it creates is very similar to the one MetaVoice uses, so with just a little bit of editing it should work, right?
Hi, I already know there's Speech Dataset Generator. However, it's way too bloated with features and I couldn't get it to work on my system.
So, does anyone have a simple script that splits an audio file into segments, converts the audio to the right sample rate, then uses WhisperX large-v3 to transcribe the segments to make "sample_dataset.csv" and "sample_val_dataset.csv" (and anything else, if there is any)?
I tried making my own but I have no idea how to make the validation file thing...
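For the validation file part, here is a minimal sketch of the manual split described above, assuming you already have one full dataset CSV in the same format as sample_dataset.csv. It treats the first line as the header and every other line as one opaque row, so it makes no assumption about the delimiter or column names. The function names, the file name full_dataset.csv and the 10% validation fraction are my own placeholders, not anything defined by the project.

```python
import random


def split_dataset(rows, val_fraction=0.1, seed=0):
    """Shuffle rows reproducibly and return (train_rows, val_rows)."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)
    # Keep at least one validation row whenever there is something to split.
    n_val = max(1, int(len(rows) * val_fraction)) if len(rows) > 1 else 0
    return rows[n_val:], rows[:n_val]


def write_split(src_path, train_path, val_path, val_fraction=0.1, seed=0):
    """Split one CSV into a large training CSV and a small validation CSV.

    Assumes the first line is a header and each later line is one complete
    row (no quoted fields spanning several lines).
    """
    with open(src_path, encoding="utf-8") as f:
        lines = [line for line in f.read().splitlines() if line.strip()]
    header, rows = lines[0], lines[1:]
    train, val = split_dataset(rows, val_fraction, seed)
    for path, subset in ((train_path, train), (val_path, val)):
        with open(path, "w", encoding="utf-8") as f:
            # Both output files repeat the original header line.
            f.write("\n".join([header] + subset) + "\n")


# Hypothetical usage; adjust the paths to wherever your files live:
# write_split("full_dataset.csv", "sample_dataset.csv", "sample_val_dataset.csv")
```

Because the shuffle uses a fixed seed, re-running it on the same input gives the same split, which helps keep validation clips out of the training set between runs.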