Open huydung179 opened 1 year ago
The dataset creation code is up at https://github.com/gitmylo/bark-data-gen
To get the semantics from a voice, you have to use a trained HuBERT quantizer model. See the problem? It can't be improved for a specific voice, because all you could train on is previous outputs.
To understand why it works, you need to understand how bark works: https://github.com/gitmylo/audio-webui/wiki/how-bark-works The quantizer model just converts recognized speech patterns into a format that bark understands and can complete. That is essentially what clones the voice.
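To make the quantizer step concrete, here is a minimal sketch of the core idea: continuous HuBERT-style feature vectors (one per audio frame) get mapped to discrete semantic token ids by nearest-centroid lookup. All names here are illustrative assumptions, not the actual API of the bark-voice-cloning-HuBERT-quantizer repo.

```python
import numpy as np

def quantize_features(features: np.ndarray, codebook: np.ndarray) -> np.ndarray:
    """Map each frame's feature vector to the id of its nearest codebook entry.

    features: (frames, dim) continuous features from a HuBERT-style encoder.
    codebook: (codes, dim) learned centroids; indices serve as semantic tokens.
    (Hypothetical sketch, not the repo's real quantizer.)
    """
    # Broadcast to (frames, codes, dim), then reduce to (frames, codes) distances.
    dists = np.linalg.norm(features[:, None, :] - codebook[None, :, :], axis=-1)
    return dists.argmin(axis=1)  # one discrete semantic token per frame

# Toy example: frames copied straight from the codebook, so lookups are exact.
rng = np.random.default_rng(0)
codebook = rng.normal(size=(3, 4))
features = np.vstack([codebook[2], codebook[0], codebook[0], codebook[1], codebook[2]])
tokens = quantize_features(features, codebook)
print(tokens.tolist())  # [2, 0, 0, 1, 2]
```

Bark can then be prompted with such a token sequence, which is why a quantizer trained to produce bark-compatible tokens effectively clones the voice.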
Dear gitmylo, I also want to know how to create semantic data from wav source files. I have gathered Korean wav files, and I need to create semantic data from them, then pre-train on both the semantic data and the wav files. Could you explain the details? I really appreciate your great work.
If you want to train, you'll need a text dataset in the language you want to train for; for example, you can modify the bark-data-gen code to load text files in another language. Then prepare the dataset and train, as explained in https://github.com/gitmylo/bark-voice-cloning-HuBERT-quantizer#how-do-i-train-it-myself, and just follow the other steps.
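The "load text files in another language" tweak could look something like the sketch below: gather one prompt per non-empty line from a folder of .txt files (Korean or otherwise), then feed each prompt through bark to produce paired (semantic tokens, audio) examples. The function name and folder layout are assumptions for illustration, not the actual bark-data-gen code.

```python
from pathlib import Path

def load_prompts(text_dir: str, encoding: str = "utf-8") -> list[str]:
    """Collect one prompt per non-empty line from every .txt file in a folder.

    Hypothetical helper: swap this in wherever bark-data-gen normally
    sources its English prompts, so generation runs on your own language.
    """
    prompts = []
    for path in sorted(Path(text_dir).glob("*.txt")):
        for line in path.read_text(encoding=encoding).splitlines():
            line = line.strip()
            if line:  # skip blank lines
                prompts.append(line)
    return prompts
```

Each returned prompt would then go through bark's generation step, and the resulting semantic/audio pairs become the training set for the quantizer.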
If I understood correctly, you used a custom semantic-voice dataset to train your HuBERT model. Can you tell me how to create this dataset, especially how to get the semantics from a voice? Many thanks for this work.