gitmylo / audio-webui

A webui for different audio related Neural Networks
MIT License
964 stars 90 forks

[QUESTION] reaching out for a contact, and to share some details about bark re: the tokens it uses #223

Open Alignment-Lab-AI opened 3 months ago

Alignment-Lab-AI commented 3 months ago

iirc, bark is an encoder-decoder based on T5, and uses wav2vec-bert as the encoder.

I don't recall where I learned that, but I feel like I validated it at some point. wav2vec-bert uses tokens that represent phonemes; those are converted to encodec codebooks, which produce the final audio.

I'd love to discuss some of our current work with TTS, STT, and STS if you have the bandwidth! My email is autometa@alignmentlab.ai

More context about our org at https://AlignmentLab.ai

Thanks! Austin

gitmylo commented 3 months ago

When I was figuring out how bark works, I wrote down my observations here.

I came up with a few theoretical methods for voice cloning, which would work if my observations were correct. They were, and I published the code. The source code for "Method 2" can be found here. The source code for "Method 3" can probably be found in the commit history for audio-webui, but you might need to go really far back. Method 3's outputs are far less convincing and much lower quality than Method 2's, so it might not be as interesting, but it could still help explain exactly how bark's 3-step process works.
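
For reference, a rough sketch of the data flow in that 3-step process (text to semantic tokens, semantic tokens to coarse codebooks, coarse to fine codebooks, then an EnCodec decode). This is a shape-only toy and not bark's code: the function names are hypothetical, random numbers stand in for the three models, and the constants are assumptions recalled from bark's public generation code, so check them against the repo before relying on them.

```python
import random

# Assumed constants (recalled from bark's public generation code; verify before use).
SEMANTIC_VOCAB_SIZE = 10_000   # size of the semantic token vocabulary
SEMANTIC_RATE_HZ = 49.9        # semantic tokens per second of audio
CODEBOOK_SIZE = 1024           # entries per EnCodec codebook
COARSE_RATE_HZ = 75            # EnCodec frames per second
N_COARSE_CODEBOOKS = 2         # codebooks predicted in step 2
N_FINE_CODEBOOKS = 8           # total codebooks after step 3
SAMPLE_RATE = 24_000           # output sample rate in Hz


def text_to_semantic(text, seconds, rng):
    """Step 1 stand-in: text -> flat list of semantic tokens (random here)."""
    n_tokens = round(seconds * SEMANTIC_RATE_HZ)
    return [rng.randrange(SEMANTIC_VOCAB_SIZE) for _ in range(n_tokens)]


def semantic_to_coarse(semantic, rng):
    """Step 2 stand-in: semantic tokens -> first 2 codebooks, one row each."""
    n_frames = round(len(semantic) / SEMANTIC_RATE_HZ * COARSE_RATE_HZ)
    return [[rng.randrange(CODEBOOK_SIZE) for _ in range(n_frames)]
            for _ in range(N_COARSE_CODEBOOKS)]


def coarse_to_fine(coarse, rng):
    """Step 3 stand-in: keep the coarse rows, fill in the remaining 6 rows."""
    n_frames = len(coarse[0])
    extra = [[rng.randrange(CODEBOOK_SIZE) for _ in range(n_frames)]
             for _ in range(N_FINE_CODEBOOKS - N_COARSE_CODEBOOKS)]
    return [list(row) for row in coarse] + extra


def decoded_sample_count(fine):
    """Number of audio samples an EnCodec decode of these frames would yield."""
    return len(fine[0]) * (SAMPLE_RATE // COARSE_RATE_HZ)


rng = random.Random(0)
semantic = text_to_semantic("hello world", seconds=2.0, rng=rng)
coarse = semantic_to_coarse(semantic, rng)
fine = coarse_to_fine(coarse, rng)
print(len(semantic), len(coarse), len(fine), decoded_sample_count(fine))
```

The point of the sketch is only the shapes: a 1-D semantic sequence, then a 2-row codebook array, then an 8-row array whose first two rows are the coarse ones, which is what gets decoded to audio.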