(Took me way too long to realize this, and it just goes to show that most of us are just point-and-click type of fellas who don't really understand what we're using - not really a skiddie, because we can code, but... you get what I mean)
So if the whole point of using Bark-generated audio to train a quantizer like this,
instead of simply grabbing a massive, good-quality audio dataset and having Whisper transcribe it, then adding tags or correcting as needed (or, god forbid, manually hunting down voice clips with actually good audio that more or less matches what you hear),
is simply that you don't know exactly how they trained their HuBERT voice features to semantic tokens mapping, the unknown here being the semantic tokens, and you want to make sure you at least start from THEIR HuBERT-features-to-tokens mapping and refine it,
... then couldn't a general-purpose HuBERT-to-semantic-tokens quantizer be made instead? You would just generate or aggregate data for all the supported languages, generate datasets if you don't already have them, and train a quantizer on ALL OF THAT, since its aim is just the reverse: "tell me the semantic tokens for this series of sounds." It should theoretically cover any "known" language
(minus the African ones, because there's no statistically significant presence of tongue-click languages on the internet - though knowing Bark and its random noises during generation, and assuming the HuBERT model has learned those too, it probably CAN map a tongue-click language as well)
I see you did it for English, but I'm wondering why everyone has stopped at a single-language quantizer when it could probably be made into an omnilingual one.
I ask in the name of languages like Klingon, Middle English, Old English, and Vietnamese with a southern accent...
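
To make it concrete, here's very roughly what I have in mind (untested sketch: the Bark function names and the 10k semantic codebook size are from memory, and the sentence lists plus the little LSTM head are just placeholders for whatever the real quantizer architecture should be):

```python
# Rough sketch (untested): generate multilingual (audio, semantic-token) pairs with Bark,
# extract HuBERT features from that audio, and fit a per-frame classifier that maps
# features -> Bark semantic token IDs. Double-check the bark/torchaudio names before running.
import torch
import torch.nn as nn
import torchaudio
from bark.generation import generate_text_semantic, preload_models  # Bark API, as I recall it
from bark.api import semantic_to_waveform                           # same caveat
from bark import SAMPLE_RATE

N_SEMANTIC_TOKENS = 10_000  # Bark's semantic codebook size, to the best of my knowledge

# --- 1. build a multilingual dataset out of Bark itself ------------------------------
sentences = {
    "en": ["The quick brown fox jumps over the lazy dog."],
    "de": ["Der schnelle braune Fuchs springt über den faulen Hund."],
    "vi": ["Con cáo nâu nhanh nhẹn nhảy qua con chó lười."],
    # ...every language Bark supports, thousands of lines each in practice
}

preload_models()
bundle = torchaudio.pipelines.HUBERT_BASE
hubert = bundle.get_model().eval()

pairs = []  # list of (hubert_features [T, 768], semantic_tokens [T'])
for lang, texts in sentences.items():
    for text in texts:
        semantic = generate_text_semantic(text)    # ground-truth tokens, straight from Bark
        audio = semantic_to_waveform(semantic)     # the matching waveform
        wav = torch.from_numpy(audio).float().unsqueeze(0)
        wav = torchaudio.functional.resample(wav, SAMPLE_RATE, bundle.sample_rate)
        with torch.no_grad():
            feats, _ = hubert.extract_features(wav)  # list of per-layer feature tensors
        pairs.append((feats[-1].squeeze(0), torch.from_numpy(semantic).long()))

# --- 2. a tiny "quantizer": per-frame classifier over the semantic codebook ----------
class SemanticQuantizer(nn.Module):
    def __init__(self, dim=768, hidden=512, n_tokens=N_SEMANTIC_TOKENS):
        super().__init__()
        self.rnn = nn.LSTM(dim, hidden, batch_first=True, bidirectional=True)
        self.head = nn.Linear(2 * hidden, n_tokens)

    def forward(self, feats):        # feats: [B, T, 768]
        out, _ = self.rnn(feats)
        return self.head(out)        # [B, T, n_tokens]

model = SemanticQuantizer()
opt = torch.optim.AdamW(model.parameters(), lr=1e-4)
loss_fn = nn.CrossEntropyLoss()

for epoch in range(10):
    for feats, tokens in pairs:
        # HuBERT frames (~50 Hz) and Bark semantic tokens (also ~50 Hz) won't line up
        # exactly; crudely truncate both to the shorter length here, align properly for real.
        T = min(feats.shape[0], tokens.shape[0])
        logits = model(feats[:T].unsqueeze(0))
        loss = loss_fn(logits.squeeze(0), tokens[:T])
        opt.zero_grad(); loss.backward(); opt.step()
```

The point being: the language only enters through the text you feed Bark, so nothing in the training loop itself is English-specific.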