lucidrains / audiolm-pytorch

Implementation of AudioLM, a SOTA Language Modeling Approach to Audio Generation out of Google Research, in PyTorch

Does VALL-E follow the same semantic/coarse hierarchical structure as AudioLM? #233

Closed · williamluer closed this issue 9 months ago

williamluer commented 9 months ago

AudioLM and VALL-E take similar language-model approaches to audio generation, and it seems like you were able to use those similarities to add text conditioning to AudioLM. However, in the VALL-E paper I don't see any reference to semantic token estimation/prediction with w2v-BERT or HuBERT; I only see EnCodec embeddings used as the intermediate representation of the audio.

Do you know if the authors of VALL-E leveraged HuBERT embeddings, or was this a design choice on your end to simplify the implementation?
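
For reference, here is a rough sketch of the two pipelines as I understand them from the papers. All names below are placeholders standing in for trained models, not the actual APIs of either codebase:

```python
from typing import List

Tokens = List[int]

# --- stubbed stages; each function stands in for a trained model ---

def semantic_tokens(wave: List[float]) -> Tokens:
    # AudioLM stage 1: semantic ids from w2v-BERT (HuBERT in this repo)
    return [0] * (len(wave) // 320)

def coarse_from_semantic(semantic: Tokens) -> Tokens:
    # AudioLM stage 2: coarse acoustic tokens (first SoundStream quantizer
    # levels), autoregressively conditioned on the semantic tokens
    return [0] * len(semantic)

def fine_from_coarse(coarse: Tokens) -> Tokens:
    # AudioLM stage 3: remaining (fine) quantizer levels
    return [0] * len(coarse)

def valle_ar(phonemes: Tokens, prompt: Tokens) -> Tokens:
    # VALL-E: autoregressive prediction of the FIRST EnCodec quantizer level,
    # conditioned directly on phonemes plus a short acoustic prompt;
    # note there is no semantic stage at all
    return [0] * (len(phonemes) * 4)

def valle_nar(phonemes: Tokens, prompt: Tokens, first: Tokens) -> List[Tokens]:
    # VALL-E: non-autoregressive passes fill in quantizer levels 2..8
    return [[0] * len(first) for _ in range(7)]

# AudioLM: waveform -> semantic -> coarse -> fine
wave = [0.0] * 16000
coarse = coarse_from_semantic(semantic_tokens(wave))
fine = fine_from_coarse(coarse)

# VALL-E: phonemes (+ prompt) -> level 1 -> levels 2..8
phonemes, prompt = [1, 2, 3], [0] * 150
first = valle_ar(phonemes, prompt)
rest = valle_nar(phonemes, prompt, first)
```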

lucidrains commented 9 months ago

@williamluer yea, i'm no longer following papers to the letter and just following intuition now

each paper has its main idea, so i'm mixing and matching the core proposals

lucidrains commented 9 months ago

basically all the VALL-E paper showed is that their specific way of attention conditioning works. i don't think it matters whether the conditioned tokens are semantic or acoustic
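
to illustrate the point, here's a toy sketch (not this repo's actual modules): text embeddings are prepended as an attention prefix, in the spirit of the prefix-conditioning option this repo offers, and the only thing that changes between a "semantic" and an "acoustic" model is the output vocabulary. causal masking is omitted for brevity

```python
import torch
from torch import nn

class ConditionedTokenTransformer(nn.Module):
    """Toy next-token model with text conditioning prepended as a prefix.
    Illustrative only; causal masking is omitted for brevity."""

    def __init__(self, num_tokens: int, dim: int = 256, depth: int = 2):
        super().__init__()
        self.token_emb = nn.Embedding(num_tokens, dim)
        layer = nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, num_layers=depth)
        self.to_logits = nn.Linear(dim, num_tokens)

    def forward(self, token_ids, text_cond):
        # text_cond: (batch, cond_len, dim), e.g. frozen text-encoder output
        x = self.token_emb(token_ids)
        x = torch.cat((text_cond, x), dim=1)   # condition as attention prefix
        x = self.transformer(x)
        return self.to_logits(x[:, text_cond.shape[1]:])  # drop prefix slots

# the identical module conditions either token type; only the vocab changes
semantic_model = ConditionedTokenTransformer(num_tokens=500)   # HuBERT ids
acoustic_model = ConditionedTokenTransformer(num_tokens=1024)  # codec ids

text = torch.randn(1, 12, 256)        # stand-in for text encoder embeddings
ids = torch.randint(0, 500, (1, 64))
logits = semantic_model(ids, text)    # -> (1, 64, 500)
```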

williamluer commented 9 months ago

Understood, thank you!