Closed · williamluer closed this issue 1 year ago
AudioLM and VALL-E take similar approaches to audio generation with language models, and it seems like you were able to use these similarities to allow for text-conditioning of AudioLM. However, in the VALL-E paper I do not see any reference to semantic token estimation/prediction with w2v-BERT or HuBERT; I only see the EnCodec embeddings used as intermediate representations of the audio.

Do you know if the authors of VALL-E leveraged HuBERT embeddings, or was this a design choice on your end to simplify the implementation?
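For concreteness, here is a minimal sketch of the two token streams being discussed: EnCodec residual-VQ codes (the acoustic tokens VALL-E models directly) versus k-means cluster ids over HuBERT features (the semantic tokens AudioLM adds as a coarse first stage). It assumes Meta's `encodec` package and the `HubertWithKmeans` wrapper from `audiolm_pytorch`; the checkpoint paths, sample rates, and shapes are illustrative placeholders, not taken from this thread.

```python
import torch
from encodec import EncodecModel                 # Meta's neural codec (pip install encodec)
from audiolm_pytorch import HubertWithKmeans     # HuBERT + k-means semantic tokens

# Acoustic tokens -- the only audio representation VALL-E models (EnCodec RVQ codes)
codec = EncodecModel.encodec_model_24khz()
codec.set_target_bandwidth(6.0)                  # 8 codebooks at 6 kbps
codec.eval()
wav_24k = torch.randn(1, 1, 24_000)              # (batch, channels, samples), 1 s of noise
with torch.no_grad():
    frames = codec.encode(wav_24k)               # list of (codes, scale) per chunk
acoustic_tokens = torch.cat([codes for codes, _ in frames], dim=-1)   # (b, n_q, t)

# Semantic tokens -- AudioLM's extra coarse stage (HuBERT features quantized by k-means)
wav2vec = HubertWithKmeans(
    checkpoint_path = './hubert/hubert_base_ls960.pt',                # placeholder paths
    kmeans_path = './hubert/hubert_base_ls960_L9_km500.bin'
)
wav_16k = torch.randn(1, 16_000)                 # HuBERT checkpoints expect 16 kHz audio
with torch.no_grad():
    semantic_tokens = wav2vec(wav_16k)           # (b, t') cluster ids

print(acoustic_tokens.shape, semantic_tokens.shape)
```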
@williamluer Yeah, I'm no longer following papers to the letter and am just going by intuition now. Each paper has its main idea, so I'm mixing and matching the core proposals.

Basically, all the VALL-E paper showed is that their specific way of attention conditioning works. I don't think it matters whether it is semantic or acoustic tokens.
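To illustrate the kind of conditioning being referred to, here is a toy sketch of prefix-style conditioning in a decoder-only transformer: the conditioning ids (e.g. phonemes) are embedded and prepended to the target token sequence, and causal self-attention does the rest. The class, vocab sizes, and dimensions are made up for illustration, positional encodings and loss shifting are omitted, and this is not VALL-E's exact masking scheme.

```python
import torch
import torch.nn as nn

class PrefixConditionedLM(nn.Module):
    """Toy decoder-only LM: conditioning tokens are prepended to the target
    tokens (semantic or acoustic, the mechanism is the same) and everything
    shares one causal self-attention stack."""

    def __init__(self, n_cond_tokens=256, n_audio_tokens=1024, dim=512, depth=6):
        super().__init__()
        self.cond_emb = nn.Embedding(n_cond_tokens, dim)
        self.audio_emb = nn.Embedding(n_audio_tokens, dim)
        layer = nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, depth)
        self.to_logits = nn.Linear(dim, n_audio_tokens)

    def forward(self, cond_ids, audio_ids):
        # embed and concatenate: [conditioning prefix | audio tokens]
        x = torch.cat([self.cond_emb(cond_ids), self.audio_emb(audio_ids)], dim=1)
        n = x.shape[1]
        causal_mask = torch.triu(torch.ones(n, n, dtype=torch.bool), diagonal=1)
        x = self.transformer(x, mask=causal_mask)
        # predict only at the audio positions (next-token shifting omitted)
        return self.to_logits(x[:, cond_ids.shape[1]:])

model = PrefixConditionedLM()
phonemes = torch.randint(0, 256, (1, 32))     # e.g. phoneme ids
audio    = torch.randint(0, 1024, (1, 200))   # e.g. first-quantizer EnCodec ids
print(model(phonemes, audio).shape)           # torch.Size([1, 200, 1024])
```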
Understood, thank you!