Closed · williamluer closed this issue 1 year ago
AudioLM and VALL-E take similar approaches to audio generation with language models, and it seems like you were able to use these similarities to allow for text-conditioning of AudioLM. However, in the VALL-E paper I do not see any reference to semantic token estimation/prediction with w2v-BERT or HuBERT; I only see the EnCodec embeddings used as intermediate representations of the audio.

Do you know if the authors of VALL-E leveraged HuBERT embeddings, or was this a design choice on your end to simplify the implementation?
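For concreteness, here is a minimal sketch of the two token streams being discussed: EnCodec residual-VQ codes (the acoustic tokens VALL-E models directly) versus k-means cluster ids over HuBERT features (the semantic tokens AudioLM adds as a coarse first stage). It assumes Meta's `encodec` package and the `HubertWithKmeans` wrapper from `audiolm_pytorch`; the checkpoint paths, sample rates, and shapes are illustrative placeholders, not taken from this thread.

```python
import torch
from encodec import EncodecModel                 # Meta's neural codec (pip install encodec)
from audiolm_pytorch import HubertWithKmeans     # HuBERT + k-means semantic tokens

# Acoustic tokens -- the only audio representation VALL-E models (EnCodec RVQ codes)
codec = EncodecModel.encodec_model_24khz()
codec.set_target_bandwidth(6.0)                  # 8 codebooks at 6 kbps
codec.eval()
wav_24k = torch.randn(1, 1, 24_000)              # (batch, channels, samples), 1 s of noise
with torch.no_grad():
    frames = codec.encode(wav_24k)               # list of (codes, scale) per chunk
acoustic_tokens = torch.cat([codes for codes, _ in frames], dim=-1)   # (b, n_q, t)

# Semantic tokens -- AudioLM's extra coarse stage (HuBERT features quantized by k-means)
wav2vec = HubertWithKmeans(
    checkpoint_path = './hubert/hubert_base_ls960.pt',                # placeholder paths
    kmeans_path = './hubert/hubert_base_ls960_L9_km500.bin'
)
wav_16k = torch.randn(1, 16_000)                 # HuBERT checkpoints expect 16 kHz audio
with torch.no_grad():
    semantic_tokens = wav2vec(wav_16k)           # (b, t') cluster ids

print(acoustic_tokens.shape, semantic_tokens.shape)
```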
@williamluer Yeah, I'm no longer following papers to the letter and am just going by intuition now. Each paper has its main idea, so I'm mixing and matching the core proposals.

Basically, all the VALL-E paper showed is that their specific way of attention conditioning works. I don't think it matters whether it is semantic or acoustic tokens.
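To illustrate the kind of conditioning being referred to, here is a toy sketch of prefix-style conditioning in a decoder-only transformer: the conditioning ids (e.g. phonemes) are embedded and prepended to the target token sequence, and causal self-attention does the rest. The class, vocab sizes, and dimensions are made up for illustration, positional encodings and loss shifting are omitted, and this is not VALL-E's exact masking scheme.

```python
import torch
import torch.nn as nn

class PrefixConditionedLM(nn.Module):
    """Toy decoder-only LM: conditioning tokens are prepended to the target
    tokens (semantic or acoustic, the mechanism is the same) and everything
    shares one causal self-attention stack."""

    def __init__(self, n_cond_tokens=256, n_audio_tokens=1024, dim=512, depth=6):
        super().__init__()
        self.cond_emb = nn.Embedding(n_cond_tokens, dim)
        self.audio_emb = nn.Embedding(n_audio_tokens, dim)
        layer = nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, depth)
        self.to_logits = nn.Linear(dim, n_audio_tokens)

    def forward(self, cond_ids, audio_ids):
        # embed and concatenate: [conditioning prefix | audio tokens]
        x = torch.cat([self.cond_emb(cond_ids), self.audio_emb(audio_ids)], dim=1)
        n = x.shape[1]
        causal_mask = torch.triu(torch.ones(n, n, dtype=torch.bool), diagonal=1)
        x = self.transformer(x, mask=causal_mask)
        # predict only at the audio positions (next-token shifting omitted)
        return self.to_logits(x[:, cond_ids.shape[1]:])

model = PrefixConditionedLM()
phonemes = torch.randint(0, 256, (1, 32))     # e.g. phoneme ids
audio    = torch.randint(0, 1024, (1, 200))   # e.g. first-quantizer EnCodec ids
print(model(phonemes, audio).shape)           # torch.Size([1, 200, 1024])
```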
Understood, thank you!