lucidrains / audiolm-pytorch

Implementation of AudioLM, a SOTA Language Modeling Approach to Audio Generation out of Google Research, in PyTorch
MIT License

Question about length of data in training / generating #208

Closed amitaie closed 10 months ago

amitaie commented 1 year ago

Hey, I'm training the Semantic transformer on 3 seconds of data, and I noticed that at inference time the transformer generates roughly the same total number of tokens, capped at the equivalent of 3 seconds. So if I give the model a 1-second audio prompt, it generates around 60 tokens, and if I give it a 5-second prompt, it doesn't generate anything at all.

Has this happened to anyone else? Is it a bug in my setup, or is it an expected outcome of training with a fixed length? Any idea how to solve it?

BTW - a similar thing also happens with the Coarse transformer.
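The reported behavior is consistent with a fixed positional budget: a transformer trained only on clips of one fixed length learns a total sequence length of roughly `token_rate * clip_seconds` tokens, and prompt tokens consume part of that budget. A minimal sketch of this arithmetic (the token rate and function names here are illustrative assumptions, not the repo's actual API):

```python
# Hypothetical sketch: why a fixed training length caps generation.
# TOKENS_PER_SEC is an assumed tokenizer rate, chosen for illustration only.
TOKENS_PER_SEC = 50   # assumed semantic token rate (tokens per second)
TRAIN_SECONDS = 3     # fixed clip length used during training

def remaining_token_budget(prompt_seconds: float) -> int:
    """Tokens the model can still generate after the prompt fills
    part of its learned sequence-length budget."""
    budget = int(TOKENS_PER_SEC * TRAIN_SECONDS)          # total learned length
    prompt_tokens = int(TOKENS_PER_SEC * prompt_seconds)  # consumed by prompt
    return max(budget - prompt_tokens, 0)

print(remaining_token_budget(1))  # 1 s prompt: room left to generate
print(remaining_token_budget(5))  # 5 s prompt exceeds the budget -> 0
```

Under this model, a prompt longer than the training clip length leaves a budget of zero, matching the "won't generate at all" symptom; training on variable-length (or longer) clips would be the corresponding fix.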