Hey, I'm training the Semantic transformer on 3 seconds of data, and I noticed that at inference time the transformer generates tokens only up to about 3 seconds of total audio. So if I give the model a 1-second audio prompt it generates around 60 tokens, and if I give it a 5-second prompt it doesn't generate anything at all.
Has this happened to anyone? Is it a bug in my repo, or is it an expected outcome of training with a fixed length? Any idea how to solve it?
BTW, a similar thing also happens in the Coarse transformer.
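To make the symptom concrete, here is a minimal sketch of the hypothesis that the model learned a hard length cap from the fixed 3-second training clips. The token rate (~30 semantic tokens per second) is inferred from the numbers above (a 1 s prompt leaving ~60 tokens of budget) and is an assumption, not a confirmed value:

```python
# Hypothesis: generation stops once prompt length + generated length reaches
# the fixed sequence length seen during training.
TOKENS_PER_SEC = 30   # assumed rate, inferred from ~60 tokens after a 1 s prompt
TRAIN_SECONDS = 3     # fixed clip length used during training

def expected_generated_tokens(prompt_seconds: float) -> int:
    """Token budget left before hitting the length the model saw in training."""
    budget = (TRAIN_SECONDS - prompt_seconds) * TOKENS_PER_SEC
    return max(0, int(budget))

print(expected_generated_tokens(1))  # ~60, matching the observation
print(expected_generated_tokens(5))  # 0: prompt alone already exceeds the cap
```

If the observed counts track this formula, that would point to the fixed training length (rather than a bug in the sampling loop) as the cause.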