Hey, I'm training the Semantic transformer on 3 seconds of data, and I noticed that at inference time the transformer generates tokens only up to about 3 seconds of total audio. So if I give the model a 1-second audio prompt it generates around 60 tokens, and if I give it a 5-second prompt it doesn't generate anything at all.
Has this happened to anyone? Is it a bug in my repo, or is it an expected outcome of training with a fixed length? Any idea how to solve it?
BTW, a similar thing also happens in the Coarse transformer.
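To make the symptom concrete, here is a minimal sketch of the hypothesis that the model learned a hard length cap from the fixed 3-second training clips. The token rate (~30 semantic tokens per second) is inferred from the numbers above (a 1 s prompt leaving ~60 tokens of budget) and is an assumption, not a confirmed value:

```python
# Hypothesis: generation stops once prompt length + generated length reaches
# the fixed sequence length seen during training.
TOKENS_PER_SEC = 30   # assumed rate, inferred from ~60 tokens after a 1 s prompt
TRAIN_SECONDS = 3     # fixed clip length used during training

def expected_generated_tokens(prompt_seconds: float) -> int:
    """Token budget left before hitting the length the model saw in training."""
    budget = (TRAIN_SECONDS - prompt_seconds) * TOKENS_PER_SEC
    return max(0, int(budget))

print(expected_generated_tokens(1))  # ~60, matching the observation
print(expected_generated_tokens(5))  # 0: prompt alone already exceeds the cap
```

If the observed counts track this formula, that would point to the fixed training length (rather than a bug in the sampling loop) as the cause.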