hubertsiuzdak / snac

Multi-Scale Neural Audio Codec (SNAC) compresses audio into discrete codes at a low bitrate
https://hubertsiuzdak.github.io/snac/
MIT License

Few questions about the implementation #3

Closed panamka closed 9 months ago

panamka commented 9 months ago

Hello! Thank you very much for the great work!

I want to try to train a model based on yours, but for speech coding. And I have a few questions.

  1. I see a new release with localMHA. But is it possible to get good results with a model that was in the initial release without localMHA?
  2. I also want to train at different sample rates. Should I change the architecture for different sample rates? For example, strides, dilation, or the number of codes?
  3. And do I understand correctly that to make a causal version of the model, it is enough to make WNConv1d and WNConvTranspose1d causal, that is, to change the padding?

Thank you in advance

hubertsiuzdak commented 9 months ago
  1. I added local attention on non-overlapping blocks and it really improved the quality of music in all my experiments. You can definitely get good quality without MHA, but it would probably require more codebooks (i.e. a higher bitrate). I just wanted as few tokens as possible to make audio generation with language models easier.

LocalMHA currently operates on chunks of about 0.3 seconds, so this latency might not be acceptable for some cases. It's kind of a trade-off between lower bitrate (fewer tokens) and lower latency.
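For intuition, attention on non-overlapping blocks just reshapes the sequence into windows and attends within each window, so no token can see past its block boundary. This is a minimal single-head NumPy sketch; `local_block_attention`, the shapes, and the window size are hypothetical, not the repo's actual LocalMHA module:

```python
import numpy as np

def local_block_attention(q, k, v, window):
    """Attention restricted to non-overlapping blocks of length `window`.

    q, k, v: (T, d) arrays; T must be a multiple of `window`.
    Hypothetical single-head sketch, not the SNAC implementation.
    """
    T, d = q.shape
    assert T % window == 0, "pad the sequence to a multiple of the window"
    # Reshape into (num_blocks, window, d) so each block only attends to itself.
    qb = q.reshape(-1, window, d)
    kb = k.reshape(-1, window, d)
    vb = v.reshape(-1, window, d)
    scores = qb @ kb.transpose(0, 2, 1) / np.sqrt(d)
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return (weights @ vb).reshape(T, d)
```

Because blocks are independent, the cost is linear in sequence length, but the model must buffer a full block before attending, which is where the ~0.3 s latency comes from.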

  2. For lower sample rates, I'd suggest lowering the decoder_dim (and encoder dim), as the current parameter count is pretty high for vocoders. For speech, a lower bitrate (fewer codebooks) should also be enough.

  3. Yes
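To illustrate the causal-padding idea from question 3: a 1-D convolution becomes causal if you left-pad the input with `(kernel_size - 1) * dilation` zeros, so each output sample depends only on current and past inputs. `causal_conv1d` below is a hypothetical NumPy helper for a single channel, not the actual `WNConv1d` module:

```python
import numpy as np

def causal_conv1d(x, kernel, dilation=1):
    """Causal 1-D convolution sketch: output[t] uses only x[t] and earlier samples.

    x: (T,) signal, kernel: (K,) taps. Hypothetical helper; in a real module
    you'd apply the left padding inside the Conv1d wrapper instead.
    """
    K = len(kernel)
    pad = (K - 1) * dilation
    xp = np.concatenate([np.zeros(pad), x])  # left-pad only => causal
    out = np.zeros(len(x))
    for t in range(len(x)):
        # Taps look back at x[t], x[t - dilation], ..., x[t - (K-1)*dilation].
        taps = xp[t + pad - np.arange(K) * dilation]
        out[t] = np.dot(kernel, taps)
    return out
```

For the transposed convolutions the same principle applies in mirror form: trim the output so no sample is produced from future inputs.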

btw just so you know, I plan to release a speech codec next week
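To make the codebook/bitrate trade-off from answer 2 concrete: the bitrate of a multi-scale codec is roughly the sum, over quantizer levels, of frame rate times bits per code. The helper and all numbers below are illustrative assumptions, not the released model's configuration:

```python
import math

def multiscale_bitrate(sample_rate, hop_lengths, codebook_size=4096):
    """Rough bits-per-second estimate for a multi-scale codec (hypothetical helper).

    Each quantizer level i emits one code every hop_lengths[i] samples,
    and each code carries log2(codebook_size) bits.
    """
    bits_per_code = math.log2(codebook_size)
    return sum(sample_rate / hop * bits_per_code for hop in hop_lengths)
```

Dropping a codebook level removes its entire `frame_rate * bits_per_code` term, which is why fewer codebooks directly means a lower bitrate.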

panamka commented 9 months ago

Thank you very much for the valuable comments. I will keep an eye on new releases.