microsoft / torchscale

Foundation Architecture for (M)LLMs
https://aka.ms/GeneralAI
MIT License

Question regarding the configuration of decoder_retention_heads #84

Open Kratos-Wen opened 8 months ago

Kratos-Wen commented 8 months ago

Thank you for your great work!

I've noticed that decoder_retention_heads is set to 3 by default, and that the mask is expanded along the head dimension to match. Have you experimented with how performance differs under different numbers of heads? Is the default configuration sufficient in terms of retention performance? Since your model is primarily aimed at sequence modeling in language processing, and I am looking to extend it to image processing, I'm unsure whether I should modify this setting.

Thank you in advance for your response.
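For context, here is a minimal sketch of why the decay mask carries a head dimension, following the multi-scale retention idea from the RetNet paper; the function name and constants are illustrative and not the exact torchscale implementation:

```python
import torch

def build_decay_mask(num_heads: int, seq_len: int) -> torch.Tensor:
    """Build a causal decay mask of shape (num_heads, seq_len, seq_len).

    Each head gets its own decay rate gamma, so the mask needs a head
    dimension; changing the head count changes how many decay scales exist.
    """
    # One decay rate per head; this spacing follows the RetNet paper,
    # but the exact constants here are illustrative.
    gamma = 1 - 2 ** (-5 - torch.arange(num_heads, dtype=torch.float))

    idx = torch.arange(seq_len)
    diff = (idx[:, None] - idx[None, :]).float()        # n - m for every position pair
    decay = gamma[:, None, None] ** diff.clamp(min=0)   # gamma^(n-m), per head
    mask = torch.where(diff >= 0, decay, torch.zeros_like(decay))  # zero out future positions
    return mask

mask = build_decay_mask(num_heads=3, seq_len=8)
print(mask.shape)  # torch.Size([3, 8, 8])
```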

jpokemon232 commented 7 months ago

When I was adjusting the RetNet configuration, I also ran into this issue. Could you add an assert that decoder_embed_dim and decoder_value_embed_dim must each be a multiple of decoder_retention_heads?
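Something along these lines; the helper name and placement are hypothetical, not existing torchscale API:

```python
def check_retnet_dims(decoder_embed_dim: int,
                      decoder_value_embed_dim: int,
                      decoder_retention_heads: int) -> None:
    # Hypothetical standalone check: both embedding sizes must split
    # evenly across the retention heads.
    assert decoder_embed_dim % decoder_retention_heads == 0, (
        f"decoder_embed_dim ({decoder_embed_dim}) must be a multiple of "
        f"decoder_retention_heads ({decoder_retention_heads})"
    )
    assert decoder_value_embed_dim % decoder_retention_heads == 0, (
        f"decoder_value_embed_dim ({decoder_value_embed_dim}) must be a multiple of "
        f"decoder_retention_heads ({decoder_retention_heads})"
    )
```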

sunyt32 commented 7 months ago

@Kratos-Wen decoder_retention_heads affects the per-head key_dim, which is recommended to be set to 256.
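In other words, the per-head key dimension is decoder_embed_dim divided by decoder_retention_heads, so the head count should be chosen so that this comes out to roughly 256. A small illustrative calculation (values are examples, not prescribed settings):

```python
# Illustrative numbers only: with a per-head key dimension of 256,
# the head count follows from the embedding width.
decoder_embed_dim = 768            # example value
key_dim_per_head = 256             # recommended per-head key dimension
decoder_retention_heads = decoder_embed_dim // key_dim_per_head
print(decoder_retention_heads)     # 3
```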