thuml / iVideoGPT

Official repo for "iVideoGPT: Interactive VideoGPTs are Scalable World Models", https://arxiv.org/abs/2405.15223
https://thuml.github.io/iVideoGPT/
MIT License

a question about max_attn_resolution and crossattn layer numbers #3

Open yangyichu opened 2 months ago

yangyichu commented 2 months ago

I see that in the provided example checkpoint, max_attn_resolution is set to 16. During encoding, the image goes through down-blocks at 64x64, 32x32, 16x16, and 16x16, so cross-attention is added twice (after the two 16x16 down-blocks). During decoding, however, the image goes through 16x16, 32x32, 64x64, and 64x64, so cross-attention is added only once. Is this expected behavior (resulting in an asymmetric encoder and decoder structure)?
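
For concreteness, here is a minimal sketch of how a single max_attn_resolution threshold produces this asymmetry. The helper `count_cross_attn` and the hard-coded stage lists are purely illustrative, not the repo's actual code:

```python
def count_cross_attn(stage_resolutions, max_attn_resolution=16):
    """Count cross-attention blocks, assuming one block is inserted after
    each stage whose spatial resolution is at or below the threshold."""
    return sum(1 for res in stage_resolutions if res <= max_attn_resolution)

# Encoder down-blocks run 64 -> 32 -> 16 -> 16, so two stages qualify.
encoder_stages = [64, 32, 16, 16]
# Decoder up-blocks run 16 -> 32 -> 64 -> 64, so only the first qualifies.
decoder_stages = [16, 32, 64, 64]

print(count_cross_attn(encoder_stages))  # 2
print(count_cross_attn(decoder_stages))  # 1
```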

Manchery commented 2 months ago

Hi, thank you for your interest in our work! You are correct: there are two cross-attention blocks in the encoder but only one in the decoder. This asymmetry wasn't an intentional design choice. Initially, the cross-attention mechanism was meant to be applied to multi-scale features, but I set max_attn_resolution to 16 mainly to save memory. Despite this, the current architecture performs well in practice. I will run experiments with more cross-attention blocks (e.g., setting max_attn_resolution to 32) to see whether this further improves performance. Thank you for bringing this to my attention!
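
Using the same illustrative helper from the sketch above (again, an assumption about how the threshold is applied, not the repo's actual code), raising the threshold to 32 would add one more cross-attention block on each side:

```python
# Encoder stages at 32, 16, 16 qualify; decoder stages at 16, 32 qualify.
print(count_cross_attn(encoder_stages, max_attn_resolution=32))  # 3
print(count_cross_attn(decoder_stages, max_attn_resolution=32))  # 2
```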