edofazza opened 6 months ago
I am also interested in the answer to this question. Say we have a model with a Transformer-based backbone and we want to integrate the Mamba module into it. Do we use Mamba in place of the whole Transformer layer, or do we swap out only the attention layer?
I want to modify an architecture that passes a tensor x of size (8, 16, 512) (the first value is the batch size) and a query_embed parameter of size (8, 140, 512) through a torch.nn.Transformer layer with d_model set to 512 and all other parameters equal to those in the paper "Attention Is All You Need", replacing it with Mamba layers. How can I build this correspondence between the Transformer layer and a Mamba-based architecture? Thank you.
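For reference, here is a minimal sketch of one possible mapping, not a definitive answer. It assumes batch-first tensors with the shapes described above and the `Mamba` block from the `mamba_ssm` package (https://github.com/state-spaces/mamba), which typically requires a CUDA GPU. Since Mamba has no cross-attention, the encoder-decoder call of `torch.nn.Transformer` has no one-to-one equivalent; the sketch instead concatenates the source tokens and the queries along the sequence dimension, runs the combined sequence through a stack of Mamba blocks with pre-norm residual connections, and slices out the query positions as the "decoder" output. Swapping only the attention sublayer inside each Transformer layer (keeping the feed-forward parts) is the other common option mentioned above.

```python
# Hypothetical sketch: replacing an encoder-decoder torch.nn.Transformer
# with a stack of Mamba blocks by concatenating source tokens and queries.
import torch
import torch.nn as nn
from mamba_ssm import Mamba  # pip install mamba-ssm (needs CUDA kernels)


class MambaStack(nn.Module):
    """One possible Mamba-based stand-in for the described Transformer layer."""

    def __init__(self, d_model=512, n_layers=6):
        super().__init__()
        self.layers = nn.ModuleList(Mamba(d_model=d_model) for _ in range(n_layers))
        self.norms = nn.ModuleList(nn.LayerNorm(d_model) for _ in range(n_layers))

    def forward(self, x, query_embed):
        # x:           (B, 16, 512)  source tokens
        # query_embed: (B, 140, 512) decoder-style queries
        seq = torch.cat([x, query_embed], dim=1)        # (B, 156, 512)
        for norm, layer in zip(self.norms, self.layers):
            seq = seq + layer(norm(seq))                # pre-norm residual Mamba block
        return seq[:, x.size(1):]                       # (B, 140, 512), like the decoder output


# Usage with the shapes from the question:
model = MambaStack(d_model=512, n_layers=6).cuda()
x = torch.randn(8, 16, 512, device="cuda")
query_embed = torch.randn(8, 140, 512, device="cuda")
out = model(x, query_embed)  # (8, 140, 512)
```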