state-spaces / mamba

Mamba SSM architecture
Apache License 2.0

How does mamba support cross attention? #229

Open lqniunjunlper opened 6 months ago

tridao commented 6 months ago

No we don't have an analogue to cross attention.

Zeno673 commented 5 months ago

Can you achieve it? If not, I don't think it would be a good general-purpose model for dealing with multi-modal data.

tridao commented 5 months ago

That's an open research question.

takfate commented 5 months ago

@Zeno673 Hello, we evaluate using Mamba as a bidirectional multi-modal encoder in our recent work, video-mamba-suite. We find that directly concatenating textual and visual tokens can effectively perform cross-modal interaction, for instance in video temporal grounding tasks. So I believe simple concatenation may be a simple yet useful substitute for cross attention.
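A minimal sketch of that concatenation recipe (my own illustration, not code from video-mamba-suite; `text_tokens` / `vis_tokens` are placeholder inputs, and approximating bidirectionality with a forward pass plus a flipped-sequence pass is an assumption):

```python
# Sketch only: fuse modalities by concatenation, then encode with Mamba.
import torch
from mamba_ssm import Mamba

d_model = 256
mamba_fwd = Mamba(d_model=d_model, d_state=16, d_conv=4, expand=2).to("cuda")
mamba_bwd = Mamba(d_model=d_model, d_state=16, d_conv=4, expand=2).to("cuda")

text_tokens = torch.randn(2, 32, d_model, device="cuda")   # (batch, L_text, d_model)
vis_tokens  = torch.randn(2, 196, d_model, device="cuda")  # (batch, L_vis,  d_model)

# Cross-modal interaction via plain concatenation along the sequence axis.
x = torch.cat([text_tokens, vis_tokens], dim=1)            # (batch, L_text + L_vis, d_model)

# "Bidirectional" encoding: forward scan plus reversed scan, merged by addition.
y = mamba_fwd(x) + mamba_bwd(x.flip(dims=[1])).flip(dims=[1])
```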

betterze commented 5 months ago

@tridao What about interleaving cross-attention layers with Mamba layers? Do you think this is a good idea? Thx a lot.

HuangChiEn commented 1 month ago

> @Zeno673 Hello, we evaluate using Mamba as a bidirectional multi-modal encoder in our recent work, video-mamba-suite. We find that directly concatenating textual and visual tokens can effectively perform cross-modal interaction, for instance in video temporal grounding tasks. So I believe simple concatenation may be a simple yet useful substitute for cross attention.

Thanks for sharing, it's nice to hear that. It would also be great if a more explicit design for the cross-context-dependent matrices (the time-variant A, B, C, D in the SSM) could be proposed ~

HuangChiEn commented 1 month ago

> Can you achieve it? If not, I don't think it would be a good general-purpose model for dealing with multi-modal data.

To be honest, we should agree with this comment (even though it doesn't sound very nice). In self-attention, the core concept is a database-like mechanism (Q, K, V mapping and cross-comparison). When the inputs come from different sources, that design extends very easily, without much mental overhead.

On the other hand, Mamba aims to address the shortcomings of self-attention, not the database-like mechanism itself. This leads to a targeted improvement of a single functional module (rather than an entirely new concept for a general-purpose module).

For multi-modal data, I would still suggest using a cross-attention mechanism for fusion and integrating Mamba for the rest of the token processing (in place of the self-attention module). Directly concatenating tokens from different modalities and processing them with Mamba also works fine, but without the concept of cross-attention it may lack intuition about how modality alignment works.
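A rough sketch of that layout (a hypothetical `CrossAttnMambaBlock` of my own; `mamba_ssm.Mamba` and `torch.nn.MultiheadAttention` are real modules, but the composition and hyperparameters are illustrative only):

```python
# Hybrid block sketch: Mamba for within-sequence token processing,
# explicit cross-attention for cross-modal fusion.
import torch
import torch.nn as nn
from mamba_ssm import Mamba

class CrossAttnMambaBlock(nn.Module):
    def __init__(self, d_model=256, n_heads=8):
        super().__init__()
        self.mamba = Mamba(d_model=d_model, d_state=16, d_conv=4, expand=2)
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x, context):
        # Sequence mixing within the target modality via Mamba.
        x = x + self.mamba(self.norm1(x))
        # Cross-modal fusion: queries from x, keys/values from the other modality.
        attn_out, _ = self.cross_attn(self.norm2(x), context, context)
        return x + attn_out

# Usage sketch with placeholder tensors.
block = CrossAttnMambaBlock().to("cuda")
x = torch.randn(2, 64, 256, device="cuda")      # target-modality tokens
ctx = torch.randn(2, 196, 256, device="cuda")   # other-modality tokens
out = block(x, ctx)
```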

albertfgu commented 1 month ago

These are interesting research questions. It depends on the application and data. Many tasks nowadays don't need explicit cross-attention or encoder-decoders, and instead can leverage simpler strategies like concatenating modalities, as mentioned above.

Theoretically, softmax cross-attention doesn't have a direct analog with Mamba. However, there may be interesting ways to create efficient linear variants of cross-attention (just like Mamba or SSD can be viewed as an efficient linear version of self-attention) by leveraging the state space duality (SSD) framework. For example by choosing certain structured $L$ masks.
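To make the SSD connection slightly more concrete (my paraphrase of the Mamba-2 / SSD formulation; the cross-modal variant at the end is speculation, not an existing feature): for the scalar-$A$ case, unrolling the recurrence $h_t = A_t h_{t-1} + B_t x_t$, $y_t = C_t^\top h_t$ gives a masked matrix-mixer form

$$Y = (L \circ C B^\top)\, X, \qquad L_{ij} = A_i A_{i-1} \cdots A_{j+1} \ \ (i \ge j), \qquad L_{ij} = 0 \ \ (i < j),$$

which mirrors masked linear attention $Y = (L \circ Q K^\top) V$. A linear "cross-attention" analogue would then draw $C$ (the query side) from one sequence and $B$, $X$ (the key/value side) from another, with the structure of $L$ being the open design choice mentioned above.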

HuangChiEn commented 1 month ago

> These are interesting research questions. It depends on the application and data. Many tasks nowadays don't need explicit cross-attention or encoder-decoders, and instead can leverage simpler strategies like concatenating modalities, as mentioned above.

> Theoretically, softmax cross-attention doesn't have a direct analog with Mamba. However, there may be interesting ways to create efficient linear variants of cross-attention (just like Mamba or SSD can be viewed as an efficient linear version of self-attention) by leveraging the state space duality (SSD) framework. For example by choosing certain structured $L$ masks.

Thanks for sharing your experience, it's helpful to explore more ways of handling multi-modality interaction with the self-attention design ~ Maybe, like you said, it's about the selective mechanism (efficient linear variants of cross-attention). Cross-attention is a special case of self-attention that discards part of the attention computation (the query-key products within the source and within the target inputs are dropped).
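As a toy check of that special-case claim (my own illustration, not from the thread): self-attention over the concatenated [source; target] sequence, with the within-source and within-target query-key products masked out, reproduces plain cross-attention.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
d, Ls, Lt = 16, 5, 7
src = torch.randn(Ls, d)               # e.g. textual tokens
tgt = torch.randn(Lt, d)               # e.g. visual tokens
x = torch.cat([src, tgt], dim=0)       # concatenated sequence

q = k = v = x                          # identity projections keep the toy example small
scores = q @ k.t() / d**0.5

# Discard within-source and within-target query-key products, keep cross terms.
mask = torch.full((Ls + Lt, Ls + Lt), float("-inf"))
mask[Ls:, :Ls] = 0.0                   # target queries may attend to source keys
mask[:Ls, Ls:] = 0.0                   # source queries may attend to target keys
masked_self = F.softmax(scores + mask, dim=-1)[Ls:] @ v

# Plain cross-attention: queries from target, keys/values from source.
cross = F.softmax(tgt @ src.t() / d**0.5, dim=-1) @ src

print(torch.allclose(masked_self, cross, atol=1e-6))  # True
```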

In general, cross-attention works better than simple concatenation with self-attention: since the self-attention block does not have strong memory and selection capabilities, discarding unnecessary information helps it converge faster and more precisely.

On the other hand, Mamba provides an excellent way to select and preserve information, so in that case more information may also help make the token representations for each modality more robust. At the very least, it can degenerate to cross-attention itself, although a more rigorous proof may be needed ~