Open ekg opened 6 months ago
I'm not sure which form of deriving embeddings you're thinking of. For most ways that you could apply a Transformer, you could also apply Mamba. One exception may be if you need an explicitly bidirectional model instead of a causal one, e.g. for MLM (BERT-style) pretraining. We're working on the proper way to do this, but you could always just concatenate or add two copies of Mamba (one running in the reverse direction), just like how this used to be handled with RNNs.
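The two-copies trick above can be sketched in a few lines. This is a toy illustration, not the real Mamba block: `causal_mix` here is a hypothetical stand-in (a causal running mean) for any causal sequence mixer, and `bidirectional_mix` shows the reverse-and-combine pattern from the comment.

```python
import numpy as np

def causal_mix(x):
    # Hypothetical stand-in for a causal sequence mixer (e.g. a Mamba
    # block): position i only sees positions 0..i (a running mean here).
    return np.cumsum(x, axis=0) / np.arange(1, x.shape[0] + 1)[:, None]

def bidirectional_mix(x, combine="concat"):
    # Forward pass over the sequence as-is.
    fwd = causal_mix(x)
    # Second pass over the reversed sequence, then flip the output back
    # so position i summarizes positions i..L-1.
    bwd = causal_mix(x[::-1])[::-1]
    if combine == "concat":
        return np.concatenate([fwd, bwd], axis=-1)  # (L, 2*d)
    return fwd + bwd  # (L, d), keeps the model width unchanged

x = np.random.randn(8, 4)   # (seq_len, d_model)
y = bidirectional_mix(x)    # (8, 8) with concatenation
```

Concatenation doubles the feature dimension (as in bidirectional RNNs), while adding keeps it fixed; either way every position ends up conditioned on the full sequence.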
A bidirectional (BERT/MLM or ELECTRA/RTD) pretraining setup with Mamba would be amazing!
It's coming soon!
What would be the best way to derive embeddings from Mamba models? Is there a straightforward approach, or would we need a new architecture?
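For a causal model, the usual straightforward options are last-token pooling (the final position has attended to the whole sequence) or mean pooling over the hidden states. A minimal sketch, assuming you already have final-layer hidden states of shape `(seq_len, d_model)` from whatever model you're running (the `pool_embeddings` helper is hypothetical, not part of any Mamba API):

```python
import numpy as np

def pool_embeddings(hidden, mode="last"):
    # hidden: (seq_len, d_model) final-layer states from a causal model.
    if mode == "last":
        # Under causal masking, only the last position has seen
        # the entire sequence, so its state is a natural summary.
        return hidden[-1]
    # Mean pooling averages all positions instead.
    return hidden.mean(axis=0)

h = np.random.randn(10, 16)        # stand-in hidden states
emb = pool_embeddings(h)           # (16,) sequence embedding
```

With a bidirectional variant (two copies, one reversed), mean pooling tends to be the more natural choice, since every position then carries full-sequence context.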