GlancerZ opened this issue 3 months ago
Here are a couple of strategies:

I've had good results by simply placing the CLS token at the end of the sequence and using only that token's embedding. I've had even better results by placing multiple CLS tokens within one sequence to extract multiple sub-sequence embeddings that preserve some relations between sub-sequences; in my case, each sub-sequence embedding may depend on its own content plus information from previous sub-sequences. If you need to capture full cross-sub-sequence relations, you'd have to feed the entire sequence into the model twice: the first pass lets Mamba learn what's in the sequence, and the second pass is where you collect your results. To work with Mamba properly you need to keep the sequential nature of this model in mind, which I consider one of its most powerful attributes. A sketch of the end-of-sequence variant is below.
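Here's a minimal sketch of the first strategy (CLS at the end), assuming any causal sequence model with a `(B, L, D) -> (B, L, D)` interface, such as a Mamba block. The names `ClsAtEnd`, `seq_model`, and `cls_token` are illustrative, not from any particular codebase:

```python
# Sketch: append a learned CLS token at the END of the sequence and read the
# sentence embedding from that position. Works with any causal model whose
# forward maps (B, L, D) -> (B, L, D), e.g. a Mamba block.
import torch
import torch.nn as nn

class ClsAtEnd(nn.Module):
    def __init__(self, seq_model: nn.Module, d_model: int):
        super().__init__()
        self.seq_model = seq_model                        # (B, L, D) -> (B, L, D)
        self.cls_token = nn.Parameter(torch.randn(1, 1, d_model))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, L, D) token embeddings
        cls = self.cls_token.expand(x.size(0), -1, -1)    # (B, 1, D)
        x = torch.cat([x, cls], dim=1)                    # CLS goes LAST: (B, L+1, D)
        h = self.seq_model(x)                             # (B, L+1, D)
        return h[:, -1]                                   # embedding at the CLS position
```

The multi-CLS variant is the same idea: insert the learned token after each sub-sequence and gather the hidden states at those positions instead of only the last one.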
Compared with BERT, where the CLS token attends directly to every word token and is used to extract a whole-sentence embedding, is Mamba's method of placing a CLS token actually effective? My intuition is that a CLS token in Mamba cannot directly interact with each word's token the way BERT's CLS does, so its effectiveness might be poor. Would extracting the last hidden state be more effective instead? Thanks!
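For comparison, here is a hedged sketch of the last-hidden-state alternative raised in the question. With right-padded batches you have to index the last real token of each sequence, so a `lengths` tensor of true sequence lengths is an assumed extra input:

```python
# Sketch: pool a sentence embedding from the last *valid* hidden state of each
# sequence, rather than from an appended CLS token.
import torch

def last_hidden_state(h: torch.Tensor, lengths: torch.Tensor) -> torch.Tensor:
    # h: (B, L, D) hidden states from the model
    # lengths: (B,) long tensor of valid (unpadded) sequence lengths
    idx = (lengths - 1).clamp(min=0)                      # last valid position per sequence
    return h[torch.arange(h.size(0)), idx]                # (B, D) sentence embeddings
```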