instadeepai / nucleotide-transformer

🧬 Nucleotide Transformer: Building and Evaluating Robust Foundation Models for Human Genomics
https://www.biorxiv.org/content/10.1101/2023.01.11.523679v2

Can I simply take the 'CLS' token as my sequence representation? #33

Closed: Azai-yx closed this issue 11 months ago

Azai-yx commented 1 year ago

Hello there, I am trying to extract sequence representations from the Nucleotide Transformer, specifically the 250 million multi-species model. Is there a recommended way to pool the token embeddings into a single representation per sequence, or would it be more effective to use the CLS token embedding? For context, I intend to use these representations as initial embeddings for downstream tasks. My sequence lengths vary widely, from 10 base pairs to several thousand base pairs.

Thanks in advance!

Azai-yx commented 11 months ago

I have tried the CLS token, mean pooling, and max pooling, and found that their performance was roughly the same.
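
For anyone landing here, the three strategies compared above can be sketched roughly as follows. This is a minimal NumPy sketch, not the repository's API: `pool_embeddings` is a hypothetical helper, and it assumes you already have the final-layer token embeddings as a `(batch, seq_len, hidden)` array, a padding mask with 1 for real tokens and 0 for padding, and the CLS token at position 0. Masking matters when sequence lengths vary a lot, since padded positions would otherwise distort the mean and max.

```python
import numpy as np


def pool_embeddings(embeddings, attention_mask, strategy="mean"):
    """Pool per-token embeddings into one vector per sequence.

    embeddings:     (batch, seq_len, hidden) token embeddings
    attention_mask: (batch, seq_len), 1 for real tokens, 0 for padding
    Assumption: the CLS token is at position 0 of every sequence.
    """
    embeddings = np.asarray(embeddings, dtype=np.float64)
    # Add a trailing axis so the mask broadcasts over the hidden dimension.
    mask = np.asarray(attention_mask, dtype=bool)[..., None]

    if strategy == "cls":
        # Use the embedding of the first token only.
        return embeddings[:, 0, :]
    if strategy == "mean":
        # Sum over real tokens, then divide by the number of real tokens.
        summed = np.where(mask, embeddings, 0.0).sum(axis=1)
        counts = np.maximum(mask.sum(axis=1), 1)  # guard against empty rows
        return summed / counts
    if strategy == "max":
        # Padded positions are set to -inf so they never win the max.
        return np.where(mask, embeddings, -np.inf).max(axis=1)
    raise ValueError(f"unknown strategy: {strategy!r}")


# Toy batch: 2 sequences, 3 token positions, hidden size 2.
# The second sequence has one padded position (deliberately large values).
toy_embeddings = np.array(
    [
        [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]],
        [[1.0, 1.0], [3.0, 5.0], [100.0, 100.0]],
    ]
)
toy_mask = np.array([[1, 1, 1], [1, 1, 0]])

for name in ("cls", "mean", "max"):
    print(name, pool_embeddings(toy_embeddings, toy_mask, name))
```

Note that the mean and max in this sketch include the CLS position itself; whether to exclude special tokens from pooling is a design choice worth checking against your downstream task.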