instadeepai / nucleotide-transformer

🧬 Nucleotide Transformer: Building and Evaluating Robust Foundation Models for Human Genomics
https://www.biorxiv.org/content/10.1101/2023.01.11.523679v2

Inquiry Regarding Details of Section A.5.4 #34

Closed yangzhao1230 closed 8 months ago

yangzhao1230 commented 1 year ago

I am particularly intrigued by the experiments outlined in section A.5.4, which focuses on Functional Variant Prioritization. As I attempt to replicate this specific experiment, I have encountered some challenges and would greatly appreciate additional details to aid in my efforts. Specifically, I am interested in the following aspects:

  1. Embedding Extraction:

Could you please clarify from which layer of the Transformer the embeddings are extracted?

  2. Similarity Calculation:

In the calculation of similarity, is it based solely on the embeddings of tokens that have undergone mutations, or does it encompass the similarity of embeddings for the entire sequence?

  3. Binary Similarity Threshold:

What threshold value is employed for binary similarity in the two-class classification? Understanding this threshold is crucial for my replication efforts.

I have observed that the similarity between sequences with severe mutations tends to be exceptionally high (exceeding 0.999). To gain a deeper understanding and enhance the reproducibility of this experiment, I would be grateful for any additional insights or details you could provide.

JavierMenRev commented 9 months ago

Sorry for the late reply, @yangzhao1230

Regarding embedding extraction: we extracted embeddings from layers 12, 16, 21, 24, and 32. For the results shown in the figures, we used, for each score separately, the layer that gave the highest performance.

Regarding similarity calculation: we used the embeddings from the token containing the mutation.
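A minimal sketch of a token-level comparison like the one described above, not the code used for the paper. It assumes non-overlapping 6-mer tokens with a leading CLS token, a 0-based variant position, and cosine similarity as the score. The names `mutated_token_index`, `cosine_similarity`, and `variant_score` are hypothetical, and the embeddings are taken to be per-token vectors already extracted from one layer for the reference and the alternate sequence.

```python
import math

K = 6  # assumed k-mer size of the tokenizer (an assumption, adjust if needed)


def mutated_token_index(variant_pos, k=K, has_cls=True):
    """Index of the token that covers a 0-based nucleotide position.

    Assumes non-overlapping k-mer tokens and, optionally, one leading CLS token.
    """
    return variant_pos // k + (1 if has_cls else 0)


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def variant_score(ref_embeddings, alt_embeddings, variant_pos):
    """Compare only the token containing the variant, not the whole sequence.

    ref_embeddings / alt_embeddings: one vector per token, taken from the
    same layer for the reference and the alternate sequence.
    """
    idx = mutated_token_index(variant_pos)
    return cosine_similarity(ref_embeddings[idx], alt_embeddings[idx])
```

If the sequence contains characters that the tokenizer does not group into 6-mers, the simple integer division above no longer locates the right token, and the index would have to be derived from the tokenizer output instead.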

Regarding the similarity threshold (let me know if this is not what you're referring to): for our ROC analyses we used the scores as is. We didn't apply a cutoff to classify the variants.
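To illustrate why no cutoff is needed, here is a small sketch that computes ROC AUC directly from raw scores as the fraction of (positive, negative) pairs ranked correctly. Because only the ordering matters, similarities that all sit above 0.999 can still separate the two classes. The functions `roc_auc` and `best_layer` are hypothetical helpers, the numbers are made-up toy values, and using `1 - similarity` as the score (so that larger means more likely functional) is an assumption about the score direction.

```python
def roc_auc(scores, labels):
    """AUC as the probability that a positive outranks a negative.

    Ties between a positive and a negative count as half a win.
    labels: 1 for the positive class, 0 for the negative class.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def best_layer(scores_by_layer, labels):
    """Return (layer, auc) for the candidate layer with the highest AUC."""
    aucs = {layer: roc_auc(s, labels) for layer, s in scores_by_layer.items()}
    layer = max(aucs, key=aucs.get)
    return layer, aucs[layer]


# Toy values only: every similarity exceeds 0.999, yet the ranking is clean.
similarities = [0.9991, 0.9998, 0.9993, 0.9999]
labels = [1, 0, 1, 0]
scores = [1.0 - s for s in similarities]  # assumed direction of the score
auc = roc_auc(scores, labels)
```

The same `roc_auc` call can be repeated per layer, as in `best_layer`, to pick the best-performing layer for each score.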

Hope this helps.