snap-stanford / UCE

UCE is a zero-shot foundation model for single-cell gene expression data
MIT License

Which ESM model to use for UCE input? #13

Closed: vinettey closed this issue 9 months ago

vinettey commented 9 months ago

Hi, I’m trying to run the SATURN code to generate protein embeddings that I’ll use as input to UCE. I noticed that the SATURN code uses a smaller model (esm1b_t33_650M_UR50S). I just wanted to confirm that you’re generating UCE protein embeddings with the largest ESM2 model (https://huggingface.co/facebook/esm2_t48_15B_UR50D). I replaced the line in the SATURN code that specifies the model with the 15B-parameter ESM2 version. Is that correct? Thank you for the extra information!

Yanay1 commented 9 months ago

Yes that is correct! UCE uses the 15B ESM2 model.

There are quite a few pre-calculated ESM2 embeddings here: https://drive.google.com/drive/folders/1_Dz7HS5N3GoOAG6MdhsXWY1nwLoN13DJ which might contain the species you are interested in.

Evenlyeven commented 3 weeks ago

Thank you for providing this great tool and for the detailed information shared in this issue. I have a follow-up question regarding the ESM2 models:

Do I need to use the largest ESM2 model (esm2_t48_15B_UR50D) for convert_protein_embeddings_to_gene_embeddings.py? I am asking because during a test run with the smaller model (esm2_t33_650M_UR50D), I encountered a KeyError: 48 in the script.

Is there something specific to the smaller model that might be causing this issue?

Thanks in advance for your help!
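
A possible explanation for the KeyError, offered as a sketch rather than a confirmed diagnosis: ESM embedding files typically store representations in a dictionary keyed by layer number, and the number after `_t` in a model name is its layer count (48 for esm2_t48_15B_UR50D, 33 for esm2_t33_650M_UR50D). If the conversion script looks up layer 48 with a hard-coded key, a file produced by the 33-layer model would have no such key. The script's internals are not shown in this thread, so the dictionary layout and the helper names below (`final_layer_from_model_name`, `pick_final_representation`) are assumptions made for illustration only.

```python
import re


def final_layer_from_model_name(model_name: str) -> int:
    """Parse the layer count from an ESM-style name such as 'esm2_t33_650M_UR50D'."""
    match = re.search(r"_t(\d+)_", model_name)
    if match is None:
        raise ValueError(f"Cannot find a '_t<layers>_' field in {model_name!r}")
    return int(match.group(1))


def pick_final_representation(mean_representations: dict, model_name: str):
    """Return the final-layer embedding for the model that produced the file.

    `mean_representations` stands in for a {layer_number: embedding} mapping;
    this layout is an assumption about the saved embedding files.
    """
    layer = final_layer_from_model_name(model_name)
    if layer not in mean_representations:
        available = sorted(mean_representations)
        raise KeyError(f"Layer {layer} not found; available layers: {available}")
    return mean_representations[layer]


# Toy stand-in for embeddings saved from the 33-layer (650M) model.
toy_embeddings = {33: [0.1, 0.2, 0.3]}

# A hard-coded lookup of layer 48 fails on this file.
try:
    toy_embeddings[48]
except KeyError as err:
    print("KeyError:", err)

# Deriving the key from the model name avoids the mismatch.
print(pick_final_representation(toy_embeddings, "esm2_t33_650M_UR50D"))
```

Even with the layer key adjusted, the smaller model produces 1280-dimensional embeddings rather than the 5120 dimensions of the 15B model, so its output is unlikely to be a drop-in replacement for a UCE checkpoint that, per the reply above, was built on the 15B ESM2 embeddings.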