snap-stanford / UCE

UCE is a zero-shot foundation model for single-cell gene expression data
MIT License

About New Species #20

Closed. zhonghuaxiaodangjiayo closed this issue 7 months ago

zhonghuaxiaodangjiayo commented 7 months ago

Hello! This is very exciting work. I have a question: I noticed that your paper includes cell embeddings for green monkey, but green monkey is not among your training species, and I could not find ESM2 protein embeddings for that species. Did you generate ESM2 protein embeddings for green monkey yourselves, or did you set the species to human?

Yanay1 commented 7 months ago

Hi!

We included green monkey (as well as naked mole rat and chicken) as examples of UCE's ability to embed new species that were not in the training data.

To embed a new species not in the training data, please follow the example in the vignette here: https://github.com/snap-stanford/UCE/blob/main/data_proc/Create%20New%20Species%20Files.ipynb

You can find ESM2 protein embeddings, including that of Green Monkey (the genome/proteome used is named 'Chlorocebus sabaeus') here: https://drive.google.com/drive/folders/1_Dz7HS5N3GoOAG6MdhsXWY1nwLoN13DJ

If there is a species not included there that you would like protein embeddings for, I'd be happy to create and upload them. If you would rather create new protein embeddings yourself, you can follow the instructions here: https://github.com/snap-stanford/SATURN/blob/main/protein_embeddings/Generate%20Protein%20Embeddings.ipynb
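For readers preparing their own proteome, one step that usually comes before attaching embeddings to genes is working out which protein IDs belong to which gene. The sketch below is illustrative only and is not taken from either notebook: the function name `group_proteins_by_gene` is hypothetical, and it assumes Ensembl-style peptide FASTA headers that carry a `gene_symbol:` field. The toy headers are made up.

```python
import re
from collections import defaultdict

# Assumption: headers follow the Ensembl peptide FASTA layout, e.g.
# >PROTEIN_ID pep ... gene:GENE_ID ... gene_symbol:SYMBOL description:...
_GENE_SYMBOL = re.compile(r"\bgene_symbol:(\S+)")


def group_proteins_by_gene(fasta_lines):
    """Return {gene symbol: [protein IDs]} built from FASTA header lines."""
    genes = defaultdict(list)
    for line in fasta_lines:
        if not line.startswith(">"):
            continue  # sequence line, not a header
        fields = line[1:].split()
        if not fields:
            continue  # empty header, nothing to record
        match = _GENE_SYMBOL.search(line)
        if match is None:
            continue  # no gene symbol annotated; skipped in this sketch
        genes[match.group(1)].append(fields[0])
    return dict(genes)


# Toy input (hypothetical IDs, not real Ensembl records).
example = [
    ">PROT1.1 pep primary_assembly:X:1:10:1 gene:GENE1.1 gene_symbol:abc1 description:toy",
    "MKT",
    ">PROT2.1 pep primary_assembly:X:20:30:1 gene:GENE1.1 gene_symbol:abc1",
    "MKV",
    ">PROT3.1 pep primary_assembly:X:40:50:1 gene:GENE2.1",
    "MAA",
]
print(group_proteins_by_gene(example))
```

Proteins without a gene symbol are simply dropped here; a real pipeline may prefer to fall back to the gene ID, so treat that choice as an assumption of this sketch.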

zhonghuaxiaodangjiayo commented 7 months ago

Thanks for the reply!!!

We are interested in how Petromyzontidae compare to other species and would like to use UCE to get their cell embeddings. We'd appreciate it if you could upload protein embeddings for Petromyzontidae!

Another question: I noticed that the protein embeddings UCE already provides were generated with the ESM2 model, and that SATURN is another tool that provides protein embeddings. Would it be inappropriate to do cross-species analyses using a protein space produced by a different model? Have you tested this?

Thx!!!

Yanay1 commented 7 months ago

SATURN and UCE both use the same ESM2 protein embeddings as an input.

Do you have a specific species or genome of Petromyzontidae in mind? Would this one work: https://useast.ensembl.org/Petromyzon_marinus/Info/Index ?

zhonghuaxiaodangjiayo commented 7 months ago

The URL you provided is fine.

Once again, thank you sincerely for your work!