snap-stanford / UCE

UCE is a zero-shot foundation model for single-cell gene expression data
MIT License

Mismatch error when running UCE on ESM embedding of a new species. #16

Closed vinettey closed 8 months ago

vinettey commented 8 months ago

Hi! I encountered a mismatch error when running UCE on the ESM embeddings of a new species:

RuntimeError: Error(s) in loading state_dict for TransformerModel: size mismatch for pe_embedding.weight: copying a param with shape torch.Size([145469, 5120]) from checkpoint, the shape in current model is torch.Size([19910, 5120]).

I generated the protein embeddings with the ESM2 model using the following code:

Screen Shot 2024-01-21 at 8 41 31 AM

Could you please help me check what might have gone wrong here? Thanks!

Yanay1 commented 8 months ago

Are you using the most up-to-date version of the repo? Were you able to walk through the notebook on embedding new species?

vinettey commented 8 months ago

Yes! I was able to walk through the notebook and generate the files.

Yanay1 commented 8 months ago

What command are you using to launch UCE? Could you also please double-check that you are on the most recent version of the repo? Thanks!

vinettey commented 8 months ago

This is the command to launch UCE.

Screen Shot 2024-01-22 at 10 19 25 AM

Yanay1 commented 8 months ago

Can you upload a screenshot of the full error you get when you try to run that? Thanks!

vinettey commented 8 months ago

This is the error message output by running UCE.

Screen Shot 2024-01-23 at 3 43 13 PM

Yanay1 commented 8 months ago

Are you sure you are using the most recent version of the repo? Did you modify the model files at all?

In the current code we added:

 empty_pe = torch.zeros(145469, 5120)
 empty_pe.requires_grad = False
 model.pe_embedding = nn.Embedding.from_pretrained(empty_pe)
 model.load_state_dict(torch.load(args.model_loc, map_location="cpu"),
                        strict=True)

So I'm not sure how there can be a mismatch to the model there. Maybe try redownloading the model?
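
One way to narrow down a mismatch like this is to compare the shapes stored in the checkpoint with the shapes in the freshly built model before calling load_state_dict. The following is a minimal sketch and not part of the UCE repo: find_shape_mismatches is a hypothetical helper that assumes only that each value exposes a .shape attribute, as the tensors returned by torch.load(...) and model.state_dict() do. The stand-in objects reuse the two shapes quoted in the error message above.

```python
from types import SimpleNamespace


def find_shape_mismatches(checkpoint_state, model_state):
    """Return {name: (checkpoint_shape, model_shape)} for keys present in
    both mappings whose shapes differ. Values only need a `.shape` attribute."""
    mismatches = {}
    for name, ckpt_value in checkpoint_state.items():
        if name not in model_state:
            continue  # missing or unexpected keys are a separate problem
        ckpt_shape = tuple(ckpt_value.shape)
        model_shape = tuple(model_state[name].shape)
        if ckpt_shape != model_shape:
            mismatches[name] = (ckpt_shape, model_shape)
    return mismatches


# Hypothetical stand-ins that mimic the shapes from the error in this issue.
checkpoint = {"pe_embedding.weight": SimpleNamespace(shape=(145469, 5120))}
model = {"pe_embedding.weight": SimpleNamespace(shape=(19910, 5120))}
print(find_shape_mismatches(checkpoint, model))
```

With real objects, the call would presumably look like find_shape_mismatches(torch.load(args.model_loc, map_location="cpu"), model.state_dict()). An empty result means every shared parameter has the same shape on both sides.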

vinettey commented 8 months ago

Hi! Updating to the latest version of the GitHub repo solved the dimension mismatch, but I now get another error on the embeddings generated for a new species.

Screen Shot 2024-01-26 at 3 30 11 PM

Yanay1 commented 8 months ago

Please see the response here: https://github.com/snap-stanford/UCE/issues/18#issuecomment-1910796722

This error happens when there is a cell with 0 genes expressed.
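
A quick way to test for this condition is to count the nonzero entries in each row of the expression matrix. The sketch below is an illustration and not UCE code: cells_with_no_expression is a hypothetical helper, and it assumes a dense cells x genes array (a sparse adata.X would need to be converted first, for example with .toarray()).

```python
import numpy as np


def cells_with_no_expression(counts):
    """Return the row indices (cells) of a dense cells x genes matrix
    that have no nonzero entries."""
    counts = np.asarray(counts)
    genes_per_cell = np.count_nonzero(counts, axis=1)
    return np.flatnonzero(genes_per_cell == 0)


# Toy matrix: the second cell (row index 1) expresses no genes.
toy = np.array([[0, 3, 1],
                [0, 0, 0],
                [2, 0, 0]])
print(cells_with_no_expression(toy))
```

If genes are dropped when they are matched against the protein embeddings, a cell could lose all of its expressed genes at that stage, so it may be worth running the same check on the matched subset of genes as well.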

vinettey commented 8 months ago

Hi! I checked my gene x cell matrix and there are no cells with 0 genes. Could you share the exact code for generating ESM protein embeddings for a new species? I want to make sure there's no mismatch in gene names. Thanks!

Yanay1 commented 8 months ago

If you look at the UCE output in the terminal when it processes the dataset, it prints the number of genes matched.

You can also call

torch.load("path to the protein embedding dataset")

to load the protein embedding dataset, which will list the gene names that are filtered.
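
Once the embedding file is loaded, its gene names can be compared with those in the dataset. The sketch below is a hedged illustration: it assumes the loaded object behaves like a dict keyed by gene name, which this thread does not confirm. summarize_gene_overlap and the toy gene names are hypothetical.

```python
def summarize_gene_overlap(embedding_genes, dataset_genes):
    """Split dataset gene names into those found in the embedding gene
    names and those that are not (exact, case-sensitive comparison)."""
    embedding_set = set(embedding_genes)
    matched = [g for g in dataset_genes if g in embedding_set]
    unmatched = [g for g in dataset_genes if g not in embedding_set]
    return matched, unmatched


# Hypothetical stand-in for the object returned by torch.load(...):
# a mapping from gene name to embedding vector.
protein_embeddings = {"GENE_A": [0.1, 0.2], "GENE_B": [0.3, 0.4]}
dataset_gene_names = ["GENE_A", "gene_b", "GENE_C"]

matched, unmatched = summarize_gene_overlap(
    protein_embeddings.keys(), dataset_gene_names
)
print(f"{len(matched)} of {len(dataset_gene_names)} genes matched")
print("unmatched:", unmatched)
```

The comparison is deliberately exact, so differences in case or naming convention (here gene_b versus GENE_B) show up as unmatched genes instead of being hidden.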

vinettey commented 8 months ago

Thanks for the suggestion! I found the cause of the problem. I had used the wrong adata file in the first place (X contained scaled data, not counts), and the intermediate file was not regenerated after I corrected the adata file. Thank you again for the help!
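
For anyone hitting the same problem, a rough check of whether a matrix holds raw counts is to test that every value is non-negative and integer-valued. The following is only a heuristic sketch under those assumptions: looks_like_raw_counts is a hypothetical helper, it is not part of UCE, and it expects a dense array.

```python
import numpy as np


def looks_like_raw_counts(matrix):
    """Heuristic: raw counts are non-negative and integer-valued.
    Scaled or normalized data usually fails at least one of the two tests."""
    values = np.asarray(matrix, dtype=float)
    return bool(np.all(values >= 0) and np.all(values == np.round(values)))


# Toy examples: integer counts versus centered and scaled values.
raw = np.array([[0, 3, 1], [2, 0, 5]])
scaled = np.array([[-0.7, 1.2, 0.1], [0.7, -1.2, -0.1]])
print(looks_like_raw_counts(raw), looks_like_raw_counts(scaled))
```

After correcting the input, any intermediate files produced from the old matrix would also need to be regenerated, as noted above.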