Genes without protein LLM embeddings

bschilder commented 1 month ago

Hello @Yanay1 @yhr91 !

I was wondering how UCE handles genes that do not have a corresponding protein embedding (e.g., non-coding genes, or genes whose symbols don't match the provided protein embedding reference IDs). Does it simply drop these genes during preprocessing of the anndata object?

Under normal circumstances, dropping these genes may not be much of an issue, since plenty of other expressed genes remain for inference. But my case is a bit unusual: some of my "cells" (they're actually vectors of gene associations for disease traits) have only 1 gene. So if that 1 gene gets dropped, the whole trait gets dropped. I'm also curious whether you have any other concerns about trying to embed 1-gene "cells" with UCE to begin with.

Is there a way to retain these genes that wouldn't introduce too much bias between genes with protein embeddings and those without? Or is this something that's simply too deeply ingrained within UCE?

Thanks so much!
Brian
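To make the concern above concrete, here is a minimal sketch of how one might check ahead of time which genes lack a protein embedding and which traits would lose every gene if those genes were dropped. It does not use UCE's code: the function names, the `embedded_symbols` set, and the toy trait dictionary are all hypothetical stand-ins for whatever embedding reference IDs and trait-to-gene mapping are actually in use.

```python
def split_by_embedding(gene_symbols, embedded_symbols):
    """Split gene symbols into those with and without a protein embedding.

    `embedded_symbols` is a hypothetical collection of symbols that have an
    embedding; it stands in for the real embedding reference IDs.
    """
    embedded = set(embedded_symbols)
    kept = [g for g in gene_symbols if g in embedded]
    missing = [g for g in gene_symbols if g not in embedded]
    return kept, missing


def traits_losing_all_genes(trait_to_genes, embedded_symbols):
    """Return traits whose genes would ALL be dropped for lack of an embedding."""
    embedded = set(embedded_symbols)
    return sorted(
        trait
        for trait, genes in trait_to_genes.items()
        if not any(g in embedded for g in genes)
    )


# Toy data (hypothetical): MALAT1 and XIST are non-coding, so they have no
# protein embedding in this example.
embedded_symbols = {"BRCA1", "TP53", "APOE"}
kept, missing = split_by_embedding(["BRCA1", "MALAT1", "TP53", "XIST"], embedded_symbols)

trait_to_genes = {
    "trait_a": ["MALAT1"],        # single non-coding gene: would vanish entirely
    "trait_b": ["APOE", "XIST"],  # keeps APOE
    "trait_c": ["TP53"],          # single coding gene: survives
}
lost = traits_losing_all_genes(trait_to_genes, embedded_symbols)
```

Running a check like this before embedding shows how many 1-gene traits are at risk, independent of what the preprocessing step does internally.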

Yanay1 commented 1 month ago

Hi Brian,

Unfortunately, we are currently limited to protein-coding genes that have a reference amino acid (AA) sequence, so that we can use a protein embedding for them (like in SATURN).

Even so, I don't think the model would be able to give a meaningful embedding of a "cell" with just one gene in it: that gene would be repeated 1024 times in the sample, and the end result may be very weird.
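The repetition described above can be illustrated with a small sketch. This is not UCE's actual sampling code. It assumes, for illustration only, that a cell is represented by drawing a fixed number of genes with replacement, weighted by expression. The function name `sample_genes`, the seed, and the toy expression values are hypothetical; only the sample size of 1024 comes from the comment.

```python
import random


def sample_genes(expression, sample_size=1024, seed=0):
    """Draw `sample_size` genes with replacement, weighted by expression.

    Assumption: this mimics an expression-weighted sampling scheme in spirit
    only; it is not taken from the UCE codebase.
    """
    rng = random.Random(seed)
    # Only genes with nonzero expression can be drawn.
    genes = [g for g, x in expression.items() if x > 0]
    weights = [expression[g] for g in genes]
    return rng.choices(genes, weights=weights, k=sample_size)


# A "cell" with a single expressed gene (toy values).
one_gene_cell = {"APOE": 3.0, "TP53": 0.0, "BRCA1": 0.0}
sample = sample_genes(one_gene_cell)

# Every one of the 1024 draws is the same gene, so the input sequence
# carries no information beyond the identity of that one gene.
distinct_genes = set(sample)
```

Under this assumption, the model input for a 1-gene "cell" is the same token repeated for the whole sample, which is why the resulting embedding may not be meaningful.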

bschilder commented 1 month ago

Makes sense, thanks! It seems like UCE may be dropping phenotypes with few genes anyway. I'm inferring this from the fact that my input anndata object has the shape 22773 × 18826, and the output object has the shape 11963 × 18160 (even when I set --filter False).
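For reference, the size of the drop implied by those two shapes can be worked out as below. This is a small sketch assuming the usual AnnData layout of observations × variables (here, traits × genes); the helper name `summarize_drop` is hypothetical.

```python
def summarize_drop(shape_in, shape_out):
    """Compare (n_obs, n_var) shapes before and after processing."""
    (n_obs_in, n_var_in), (n_obs_out, n_var_out) = shape_in, shape_out
    obs_dropped = n_obs_in - n_obs_out
    var_dropped = n_var_in - n_var_out
    return {
        "obs_dropped": obs_dropped,
        "var_dropped": var_dropped,
        "obs_fraction_dropped": obs_dropped / n_obs_in,
        "var_fraction_dropped": var_dropped / n_var_in,
    }


# Shapes reported in the comment above.
summary = summarize_drop((22773, 18826), (11963, 18160))
# obs_dropped = 10810 (about 47.5% of traits); var_dropped = 666 (about 3.5% of genes)
```

So roughly half of the traits disappear while only a small fraction of genes do, which would be consistent with many traits depending on very few genes, though the shapes alone do not show why each row was removed.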

snap-stanford / UCE

Genes without protein LLM embeddings #39