snap-stanford / UCE

UCE is a zero-shot foundation model for single-cell gene expression data
MIT License

Task we can address based on the embeddings #7

Closed HelloWorldLTY closed 6 months ago

HelloWorldLTY commented 6 months ago

Hi, I wonder what types of tasks we can address based on the current embeddings. It seems that in the original paper UCE has multiple functions, but I am not sure what types of tasks need fine-tuning. Thanks.

Yanay1 commented 6 months ago

The UCE model is not fine-tuned for any task; the core embedding model is always zero-shot.

Given one dataset, you can do a variety of tasks like clustering cell types or inferring hierarchies (using default scanpy functions, for example).
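To make the single-dataset case concrete, here is a minimal sketch of clustering cells directly on embeddings. The thread suggests default scanpy functions for this; the snippet below swaps in KMeans from scikit-learn as a generic stand-in, and the embedding array is synthetic — real embeddings would come from something like `adata.obsm["X_uce"]`.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Two well-separated synthetic "cell populations" standing in for UCE embeddings
emb = np.vstack([
    rng.normal(loc=0.0, size=(50, 32)),
    rng.normal(loc=5.0, size=(50, 32)),
])

# Cluster in embedding space; in a scanpy workflow this role is played by
# sc.pp.neighbors(..., use_rep="X_uce") followed by a community-detection step
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(emb)
```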

Given embeddings of multiple datasets, you can do additional tasks like transferring labels with simple models (we used the logistic regression classifier from scikit-learn), model-free transfer using nearest-neighbor or centroid-based methods, plus the other single-dataset tasks in a multi-dataset context.
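A minimal sketch of the model-free variant, transferring labels by nearest neighbor in embedding space. All arrays and names here (`ref_emb`, `query_emb`, the cell-type labels) are synthetic placeholders, not from the repo; real embeddings would come from the reference and query AnnData objects.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
# Synthetic stand-ins for UCE embeddings (real ones live in adata.obsm["X_uce"])
ref_emb = rng.normal(size=(100, 64))
ref_labels = np.array(["T cell"] * 50 + ["B cell"] * 50)
# Query cells lie very close to the first 20 reference cells
query_emb = ref_emb[:20] + rng.normal(scale=0.01, size=(20, 64))

# Model-free transfer: copy the label of the single nearest reference cell
knn = KNeighborsClassifier(n_neighbors=1).fit(ref_emb, ref_labels)
pred = knn.predict(query_emb)
```

A centroid-based variant would instead average the reference embeddings per cell type and assign each query cell the label of the closest centroid.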

There are many other possible downstream tasks. As an example, for any model that would previously have used gene expression as the representation of a cell, you can replace the gene expression with UCE embeddings, train the model, and then apply it to other UCE-embedded datasets.
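The "swap the representation" idea above can be sketched as follows, assuming the downstream model is an arbitrary scikit-learn classifier (a random forest here, purely as an example). The synthetic arrays stand in for UCE embeddings of a training dataset and a second, separately embedded dataset.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
# Two well-separated synthetic "cell populations" standing in for UCE embeddings;
# in practice the features would be adata.obsm["X_uce"] rather than gene expression
train_emb = np.vstack([
    rng.normal(loc=-3.0, size=(100, 32)),
    rng.normal(loc=3.0, size=(100, 32)),
])
train_labels = np.array([0] * 100 + [1] * 100)

# Train any downstream model on the embedding representation
clf = RandomForestClassifier(n_estimators=50, random_state=0).fit(train_emb, train_labels)

# Apply it to another UCE-embedded dataset (here: new cells from the same populations)
query_emb = np.vstack([
    rng.normal(loc=-3.0, size=(10, 32)),
    rng.normal(loc=3.0, size=(10, 32)),
])
pred = clf.predict(query_emb)
```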

For multi-dataset tasks, we released a large sample of the IMA, which you can download from here: https://github.com/snap-stanford/UCE#data

HelloWorldLTY commented 6 months ago

Hi, got it, thanks for your quick feedback.

Is it possible for me to access your code for using the logistic regression classifier to do cell type annotation? I intend to reproduce this process with an unbiased design. Thanks.

Yanay1 commented 6 months ago

Is there a specific result or dataset from the paper you want to replicate?

The method is the same as in the scikit-learn documentation, for example:

```python
from sklearn.linear_model import LogisticRegression

# For some uploaded versions of the IMA, the UCE embeddings are just in .X,
# but in others they are in .obsm["X_uce"]
X = reference_ad.obsm["X_uce"]
y = reference_ad.obs["..."]  # some metadata label, like cell type

clf = LogisticRegression(random_state=0).fit(X, y)
predicted_y = clf.predict(transfer_ad.obsm["X_uce"])
```

HelloWorldLTY commented 6 months ago

Thanks a lot. I ask this question because LogisticRegression also has its own hyperparameters.

Yanay1 commented 6 months ago

Ah, that makes sense! For all uses of logistic regression / classification, scikit-learn's default parameters were used.