snap-stanford / UCE

UCE is a zero-shot foundation model for single-cell gene expression data
MIT License

Reproduce Logistic Classifier results #17

Closed (v-mahughes closed this issue 4 months ago)

v-mahughes commented 5 months ago

Is it possible to provide the code or architecture for the logistic classifier you used for downstream evaluation of this model's embeddings?

Yanay1 commented 5 months ago

The logistic classifier is implemented as the default sklearn classifier, following the example on this page: https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html
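For anyone checking what "default" means in practice, here is a minimal sketch that prints the settings a classifier built this way runs with. The only argument passed is `random_state=0`, matching the snippet further down in this thread. The reported values depend on the installed scikit-learn version, which may not be the one used for the paper.

```python
from sklearn.linear_model import LogisticRegression

# Construct the estimator the same way as in the label-transfer snippet
# in this thread: no arguments other than random_state.
clf = LogisticRegression(random_state=0)

# get_params() returns the hyperparameters the estimator will use,
# including every default of the installed scikit-learn version.
params = clf.get_params()
for name in ("C", "solver", "max_iter", "tol", "fit_intercept"):
    print(f"{name} = {params[name]!r}")
```

Recent scikit-learn releases report `C = 1.0`, `solver = 'lbfgs'` and `max_iter = 100` here.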

v-mahughes commented 5 months ago

I'm having trouble replicating the results in supplementary figure 2b. Would you mind sharing the logistic classifier and UMAP code you used to generate that figure? Were the embeddings generated with the 33-layer or the 4-layer model?

Yanay1 commented 5 months ago

For the paper, all results were generated with the 33-layer model.

Unfortunately, we cannot share the data for Tabula Sapiens v2 just yet.

To generate silhouette scores:

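from sklearn.metrics import silhouette_samples

def run_benchmark(ad):
    X = ad.obsm["X_embed"]
    # Convert to a NumPy array so that labels[i] below is positional.
    labels = ad.obs["cell_ontology_class"].to_numpy()
    scores = (silhouette_samples(X, labels) + 1) / 2
    # Keyed by label, so each label keeps the score of its last cell.
    return {labels[i]: scores[i] for i in range(len(scores))}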

(The +1 and /2 rescale the silhouette scores from [-1, 1] to [0, 1], in line with scIB.)
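If you want one number per cell type, one option is to average the rescaled per-cell scores within each label. The sketch below does that on synthetic data. The averaging step and the helper name `mean_scaled_silhouette` are assumptions for illustration, not something confirmed in this thread as the paper's aggregation.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import silhouette_samples


def mean_scaled_silhouette(X, labels):
    """Mean of the [0, 1]-rescaled silhouette scores, per label (hypothetical helper)."""
    labels = np.asarray(labels)
    # Per-cell silhouette in [-1, 1], rescaled to [0, 1] as in the snippet above.
    scaled = (silhouette_samples(X, labels) + 1) / 2
    # Average the per-cell scores within each label.
    return pd.Series(scaled).groupby(labels).mean().to_dict()


# Synthetic stand-in for an embedding: two well-separated clusters.
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(0.0, 0.1, size=(20, 5)),
    rng.normal(5.0, 0.1, size=(20, 5)),
])
labels = ["a"] * 20 + ["b"] * 20

per_label = mean_scaled_silhouette(X, labels)
print(per_label)
```

With an AnnData object, the equivalent call would be `mean_scaled_silhouette(ad.obsm["X_embed"], ad.obs["cell_ontology_class"])`.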

To generate the UMAPs:

import matplotlib.pyplot as plt
import scanpy as sc

# One panel per model, highlighting only the B cells in each UMAP.
fig, (ax1, ax2, ax3) = plt.subplots(1, 3, sharex=False, sharey=False, figsize=(12, 3), frameon=False, dpi=600)
sc.pl.umap(geneformer_ad_new, color="cell_ontology_class", groups=["b cell"], ax=ax1, show=False)
sc.pl.umap(scgpt_ad_new, color="cell_ontology_class", groups=["b cell"], ax=ax2, show=False)
sc.pl.umap(uce_ad_new, color="cell_ontology_class", groups=["b cell"], ax=ax3, show=False)
plt.savefig("figures/sfig_bcells_comp_umap.svg")

To transfer labels:

from sklearn.linear_model import LogisticRegression
X, y = ica_uce, ica_cts
clf = LogisticRegression(random_state=0).fit(X, y)

ica_uce holds the UCE embeddings for the immune cell atlas, and ica_cts holds the corresponding cell types.

tabula_pred = clf.predict(uce_b_cells.obsm["X_uce"])
uce_b_cells.obs["ica_pred_ct"] = tabula_pred

uce_b_cells is the subset of tabula sapiens v2 that was originally classified as b cells.
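Since Tabula Sapiens v2 is not public, the transfer step can be tried end to end on synthetic data first. This is a minimal sketch under stated assumptions: the arrays `ref_X`, `ref_y` and `query_X` are made-up stand-ins for `ica_uce`, `ica_cts` and `uce_b_cells.obsm["X_uce"]`, and the extra `predict_proba` line is an addition for inspecting confidence, not part of the original analysis.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Synthetic stand-ins (assumptions): 8-dimensional "embeddings" drawn
# around a different center for each of two hypothetical cell types.
centers = {"naive B cell": 0.0, "memory B cell": 3.0}
ref_X = np.vstack([rng.normal(c, 0.5, size=(50, 8)) for c in centers.values()])
ref_y = np.repeat(list(centers), 50)
query_X = np.vstack([rng.normal(c, 0.5, size=(10, 8)) for c in centers.values()])

# Fit on the reference atlas, then predict labels for the query cells.
clf = LogisticRegression(random_state=0).fit(ref_X, ref_y)
query_pred = clf.predict(query_X)

# Highest predicted class probability per query cell, useful for
# spotting low-confidence transfers.
query_conf = clf.predict_proba(query_X).max(axis=1)

print(query_pred[:3], query_conf[:3])
```

On real data, the predictions would be stored in `.obs`, as in the two lines above.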

fig, ax = plt.subplots(1, 1, sharex=False, sharey=False, figsize=(4, 3), frameon=False, dpi=600)
sc.pl.umap(uce_b_cells, color="ica_pred_ct", legend_loc="on data", groups=["naive B cell", "memory B cell"], show=False, ax=ax)
plt.savefig("figures/sfig_bcell_preds.svg")

v-mahughes commented 5 months ago

Thank you! Which observation key did you use for training the logistic regression model on the tissue immune atlas (e.g. "Manually_curated_celltype" or "Predicted_labels_Celltypist")?

Yanay1 commented 5 months ago

The cell_type column from the dataset on cellxgene, which has 35 clusters. https://cellxgene.cziscience.com/e/1b9d8702-5af8-4142-85ed-020eb06ec4f6.cxg/

I'm not sure which of those two columns that translates to. The original author annotations may have been mapped to the CellXGene cell type names when the dataset was added to CellXGene.

v-mahughes commented 5 months ago

Is the dataset at that link Tabula Sapiens v1 or v2?

Yanay1 commented 5 months ago

Supplementary figure 2b is on Tabula Sapiens v2, which is not public.

That link is to the cross-tissue immune atlas, which is public (https://www.science.org/doi/10.1126/science.abl5197) and was used to transfer the more fine-grained B cell annotations.

You can find the UCE embeddings for the immune atlas dataset using the CellXGene API: https://cellxgene.cziscience.com/census-models

I can also upload the h5ad with embeddings, which will differ very slightly from what is on the CellXGene website.

v-mahughes commented 5 months ago

Oh yes, right, my apologies for the confusion. Does the h5ad you have include the CellXGene 'cell_type' labels? The original dataset from the source only includes the labels listed under 'Author Categories', so I'm just trying to find out how to download those 'cell_type' labels.

Yanay1 commented 5 months ago

If you download the dataset from https://cellxgene.cziscience.com/e/1b9d8702-5af8-4142-85ed-020eb06ec4f6.cxg/ it will have a cell_type column in the .obs!