Closed by v-mahughes 4 months ago
The logistic classifier is implemented as the default sklearn classifier, following the example on this page: https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html
I'm having trouble replicating the results in Supplementary Figure 2b. Would you mind sharing the logistic classifier and UMAP code you used to generate that figure? Were the embeddings generated with the 33-layer or the 4-layer model?
For the paper, all results used the 33-layer model.
Unfortunately, we cannot share the data for Tabula Sapiens v2 just yet.
To generate silhouette scores:
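from sklearn.metrics import silhouette_samples

def run_benchmark(ad):
    X = ad.obsm["X_embed"]
    # Convert to a NumPy array so that labels[i] is positional, not index-based.
    labels = ad.obs["cell_ontology_class"].to_numpy()
    scores = (silhouette_samples(X, labels) + 1) / 2
    # Keyed by label, so each label keeps the score of its last cell.
    return {labels[i]: scores[i] for i in range(len(scores))}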
(The +1 and /2 rescale the scores from [-1, 1] to [0, 1], in line with scIB.)
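If one score per cell type is wanted, a minimal sketch is to average the rescaled per-cell scores within each label. This is an illustration under assumptions, not the code used for the paper: the helper name `scaled_silhouette_by_label` and the synthetic two-cluster data are hypothetical stand-ins for `ad.obsm["X_embed"]` and `ad.obs["cell_ontology_class"]`.

```python
import numpy as np
from sklearn.metrics import silhouette_samples


def scaled_silhouette_by_label(X, labels):
    """Return {label: mean silhouette score rescaled to [0, 1]}."""
    labels = np.asarray(labels)
    # Per-cell silhouette scores lie in [-1, 1]; (s + 1) / 2 maps them to [0, 1].
    scores = (silhouette_samples(X, labels) + 1) / 2
    return {str(lab): float(scores[labels == lab].mean()) for lab in np.unique(labels)}


# Synthetic stand-in for an embedding matrix: two well-separated clusters.
rng = np.random.default_rng(0)
X_demo = np.vstack([
    rng.normal(0.0, 0.5, size=(50, 8)),
    rng.normal(5.0, 0.5, size=(50, 8)),
])
labels_demo = np.array(["b cell"] * 50 + ["t cell"] * 50)

per_label = scaled_silhouette_by_label(X_demo, labels_demo)
print(per_label)
```

On real data, `X_demo` and `labels_demo` would be replaced by the embedding matrix and the label column of the AnnData object.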
To generate the UMAPs:
import matplotlib.pyplot as plt
import scanpy as sc

fig, (ax1, ax2, ax3) = plt.subplots(1, 3, sharex=False, sharey=False, figsize=(12, 3), frameon=False, dpi=600)
sc.pl.umap(geneformer_ad_new, color="cell_ontology_class", groups=["b cell"], ax=ax1, show=False)
sc.pl.umap(scgpt_ad_new, color="cell_ontology_class", groups=["b cell"], ax=ax2, show=False)
sc.pl.umap(uce_ad_new, color="cell_ontology_class", groups=["b cell"], ax=ax3, show=False)
plt.savefig("figures/sfig_bcells_comp_umap.svg")
To transfer labels:
from sklearn.linear_model import LogisticRegression
X, y = ica_uce, ica_cts
clf = LogisticRegression(random_state=0).fit(X, y)
ica_uce holds the UCE embeddings for the immune cell atlas, and ica_cts holds the corresponding cell types.
tabula_pred = clf.predict(uce_b_cells.obsm["X_uce"])
uce_b_cells.obs["ica_pred_ct"] = tabula_pred
uce_b_cells is the subset of Tabula Sapiens v2 that was originally classified as B cells.
fig, ax = plt.subplots(1, 1, sharex=False, sharey=False, figsize=(4, 3), frameon=False, dpi=600)
sc.pl.umap(uce_b_cells, color="ica_pred_ct", legend_loc="on data", groups=["naive B cell", "memory B cell"], show=False, ax=ax);
plt.savefig("figures/sfig_bcell_preds.svg")
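Since Tabula Sapiens v2 is not public, the label-transfer step can be tried end to end on synthetic data first. The sketch below assumes nothing about the real embeddings: `ref_X`/`ref_y` stand in for `ica_uce`/`ica_cts`, `query_X` stands in for `uce_b_cells.obsm["X_uce"]`, and the cluster centers, dimensions, and the `sample` helper are made up for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
cell_types = ["naive B cell", "memory B cell", "plasma cell"]
# One hypothetical cluster center per cell type in a 16-dimensional space.
centers = rng.normal(0.0, 5.0, size=(len(cell_types), 16))


def sample(n_per_type):
    """Draw n_per_type synthetic embeddings around each center."""
    X = np.vstack([c + rng.normal(0.0, 1.0, size=(n_per_type, 16)) for c in centers])
    y = np.repeat(cell_types, n_per_type)
    return X, y


ref_X, ref_y = sample(100)          # reference atlas with fine-grained labels
query_X, query_true = sample(30)    # query cells whose labels we want to predict

# Same call as in the thread: default sklearn settings with a fixed seed.
clf = LogisticRegression(random_state=0).fit(ref_X, ref_y)
query_pred = clf.predict(query_X)

accuracy = float((query_pred == query_true).mean())
print(f"accuracy on synthetic query cells: {accuracy:.2f}")
```

With real data, the predictions would be stored in `.obs` as in the snippet above, and there would be no ground truth to compute an accuracy against.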
Thank you! Which observation key did you use for training the logistic regression model on the tissue immune atlas (e.g. "Manually_curated_celltype" or "Predicted_labels_Celltypist")?
The cell_type column from the dataset on CellXGene, which has 35 clusters: https://cellxgene.cziscience.com/e/1b9d8702-5af8-4142-85ed-020eb06ec4f6.cxg/
I'm not sure which of those two columns that translates to. There might have been some mapping of the original author annotations to the CellXGene cell type names when it was added to CellXGene.
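One way to see how an author column relates to cell_type is to cross-tabulate the two columns of `.obs`. The sketch below uses a small made-up DataFrame in place of the real `adata.obs`; the column names come from this thread, but the label values are hypothetical.

```python
import pandas as pd

# Toy stand-in for adata.obs; the values are invented for illustration.
obs = pd.DataFrame({
    "cell_type": [
        "naive B cell", "naive B cell",
        "memory B cell", "memory B cell",
        "plasma cell",
    ],
    "Manually_curated_celltype": [
        "Naive B cells", "Naive B cells",
        "Memory B cells", "Memory B cells",
        "Plasma cells",
    ],
})

# Rows: author labels; columns: CellXGene labels; entries: cell counts.
table = pd.crosstab(obs["Manually_curated_celltype"], obs["cell_type"])

# For each author label, the CellXGene label it most often co-occurs with.
mapping = table.idxmax(axis=1).to_dict()
print(mapping)
```

A table that is close to one nonzero entry per row suggests that the author column was the source of the mapping.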
Is the dataset at that link Tabula Sapiens v1 or v2?
Supplementary Figure 2b is on Tabula Sapiens v2, which is not public.
That link is to the cross-tissue immune atlas, which is public (https://www.science.org/doi/10.1126/science.abl5197) and was used to transfer the finer-grained B cell annotations.
You can find the UCE embeddings for the immune atlas dataset using the CellXGene API: https://cellxgene.cziscience.com/census-models
I can also upload the h5ad with embeddings, which will differ very slightly from what is on the CellXGene website.
Oh yes, right, my apologies for the confusion. Does the h5ad you have include the CellXGene 'cell_type' labels? The original dataset from the source only includes the labels listed under 'Author Categories', so I'm just trying to find out how to download those 'cell_type' labels.
If you download the dataset from https://cellxgene.cziscience.com/e/1b9d8702-5af8-4142-85ed-020eb06ec4f6.cxg/ it will have a cell_type column in the .obs!
Is it possible to provide the code or architecture for the logistic classifier you used for downstream evaluation of this model's embeddings?