lcmmichielsen / scHPL

treeArches on Tabula Sapiens #10

Open sarah-chapin opened 9 months ago

sarah-chapin commented 9 months ago

I am using treeArches to train a tree for a single-tissue (thymus) subset of the Tabula Sapiens dataset (https://tabula-sapiens-portal.ds.czbiohub.org/).

I followed the training steps listed here, but the scHPL.train.train_tree() function only returned a flat tree.

Additionally, I am using treeArches to integrate the Tabula Sapiens data with another published dataset, A cell atlas of human thymic development, to generate an integrated reference. Both data subsets are from the same tissue. However, the tree that is generated is incorrect and does not properly represent the relationships between known cell types.

My code for the integration of the two published datasets is below:

import anndata as ad
import numpy as np
import scanpy as sc
import scarches as sca

source_adata = ad.concat([tabula_sapiens_thymus, human_thymus_epithelial])
source_adata.raw = source_adata
sc.pp.normalize_total(source_adata)
sc.pp.log1p(source_adata)

sc.pp.highly_variable_genes(
    source_adata,
    n_top_genes=2000,
    batch_key="study",
    subset=True)

source_adata.X = source_adata.raw[:, source_adata.var_names].X

source_adata = source_adata.copy()
sca.models.SCVI.setup_anndata(source_adata, batch_key="batch")

vae = sca.models.SCVI(
    source_adata,
    n_layers=2,
    encode_covariates=True,
    deeply_inject_covariates=True,
    use_layer_norm="both",
    use_batch_norm="none",
)

vae.train(max_epochs=80)

reference_latent = sc.AnnData(vae.get_latent_representation())
reference_latent.obs["cell_type"] = source_adata.obs["cell_type"].tolist()
reference_latent.obs["batch"] = source_adata.obs["batch"].tolist()
reference_latent.obs["study"] = source_adata.obs["study"].tolist()

sc.pp.neighbors(reference_latent, n_neighbors=8)
sc.tl.leiden(reference_latent)
sc.tl.umap(reference_latent)

reference_latent.obs['celltype_batch'] = np.char.add(
    np.char.add(np.array(reference_latent.obs['cell_type'], dtype=str), '-'),
    np.array(reference_latent.obs['study'], dtype=str))

tree_ref, mp_ref = sca.classifiers.scHPL.learn_tree(
    data=reference_latent,
    batch_key='study',
    batch_order=['human-thymus-epi', 'tabula'],
    cell_type_key='celltype_batch',
    classifier='knn',
    dynamic_neighbors=True,
    dimred=False,
    print_conf=False)

Do you have any suggestions for improving the tree generated by treeArches, either for the analysis of the single dataset only or for the integration of the two? I am able to successfully integrate the second dataset (A cell atlas of human thymic development) with a third (unpublished) dataset, and the tree is generated successfully in that case, so I believe the issue is likely related to running the analysis on the Tabula Sapiens dataset.

Are there any specific factors to consider when using treeArches on a large cell atlas, like Tabula Sapiens?

lcmmichielsen commented 9 months ago

Issue 1: Flat tree when training on one dataset

When you only use the train_tree function, no hierarchy is learned. This function only trains the classifiers of the tree you pass in. So when you follow the GitHub issue you mentioned and input a flat tree, the output will also be a flat tree. In the basic tutorial we explain how you can instead input a hierarchy (e.g. based on prior knowledge) using the Newick format. If you do this, you have to make sure that at least the names of the leaf nodes correspond exactly to the cell type labels in your dataset. scHPL can only learn the hierarchy automatically when multiple datasets are used as input.
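Because the leaf names have to match the labels exactly, it can help to check a hand-written Newick string against the labels before training. The sketch below is only an illustration and is not part of scHPL: the helper names `newick_leaves` and `check_leaves`, the example hierarchy, and the thymus labels are all hypothetical, and the parser assumes a simple Newick string without branch lengths or quoted names.

```python
import re


def newick_leaves(newick):
    """Return the leaf names of a simple Newick string.

    A name that directly follows ')' labels an internal node;
    every other name is a leaf.
    """
    tokens = re.findall(r"[(),;]|[^(),;]+", newick)
    leaves = []
    prev = None
    for tok in tokens:
        if tok in ("(", ")", ",", ";"):
            prev = tok
            continue
        name = tok.strip()
        if not name:
            continue
        if prev != ")":
            leaves.append(name)
        prev = "name"
    return leaves


def check_leaves(newick, labels):
    """Compare the leaf names with the cell type labels (exact string match).

    Returns (leaves absent from the labels, labels absent from the leaves).
    """
    leaves = set(newick_leaves(newick))
    present = set(labels)
    return sorted(leaves - present), sorted(present - leaves)


# Hypothetical hierarchy and labels, for illustration only.
newick = "((CD4 T,CD8 T)T cell,(cTEC,mTEC)TEC)root"
labels = ["CD4 T", "CD8 T", "cTEC", "mTEC "]  # note the trailing space

missing_in_data, missing_in_tree = check_leaves(newick, labels)
print("Leaves without a matching label:", missing_in_data)
print("Labels without a matching leaf:", missing_in_tree)
```

In practice `labels` would be something like the unique values of `source_adata.obs["cell_type"]`. Both lists should be empty before the Newick string is turned into a tree and passed to train_tree.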

Issue 2: Incorrect tree

Hope this helps!