treeArches on Tabula Sapiens

I am using treeArches to train a tree for a single-tissue (thymus) subset of the Tabula Sapiens dataset (https://tabula-sapiens-portal.ds.czbiohub.org/).

I followed the training steps listed here, but the scHPL.train.train_tree() function only returned a flat tree.

Additionally, I am using treeArches to integrate the Tabula Sapiens data with another published dataset, A cell atlas of human thymic development, to generate an integrated reference. Both data subsets are from the same tissue. However, the tree that is generated is incorrect and does not properly represent the relationships between known cell types.

My code for the integration of the two published datasets is below:

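import anndata as ad
import numpy as np
import scanpy as sc
import scarches as sca

# Both inputs are expected to already carry "study", "batch" and "cell_type" columns in .obs
source_adata = ad.concat([tabula_sapiens_thymus, human_thymus_epithelial])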
source_adata.raw = source_adata
sc.pp.normalize_total(source_adata)
sc.pp.log1p(source_adata)

sc.pp.highly_variable_genes(
    source_adata,
    n_top_genes=2000,
    batch_key="study",
    subset=True)

source_adata.X = source_adata.raw[:, source_adata.var_names].X

source_adata = source_adata.copy()
sca.models.SCVI.setup_anndata(source_adata, batch_key="batch")

vae = sca.models.SCVI(
    source_adata,
    n_layers=2,
    encode_covariates=True,
    deeply_inject_covariates=True,
    use_layer_norm="both",
    use_batch_norm="none",
)

vae.train(max_epochs=80)

reference_latent = sc.AnnData(vae.get_latent_representation())
reference_latent.obs["cell_type"] = source_adata.obs["cell_type"].tolist()
reference_latent.obs["batch"] = source_adata.obs["batch"].tolist()
reference_latent.obs["study"] = source_adata.obs["study"].tolist()

sc.pp.neighbors(reference_latent, n_neighbors=8)
sc.tl.leiden(reference_latent)
sc.tl.umap(reference_latent)

reference_latent.obs['celltype_batch'] = np.char.add(np.char.add(np.array(reference_latent.obs['cell_type'], dtype= str), '-'),
                                             np.array(reference_latent.obs['study'], dtype=str))

tree_ref, mp_ref = sca.classifiers.scHPL.learn_tree(data = reference_latent, 
                batch_key = 'study',
                batch_order = ['human-thymus-epi', 'tabula'],
                cell_type_key='celltype_batch',
                classifier = 'knn', dynamic_neighbors=True,
                dimred = False, print_conf= False)

Do you have any suggestions for improving the tree generated by treeArches, either for the analysis of the single dataset only or for the integration of the two? I can successfully integrate the second dataset (A cell atlas of human thymic development) with a third (unpublished) dataset, and the tree is also generated successfully in that case, so I believe the issue is likely related to running the analysis on the Tabula Sapiens dataset.

Are there any specific factors to consider when using treeArches on a large cell atlas, like Tabula Sapiens?

Issue 1: Flat tree when training on one dataset

When you only use the train_tree function, no hierarchy is learned. This function only trains the classifiers of the tree you pass in, so if you follow the GitHub issue you mentioned and input a flat tree, the output will also be a flat tree. The basic tutorial explains how you can instead input a hierarchy (e.g. based on prior knowledge) in Newick format. If you do this, make sure that at least the names of the leaf nodes correspond exactly to the cell type labels in your dataset. scHPL can only learn the hierarchy automatically when multiple datasets are used as input.
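
Because the leaf names have to match the labels exactly, it can help to compare them before training. The sketch below is a minimal, standalone check and is not part of scHPL: the helper names newick_leaves and compare_leaves, the example hierarchy, and the example labels are all hypothetical. It assumes a simple Newick string without quoted names or branch lengths.

```python
from __future__ import annotations


def newick_leaves(newick: str) -> list[str]:
    """Return the leaf names of a simple Newick string (no quotes, no branch lengths)."""
    leaves = []
    token = []
    after_close = False  # True while the current token names an internal node
    for ch in newick.strip().rstrip(";"):
        if ch in "(),":
            name = "".join(token).strip()
            if name and not after_close:
                leaves.append(name)
            token = []
            after_close = ch == ")"
        else:
            token.append(ch)
    # A trailing token is the root name unless the tree is a single leaf.
    name = "".join(token).strip()
    if name and not after_close:
        leaves.append(name)
    return leaves


def compare_leaves(newick: str, labels) -> tuple[set[str], set[str]]:
    """Return (leaves absent from the labels, labels absent from the leaves)."""
    leaves = set(newick_leaves(newick))
    observed = set(map(str, labels))
    return leaves - observed, observed - leaves


# Hypothetical hierarchy and labels, for illustration only.
example_tree = "((CD4+ T cell, CD8+ T cell)T cell, (cTEC, mTEC)Epithelial cell)root"
example_labels = ["CD4+ T cell", "CD8+ T cell", "cTEC", "mTEC ", "B cell"]

not_in_data, not_in_tree = compare_leaves(example_tree, example_labels)
print("Leaves without a matching label:", sorted(not_in_data))
print("Labels without a matching leaf:", sorted(not_in_tree))
```

In the example, the label "mTEC " carries a trailing space, so it is reported as a mismatch even though it looks identical at a glance. With real data, the labels would come from the cell type column of your AnnData object (cell_type in the code above).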

Issue 2: Incorrect tree

Did you check if the data is integrated correctly? If the integration doesn't look good, scHPL won't be able to match the cell types correctly either.
What kind of mistakes are made? Are there missing cell types or are there weird matches between cell types? Sometimes weird matches can be explained by wrong original annotations. For instance, in our original publication, we saw that cell-type labels of some populations got swapped. You could visualize marker genes for the wrongly matched cell types and see if that is the case.
If datasets 2 and 3 work well together, you could also swap the order of the datasets. Usually this does not have a big influence, but if the first dataset is the problematic one, things might improve slightly. You could also try different parameter settings (e.g. a different value of k for the kNN classifier), but if there are weird matches I doubt that this will make a difference.
scHPL apparently breaks with pandas 2.0. I have now added pandas < 2.0 as a requirement in the requirements file. If you have pandas 2.0 installed, I would suggest downgrading it.
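
For the first point above (checking whether the data is integrated correctly), a UMAP coloured by study is the usual visual check. A rough numeric complement is to ask, for every cell, what fraction of its nearest neighbours in the latent space comes from a different study. The sketch below is only an illustration under stated assumptions: neighbor_mixing and mean_by_group are hypothetical helper names and not part of scHPL or scArches, and the brute-force distance matrix is only practical on a subsample of cells.

```python
import numpy as np


def neighbor_mixing(latent, study, k=8):
    """Per cell: fraction of its k nearest neighbours that come from a different study."""
    latent = np.asarray(latent, dtype=float)
    study = np.asarray(study)
    # Pairwise squared Euclidean distances (brute force, O(n^2) memory).
    sq = np.sum(latent**2, axis=1)
    dist = sq[:, None] + sq[None, :] - 2.0 * latent @ latent.T
    np.fill_diagonal(dist, np.inf)  # a cell is not its own neighbour
    nearest = np.argsort(dist, axis=1)[:, :k]
    return np.mean(study[nearest] != study[:, None], axis=1)


def mean_by_group(values, groups):
    """Average a per-cell score within each group (e.g. each cell type)."""
    values = np.asarray(values, dtype=float)
    groups = np.asarray(groups)
    return {str(g): float(np.mean(values[groups == g])) for g in np.unique(groups)}
```

With the objects from the question, this could be called as neighbor_mixing(reference_latent.X, reference_latent.obs["study"].to_numpy(), k=8) and summarized with mean_by_group(..., reference_latent.obs["cell_type"].to_numpy()). A score near zero for a cell type that should be present in both studies hints that the integration did not align it. Cell types that exist in only one study will also score near zero, so a low value on its own is not proof of a failed integration.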

Hope this helps!
