cellhint: the ceiling for single-cell integration analysis

By 生信随笔

I first came across the cellhint integration algorithm as a preprint: https://www.biorxiv.org/content/10.1101/2023.05.01.538994v2. The corresponding author is Sarah A. Teichmann. If you are not familiar with her, see the post 【生信进阶之路】盘点生物信息大牛课题组 EP16：Sarah A. Teichmann.

Algorithms developed or co-developed by the Teichmann group include BBKNN, CellTypist, and CellPhoneDB.

As a co-founder and principal leader of the Human Cell Atlas (HCA) international consortium, Teichmann leads a group whose focus is less on developing algorithms than on charting single-cell atlases of human tissues, including:

Heart [Cells of the adult human heart]: DOI: 10.1038/s41586-020-2797-4
Heart [Spatially resolved multiomics of human cardiac niches]: DOI: 10.1038/s41586-023-06311-1
Lung [A cellular census of human lungs identifies novel cell states in health and in asthma]: DOI: 10.1038/s41591-019-0468-5
Lung [A spatially resolved atlas of the human lung characterizes a gland-associated immune niche]: DOI: 10.1038/s41588-022-01243-4
Whole body [Cross-tissue immune cell analysis reveals tissue-specific features in humans]: DOI: 10.1126/science.abl5197
Whole body [Automatic cell-type harmonization and integration across Human Cell Atlas datasets]: DOI: 10.1016/j.cell.2023.11.026

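Nearly all of these appeared in Cell, Nature, or Science, or in their high-profile sister journals. For more, see her Google Scholar profile: https://scholar.google.co.uk/citations?user=ZMEr7wIAAAAJ&hl=en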

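The cellhint integration workflow itself was recently published in Cell (somewhat to my surprise), and I think it may well be the ceiling for integration algorithms.

I have previously written a series of posts on single-cell integration algorithms.

Below, we walk through the workflow with hands-on code.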

1. Environment setup

cellhint is a Python package, and setting up the environment is straightforward:

conda create -n cellhint python  # include python so the new env gets its own interpreter and pip
conda activate cellhint
pip install cellhint
pip install celltypist  # also install celltypist

To use the environment from Jupyter, install the corresponding packages and register the kernel:

conda install -y nb_conda_kernels ipykernel
python -m ipykernel install --user --name cellhint --display-name cellhint
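
Before moving on, it can help to confirm that the packages actually landed in the active environment. The sketch below uses only the Python standard library; the helper name `installed_versions` is my own (hypothetical) and is not part of cellhint or celltypist.

```python
from importlib import metadata


def installed_versions(packages):
    """Map each distribution name to its installed version, or None if it is missing."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    # Run this inside the activated 'cellhint' environment.
    for name, version in installed_versions(["cellhint", "celltypist", "scanpy"]).items():
        print(f"{name}: {version or 'NOT INSTALLED'}")
```

If any package is reported as not installed, the most likely cause is that pip ran outside the activated environment.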

2. Running the demo data

import scanpy as sc
import cellhint

The demo data contain more than 200,000 single cells from spleen tissue, drawn from four different datasets:

adata = sc.read('cellhint_demo_folder/Spleen.h5ad', backup_url = 'https://celltypist.cog.sanger.ac.uk/Resources/Organ_atlas/Spleen/Spleen.h5ad')
adata

 AnnData object with n_obs × n_vars = 200664 × 74369
    obs: 'Dataset', 'donor_id', 'development_stage', 'sex', 'suspension_type', 'assay', 'Original_annotation', 'CellHint_harmonised_group', 'cell_type', 'Curated_annotation', 'organism', 'disease', 'tissue'
    var: 'exist_in_Madissoon2020', 'exist_in_Tabula2022', 'exist_in_DominguezConde2022', 'exist_in_He2020'
    uns: 'schema_version', 'title'
    obsm: 'X_umap'

adata.obs.Dataset.value_counts()
#Madissoon et al. 2020          92049
#Dominguez Conde et al. 2022    70099
#Tabula Sapiens 2022            34004
#He et al. 2020                  4512
#Name: Dataset, dtype: int64

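The log-normalized gene expression (normalized to 10,000 counts per cell) is stored in .X, and the raw counts are stored in .raw. CellHint does not depend on the raw counts, but to keep the single-cell analysis complete we still start from them: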

adata = adata.raw.to_adata()

del adata.var
del adata.uns
del adata.obsm
adata

 AnnData object with n_obs × n_vars = 200664 × 74369
    obs: 'Dataset', 'donor_id', 'development_stage', 'sex', 'suspension_type', 'assay', 'Original_annotation', 'CellHint_harmonised_group', 'cell_type', 'Curated_annotation', 'organism', 'disease', 'tissue'

This adata is the kind of bare scanpy (AnnData) object we would normally start from in Python. Next, run the standard analysis workflow on it:

sc.pp.normalize_total(adata, target_sum = 1e4)
sc.pp.log1p(adata)
adata.raw = adata
sc.pp.highly_variable_genes(adata, batch_key = 'Dataset', subset = True)
sc.pp.scale(adata, max_value = 10)
sc.tl.pca(adata)
sc.pp.neighbors(adata)
sc.tl.umap(adata)
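
To make the first two steps concrete, here is the arithmetic on a toy count matrix in plain Python. This is an illustrative re-implementation, not scanpy's code: each cell's counts are rescaled so they sum to target_sum, then transformed with log(1 + x). The function name and the toy values are assumptions.

```python
import math


def normalize_and_log1p(counts, target_sum=1e4):
    """Rescale each cell (row) to `target_sum` total counts, then apply log(1 + x)."""
    result = []
    for cell in counts:
        total = sum(cell)
        # Leave all-zero cells untouched to avoid dividing by zero.
        scale = target_sum / total if total > 0 else 0.0
        result.append([math.log1p(value * scale) for value in cell])
    return result


# Two toy cells with the same composition but a tenfold difference in depth.
toy_counts = [
    [1, 3, 0, 6],     # 10 counts in total
    [10, 30, 0, 60],  # 100 counts in total
]
normalized = normalize_and_log1p(toy_counts)
```

After this transformation the two toy cells have identical profiles, which is the point of depth normalization: differences in sequencing depth no longer masquerade as differences in expression.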

Visualize the UMAP, colored by dataset or by donor ID:

sc.pl.umap(adata, color = ['Dataset', 'donor_id'], wspace = 0.5)

The dataset also comes with existing annotations:

sc.pl.umap(adata, color = ['Curated_annotation'], wspace = 0.5)

Overall, the batch effect is fairly strong. Next, we remove it by integrating with cellhint:

# Integrate cells with `cellhint.integrate`.
cellhint.integrate(adata, 'Dataset', 'Curated_annotation')

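A single function call completes the integration, and it is fast, taking less than a minute here.

Note that cellhint needs a batch key, here 'Dataset', and an existing annotation, here 'Curated_annotation'. Instead of 'Dataset', you can also use the donor/sample as the batch, i.e. 'donor_id', for example: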

#cellhint.integrate(adata, 'donor_id', 'Curated_annotation')
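
When choosing between 'Dataset' and 'donor_id' as the batch key, one practical check is how many cells each batch contributes per cell type, to spot batches that hold only a handful of cells of a given type. A minimal sketch on toy labels; `cells_per_batch_and_type` is a hypothetical helper, and in practice the two input sequences would come from adata.obs.

```python
from collections import Counter


def cells_per_batch_and_type(batches, cell_types):
    """Count cells for every (batch, cell type) pair."""
    if len(batches) != len(cell_types):
        raise ValueError("batches and cell_types must have one entry per cell")
    return Counter(zip(batches, cell_types))


# Toy labels standing in for adata.obs['donor_id'] and adata.obs['Curated_annotation'].
toy_batches = ["D1", "D1", "D1", "D2", "D2", "D3"]
toy_types = ["T cell", "T cell", "B cell", "T cell", "B cell", "B cell"]

counts = cells_per_batch_and_type(toy_batches, toy_types)
for (batch, cell_type), n in sorted(counts.items()):
    print(f"{batch}\t{cell_type}\t{n}")
```

On the real object, pandas gives the same table with pd.crosstab(adata.obs['donor_id'], adata.obs['Curated_annotation']).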

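After integration, rerun UMAP. cellhint.integrate rebuilds the neighborhood graph in adata, so sc.pp.neighbors does not need to be called again: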

sc.tl.umap(adata)

sc.pl.umap(adata, color = ['Dataset', 'donor_id'], wspace = 0.5)

sc.pl.umap(adata, color = 'Curated_annotation')

As the plots show, the cells now cluster by cell type.

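3. Automatic annotation with CellTypist

As mentioned above, CellTypist was also developed by the Teichmann group. I have introduced it before, and in my experience its automatic annotation is quite accurate:

Celltypist: a single-cell annotation tool that surpasses singleR

The cellhint workflow therefore incorporates CellTypist. First-time celltypist users need to download the official models (https://www.celltypist.org/models):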

import celltypist
from celltypist import models
models.download_models(force_update = True)
models.models_path # models are downloaded to '~/.celltypist/data/models' under the home directory
models.models_description() # model description
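
To check whether the download step can be skipped, you can look for the model file on disk. The sketch below assumes the default folder mentioned in the comment above ('~/.celltypist/data/models'); `model_is_cached` is a hypothetical helper, not a celltypist function.

```python
from pathlib import Path


def model_is_cached(model_name, models_dir="~/.celltypist/data/models"):
    """Return True if `model_name` already exists in the local models folder."""
    return (Path(models_dir).expanduser() / model_name).is_file()


if __name__ == "__main__":
    name = "Immune_All_Low.pkl"
    print(f"{name} cached: {model_is_cached(name)}")
```

If your installation stores models elsewhere, pass the folder reported by models.models_path as models_dir.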

If you have already downloaded the celltypist models, skip the download step and run:

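import celltypist
# At this point .X holds scaled values; celltypist expects log1p-normalized data
# (10,000 counts per cell) and should fall back to adata.raw, which was set earlier.
adata = celltypist.annotate(adata, model = 'Immune_All_Low.pkl', majority_voting = True).to_adata()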

Annotating 200,000+ cells takes roughly 5 to 10 minutes. The results are stored in adata.obs:

adata.obs[['predicted_labels', 'majority_voting', 'conf_score']]
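
The majority_voting column differs from predicted_labels in that the per-cell predictions are refined within local clusters: each cell receives the most frequent predicted label of its cluster. The toy sketch below illustrates only this voting idea. It is not CellTypist's implementation, and the cluster assignments are supplied as input here rather than computed by over-clustering.

```python
from collections import Counter, defaultdict


def majority_vote(clusters, predicted_labels):
    """Give every cell the most common predicted label within its cluster."""
    labels_by_cluster = defaultdict(list)
    for cluster, label in zip(clusters, predicted_labels):
        labels_by_cluster[cluster].append(label)
    winner = {
        cluster: Counter(labels).most_common(1)[0][0]
        for cluster, labels in labels_by_cluster.items()
    }
    return [winner[cluster] for cluster in clusters]


# Toy input: cluster ids plus per-cell predictions, with one outlier in each cluster.
toy_clusters = [0, 0, 0, 1, 1, 1, 1]
toy_predicted = ["B cells", "B cells", "T cells", "T cells", "T cells", "NK cells", "T cells"]

voted = majority_vote(toy_clusters, toy_predicted)
```

The two outlier predictions are overruled by their clusters, which is why majority_voting usually looks cleaner on a UMAP than predicted_labels.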

If your single-cell data have no existing annotation, you can annotate them with celltypist first and then integrate with cellhint. The integration works very well:

# You can also set 'predicted_labels' here in addition to 'majority_voting'.
cellhint.integrate(adata, 'donor_id', 'majority_voting')
sc.tl.umap(adata)

sc.pl.umap(adata, color = ['Dataset', 'donor_id'], wspace = 0.5)

sc.pl.umap(adata, color = 'majority_voting')

sc.pl.umap(adata, color = 'Curated_annotation')
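
The curated and CellTypist annotations may use different label names, so a plain match rate is not very informative. A contingency-style summary works better. The sketch below, on toy labels, reports for each curated cell type the most frequent CellTypist label and the fraction of cells it covers; `dominant_label_per_type` is a hypothetical helper and the labels are made up for illustration.

```python
from collections import Counter, defaultdict


def dominant_label_per_type(reference, predicted):
    """For each reference label, return (most frequent predicted label, its fraction)."""
    votes = defaultdict(Counter)
    for ref, pred in zip(reference, predicted):
        votes[ref][pred] += 1
    summary = {}
    for ref, counter in votes.items():
        label, n = counter.most_common(1)[0]
        summary[ref] = (label, n / sum(counter.values()))
    return summary


# Toy labels standing in for adata.obs['Curated_annotation'] and adata.obs['majority_voting'].
toy_curated = ["B", "B", "B", "B", "T", "T"]
toy_voted = ["Naive B cells", "Naive B cells", "Naive B cells", "Memory B cells", "Tcm", "Tcm"]

summary = dominant_label_per_type(toy_curated, toy_voted)
for ref, (label, fraction) in summary.items():
    print(f"{ref} -> {label} ({fraction:.0%})")
```

A curated type whose dominant label covers only a small fraction of its cells is worth inspecting before trusting the integration driven by those labels.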

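4. Guiding integration with CellHint harmonisation

Besides integrating with the existing annotation or with the celltypist predictions, CellHint can also guide integration through its harmonisation mode, which likewise runs very fast: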

alignment = cellhint.harmonize(adata, 'Dataset', 'Original_annotation')

Visualize the harmonisation result:

cellhint.treeplot(alignment)

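The tree plot shows which cell types from the different datasets can be treated as corresponding to one another during integration.

Importantly, the reannotation of each cell is stored in alignment.reannotation: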

alignment.reannotation

We can add the reannotation and group columns from this table to the adata object:

adata.obs[['reannotation', 'group']] = alignment.reannotation[['reannotation', 'group']].loc[adata.obs_names]

Take a look:

adata.obs.iloc[:, -2:]

Finally, integrate based on the harmonisation result:

cellhint.integrate(adata, 'donor_id', 'reannotation')
sc.tl.umap(adata)

sc.pl.umap(adata, color = ['Dataset', 'donor_id'], wspace = 0.5)

sc.pl.umap(adata, color = 'Curated_annotation')

sc.pl.umap(adata, color = 'group', legend_loc = 'on data')
sc.pl.umap(adata, color = 'reannotation')

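5. Summary

I have previously commented that scVI and scANVI, the star GPU-dependent integration algorithms, are really too slow to run without a GPU:

Multi-sample single-cell integration with scVI and scANVI

By contrast, the design and speed of cellhint impressed me greatly. In the Cell paper, the authors also benchmark it against other integration algorithms.

I think cellhint can fairly be called the ceiling of integration algorithms. It is simple to run and pairs with celltypist, itself top-tier for automatic annotation, so here is my bold prediction: cellhint is going to take off.

The idea behind cellhint also gave me some food for thought: the Teichmann group is using a set of algorithms to tackle inconsistent cell-type naming and integration in single-cell data and, building on that, to extend integration from single tissues to the whole body, constructing a true "Human Cell Atlas".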
