Closed ixxmu closed 10 months ago
我们马拉松授课的第四周转录组环节的新叶老师计划把其授课内容细化和延伸,进阶到单细胞转录组,所以钻研了几篇优秀的CNS文章的图表复现,数据分析环节有R和Python,都值得解读和分享,先把文字版分享给大家哈。
这篇文献提供的代码是R与python版本,是个极好的学习单细胞与python的资源。今天先来看看文献背景。
标题:A molecular single-cell lung atlas of lethal COVID-19
期刊:Nature
日期:2021/04/29
DOI:https://doi.org/10.1038/s41586-021-03569-1
通讯作者:Benjamin Izar Jianwen Que 哥伦比亚大学欧文医学中心血液学/肿瘤科医学系
代码:https://github.com/IzarLab/CUIMC-NYP_COVID_autopsy_lung
视频:https://www.youtube.com/watch?v=uvyG9yLuNSE
代码:https://github.com/mousepixels/sanbomics_scripts
主代码:https://github.com/mousepixels/sanbomics_scripts/blob/main/single_cell_analysis_complete_class.ipynb
呼吸衰竭是severe SARS-CoV-2感染患者死亡的主要原因,但肺组织水平的宿主反应尚不清楚。本文对19名死于COVID-19的患者和7名对照组患者的肺中约11.6万个细胞核进行了单核RNA测序。综合分析发现了细胞组成、转录细胞状态和细胞间相互作用的重大变化,从而为了解致命COVID-19的生物学提供了帮助。
看这篇文献的时候,心情突然有点那么沉重。
总共包含 COVID-19队列(19例样本)与 对照队列Control(7例样本)
使用单细胞核测序,经过质控流程,得到一个肺的图谱:116,314个单细胞,包括COVID-19队列79,636个细胞以及对照队列Control 36,678个细胞。9个主要的细胞类型鉴定如下:
具体的样本情况以及详细注释:
髓系细胞是COVID-19肺的主要细胞成分,且比对照组肺更为普遍。我们鉴定了:
它们在diffusion component(DC)分析中被恢复为不同的轨迹,在COVID-19肺部更常见(图2a-c)。来自COVID-19个体的髓系细胞被高度和异常激活。
此外:
这些数据表明,髓系细胞是COVID-19中炎症失调的主要来源。
使用the variable heavy (IGHV) and light (IGLV) chains鉴定了浆细胞,T/NK细胞群体分为了:CD8+ T cells (n = 3,561), T regulatory (Treg) cells (n = 649), other CD4+ T cells (n = 7,586), and NK cells (n = 2,141)。
作者发现COVID-19肺部的T细胞丰度没有显著增加,只有细胞因子和T细胞激活和组织驻留相关程序的适度上调(上图G-I)。
上皮细胞分为:
此外:在肺再生过程中,AT2细胞作为AT1细胞的祖细胞,可以分化为AT1;对照组肺中的AT2和AT1细胞形成了不同的簇。这其中有几个比较重要的基因:
在捕获的气道上皮细胞中,我们发现了四种不同的轨迹:
簇状细胞参与气道炎症和肠组织再生,但它们在病毒性肺炎中的作用尚不清楚。本研究中表现如下:
COVID-19肺部的成纤维细胞明显多于对照组(图1d),α-平滑肌肌动蛋白(α-SMA,基因名字ACTA2)免疫组化染色证实了这一发现:
纤维化程度(determined by a Sirius red fibrosis score)与疾病持续时间相关(图4a),表明COVID-19的肺纤维化随着时间的推移而增加:
此外:作者确定了五种成纤维细胞亚型:与对照肺相比,COVID-19肺中病理或中间病理成纤维细胞的频率增加(图4c)
主要细胞类型之间的配体-受体分析结果显示:
这里还有一个有意思的分析:
为了研究针对pFBs(pathological fibroblasts)的潜在治疗策略,作者从单核转录组中推断蛋白质活性,然后将pFBs与其他成纤维细胞进行比较。该分析预测pFBs将表现出JunB和JunD活性的增加(图j),它们通过增强TGFβ和STAT3信号通路诱导小鼠模型的肺纤维化,并与IL-1β的产生增加有关。最后,作者推断出pFBs中的药物作用靶点,并将MMP14和STAT3作为废除pFBs中有害程序的潜在靶点(图j)。
有好几个方法可以利用单细胞RNA数据推断蛋白层面的分子活性,下次介绍~
数据背景接上一篇:Python图文复现2022|01-文献阅读:致命COVID-19分子单细胞肺图谱
有三种途径:
这次就从GEO下载吧,下载完后:3个文件,一个处理后的csv表达数据,一个metadata,一个原始count数据压缩包tar
原文代码:https://github.com/IzarLab/CUIMC-NYP_COVID_autopsy_lung
但是本次我们使用一个利用这个数据讲python学习的资源,
视频相关代码如下:
数据下载后,开工!
# 使用conda安装好相关软件
conda activate scanpy
# 导入包,注意dir路径改成自己的
import scanpy as sc
dir = '/path/data/GSE171524/GSE171524_RAW/'
GSM5226574_C51样本是个肺对照样本,总共包含6099个细胞,34546个基因。
# 读取数据
adata = sc.read_csv(dir + 'GSM5226574_C51ctr_raw_counts.csv').T
adata
#AnnData object with n_obs × n_vars = 6099 × 34546
# 查看数据维度
adata.X.shape
#(6099, 34546)
# pip install scvi-tools
import scvi
# 过滤低表达基因以及高变基因选择
sc.pp.filter_genes(adata, min_cells = 10)
sc.pp.highly_variable_genes(adata, n_top_genes = 2000, subset = True, flavor = 'seurat_v3')
# 训练模型
scvi.model.SCVI.setup_anndata(adata)
vae = scvi.model.SCVI(adata)
vae.train()
#GPU available: False, used: False
#TPU available: False, using: 0 TPU cores
#IPU available: False, using: 0 IPUs
#HPU available: False, using: 0 HPUs
#Epoch 400/400: 100%|████████████████████████████████████████████████████████| 400/400 [23:10<00:00, 3.21s/it, loss=320, v_num=1]`Trainer.fit` stopped: `max_epochs=400` reached.
#Epoch 400/400: 100%|████████████████████████████████████████████████████████| 400/400 [23:10<00:00, 3.48s/it, loss=320, v_num=1]
solo = scvi.external.SOLO.from_scvi_model(vae)
solo.train()
df = solo.predict()
df['prediction'] = solo.predict(soft = False)
df.index = df.index.map(lambda x: x[:-2])
df
结果:doublet这一列的值越高,表明这个细胞约可能是双包体;
预测结果统计:
有1245个细胞被预测为双包体,占总细胞的20%左右,这对于10X来说,双包率有点太高了
df.groupby('prediction').count()
doublet singlet
prediction
doublet 1245 1245
singlet 4854 4854
因此,这里计算一个df值:
df['dif'] = df.doublet - df.singlet
df
新增一列df值:
给这个df绘制一个分布图:
import seaborn as sns
import matplotlib.pyplot as plt
sns.displot(df[df.prediction == 'doublet'], x = 'dif')
plt.savefig(dir+"01-df-displot.png")
结果图:
因此,将df大于1的预测为双包体:
doublets = df[(df.prediction == 'doublet') & (df.dif > 1)]
doublets
adata = sc.read_csv(dir+'GSM5226574_C51ctr_raw_counts.csv').T
adata.obs['doublet'] = adata.obs.index.isin(doublets.index)
adata.obs
预测结果:
去除:还剩余5618个细胞
adata = adata[~adata.obs.doublet]
adata
View of AnnData object with n_obs × n_vars = 5618 × 34546
obs: 'doublet'
# 计算线粒体基因比例
adata.var['mt'] = adata.var.index.str.startswith('MT-')
adata.var
# 核糖体RNA基因
import pandas as pd
ribo_url = "http://software.broadinstitute.org/gsea/msigdb/download_geneset.jsp?geneSetName=KEGG_RIBOSOME&fileType=txt"
ribo_genes = pd.read_table(ribo_url, skiprows=2, header = None)
ribo_genes
# 0
# 0 FAU
# 1 MRPL13
# 2 RPL10
# 3 RPL10A
# 4 RPL10L
# .. ...
# 83 RPS9
# 84 RPSA
# 85 RSL24D1
# 86 RSL24D1P11
# 87 UBA52
# [88 rows x 1 columns]
adata.var['ribo'] = adata.var_names.isin(ribo_genes[0].values)
接着计算qc指标
sc.pp.calculate_qc_metrics(adata, qc_vars=['mt', 'ribo'], percent_top=None, log1p=False, inplace=True)
adata.var.sort_values('n_cells_by_counts')
结果如下:
低表达过滤:
sc.pp.filter_genes(adata, min_cells=3)
adata.var.sort_values('n_cells_by_counts')
adata.obs.sort_values('n_genes_by_counts')
# 绘制小提琴图
sc.pl.violin(adata, ['n_genes_by_counts', 'total_counts', 'pct_counts_mt', 'pct_counts_ribo'], jitter=0.4, multi_panel=True)
plt.savefig(dir+"02-qc_violin.png")
结果图:
按照分位数来过滤细胞:
import numpy as np
upper_lim = np.quantile(adata.obs.n_genes_by_counts.values, .98)
#upper_lim = 3000
upper_lim
adata = adata[adata.obs.n_genes_by_counts < upper_lim]
adata.obs
过滤之后:
线粒体比例与核糖体比例:
adata = adata[adata.obs.pct_counts_mt < 20]
adata = adata[adata.obs.pct_counts_ribo < 2]
adata
过滤后:
View of AnnData object with n_obs × n_vars = 5489 × 24080
obs: 'doublet', 'n_genes_by_counts', 'total_counts', 'total_counts_mt', 'pct_counts_mt', 'total_counts_ribo', 'pct_counts_ribo'
var: 'mt', 'ribo', 'n_cells_by_counts', 'mean_counts', 'pct_dropout_by_counts', 'total_counts', 'n_cells'
标准化前后的区别可看:adata.X.sum(axis = 1)值的变化
adata.X.sum(axis = 1)
#normalize every cell to 10,000 UMI
sc.pp.normalize_total(adata, target_sum=1e4)
adata.X.sum(axis = 1)
#change to log counts
sc.pp.log1p(adata)
adata.X.sum(axis = 1)
adata.raw = adata
# 高变基因选择以及可视化
sc.pp.highly_variable_genes(adata, n_top_genes = 2000)
sc.pl.highly_variable_genes(adata)
plt.savefig(dir+"03-highly_variable_genes.png")
结果图:
选择主成分:
adata = adata[:, adata.var.highly_variable]
sc.pp.regress_out(adata, ['total_counts', 'pct_counts_mt', 'pct_counts_ribo'])
sc.pp.scale(adata, max_value=10)
sc.tl.pca(adata, svd_solver='arpack')
sc.pl.pca_variance_ratio(adata, log=True, n_pcs = 50)
plt.savefig(dir+"04-pca_variance.png")
结果图:
这里选择30个PCs,然后聚类:
sc.pp.neighbors(adata, n_pcs = 30)
sc.tl.umap(adata)
sc.pl.umap(adata)
plt.savefig(dir+"05-umap.png")
结果图:
使用leiden在低维空间可视化:
sc.tl.leiden(adata, resolution = 0.5)
sc.pl.umap(adata, color=['leiden'])
plt.savefig(dir+"05-umap_leiden.png")
结果图:聚成了11类
单个样本的分析演示到这里,下期进行所有样本整合分析~
先定义一个函数,批量运行多个样本:这里一定要注意缩进问题
def pp(csv_path): adata = sc.read_csv(csv_path).T sc.pp.filter_genes(adata, min_cells = 10) sc.pp.highly_variable_genes(adata, n_top_genes = 2000, subset = True, flavor = 'seurat_v3') scvi.model.SCVI.setup_anndata(adata) vae = scvi.model.SCVI(adata) vae.train() solo = scvi.external.SOLO.from_scvi_model(vae) solo.train() df = solo.predict() df['prediction'] = solo.predict(soft = False) df.index = df.index.map(lambda x: x[:-2]) df['dif'] = df.doublet - df.singlet doublets = df[(df.prediction == 'doublet') & (df.dif > 1)] adata = sc.read_csv(csv_path).T adata.obs['Sample'] = csv_path.split('_')[2] #'raw_counts/GSM5226574_C51ctr_raw_counts.csv' adata.obs['doublet'] = adata.obs.index.isin(doublets.index) adata = adata[~adata.obs.doublet] sc.pp.filter_cells(adata, min_genes=200) #get rid of cells with fewer than 200 genes #sc.pp.filter_genes(adata, min_cells=3) #get rid of genes that are found in fewer than 3 cells adata.var['mt'] = adata.var_names.str.startswith('mt-') # annotate the group of mitochondrial genes as 'mt' adata.var['ribo'] = adata.var_names.isin(ribo_genes[0].values) sc.pp.calculate_qc_metrics(adata, qc_vars=['mt', 'ribo'], percent_top=None, log1p=False, inplace=True) upper_lim = np.quantile(adata.obs.n_genes_by_counts.values, .98) adata = adata[adata.obs.n_genes_by_counts < upper_lim] adata = adata[adata.obs.pct_counts_mt < 20] adata = adata[adata.obs.pct_counts_ribo < 2] return adata
这个过程比较久,会依次读取GSE171524数据集中的26例样本,并进行上面的函数里面定义的分析。
这个过程中就可以跑去听听视频了。
import os# 注意dir改成自己的路径dir = '/path/data/GSE171524/'out = []for file in os.listdir(dir+'raw_counts/'): out.append(pp(dir + 'raw_counts/' + file))
将多个样本连接合并在一起,合并后共有105264个细胞:
adata = sc.concat(out)adataAnnData object with n_obs × n_vars = 105264 × 29236 obs: 'Sample', 'doublet', 'n_genes', 'n_genes_by_counts', 'total_counts', 'total_counts_mt', 'pct_counts_mt', 'total_counts_ribo', 'pct_counts_ribo' var: 'n_cells'
过滤并保存:
sc.pp.filter_genes(adata, min_cells = 10)from scipy.sparse import csr_matrixadata.X = csr_matrix(adata.X)adata.Xadata.write_h5ad(dir+'combined.h5ad')
先读取上次保存的数据:105264个细胞
import scanpy as scimport scviimport seaborn as snsimport numpy as npimport pandas as pdimport matplotlib.pyplot as plt# 注意dir改成自己的路径dir = '/path/data/GSE171524/'adata = sc.read_h5ad(dir+'combined.h5ad')adataAnnData object with n_obs × n_vars = 105264 × 29236 obs: 'Sample', 'doublet', 'n_genes', 'n_genes_by_counts', 'total_counts', 'total_counts_mt', 'pct_counts_mt', 'total_counts_ribo', 'pct_counts_ribo' var: 'n_cells'
每个样本中的各种指标统计:
adata.obs.groupby('Sample').count()
共26个样本:
sc.pp.filter_genes(adata, min_cells = 100)adata.layers['counts'] = adata.X.copy()sc.pp.normalize_total(adata, target_sum = 1e4)sc.pp.log1p(adata)adata.raw = adata# 查看每个细胞的指标adata.obs.head()
结果图:
这个过程运行时间也会稍长一些
scvi.model.SCVI.setup_anndata(adata, layer = "counts", categorical_covariate_keys=["Sample"], continuous_covariate_keys=['pct_counts_mt', 'total_counts', 'pct_counts_ribo'])model = scvi.model.SCVI(adata)#may take a while without GPUmodel.train()adata.obsm['X_scVI'] = model.get_latent_representation()adata.layers['scvi_normalized'] = model.get_normalized_expression(library_size = 1e4)sc.pp.neighbors(adata, use_rep = 'X_scVI')sc.tl.umap(adata)sc.tl.leiden(adata, resolution = 0.5)sc.pl.umap(adata, color = ['leiden', 'Sample'], frameon = False)plt.savefig(dir+"01-intergration_umap.png")
结果图:
保存数据:
adata.write_h5ad(dir + 'integrated.h5ad')
使用没有矫正的数据做差异表达分析
# 更改聚类数sc.tl.leiden(adata, resolution = 1)sc.tl.rank_genes_groups(adata, 'leiden')# 可视化sc.pl.rank_genes_groups(adata, n_genes=20, sharey=False)plt.savefig(dir+"02-intergration_rank_genes_groups.png", dpi=300)
部分cluster的top基因:
markers = sc.get.rank_genes_groups_df(adata, None)markers = markers[(markers.pvals_adj < 0.05) & (markers.logfoldchanges > .5)]markers# 保存markers.to_csv(dir+'markers.csv')
差异结果:
使用模型标准化后的值做差异表达分析:
markers_scvi = model.differential_expression(groupby = 'leiden')# 设定阈值markers_scvi = markers_scvi[(markers_scvi['is_de_fdr_0.05']) & (markers_scvi.lfc_mean > .5)]markers_scvi# 保存markers_scvi.to_csv(dir+'scvi_markers.csv')
结果:
重新聚类后的结果可视化:总共得到34个cluster
sc.pl.umap(adata, color = ['leiden'], frameon = False, legend_loc = "on data")plt.savefig(dir+"02-intergration_umap_1.png")## 有多少clusteradata.obs['leiden']#Name: leiden, Length: 105264, dtype: category#Categories (34, object): ['0', '1', '2', '3', ..., '30', '31', '32', '33']
结果图:
文章中进行了三次注释,第一次注释大类,主要为9个类:
这几个类的基因文章中没有提供,就用我们自己收集的基因来进行注释好了。
在视频资源中,视频speaker老师给了一张图:是目前免疫细胞分类很详细的一张图了:
https://learn.cellsignal.com/hubfs/landing-pages/2019/18-IMM-18284/18-IMM_18284-Human%20Markers%20PWHO-digital.pdf
基因可视化:
markers = ['EPCAM','KRT8','KRT18','KRT19', # epithelial cells 'CD68', 'CTSS', 'FCN1','CD163', # myeloid cells 'ACTA2', 'DCN', 'ACTB' , # fibroblasts 'PECAM1','CD34','VWF', # endothelial cells 'PTPRC','CD3D','CD3E','CD3G','CD8A', 'CD4', # T and natural killer (NK) lymphocytes 'CD79A', 'MS4A1','CD19', # B lymphocytes and plasma cells 'SNAP25', 'SYT1', # neuronal cells 'CPA3', # mast cells 'CST3', 'LAMP3', 'HLA-DQB2', 'HLA-DPB1', 'BIRC3', # antigen-presenting cells (APCs; primarily dendritic cells) 'MKI67','TOP2A', # Cycling 'HBB','HBD' # Erythroid ]# 看某一个基因的表达情况markers[markers.names=="HBB"]markers[markers.group=="30"]#, layer = 'scvi_normalized'# , vmax = 5sc.pl.umap(adata, color = ["HBB"], frameon = False, layer = 'scvi_normalized', vmax =4)plt.savefig(dir+"03-intergration_umap_Erythroid_1.png")
以marker为基础,绘制如下类似图,vmax参数可以进行调整让结果看起来更明显,注视不出来的可以看看每个cluster高表达的基因,查一查功能就立马可以推断出来了:
结合marker表达以及cluster特异性高表达基因:
细胞注释如下:
for x in range(0,34): print(f'"{x}":"", ')cell_type = {"0":"T Cell","1":"Epithelial Cell","2":"Myeloid Cell","3":"Fibroblast","4":"Myeloid Cell","5":"T Cell","6":"Myeloid Cell","7":"Epithelial Cell","8":"Endothelial","9":"Fibroblast","10":"Myeloid Cell","11":"pDC","12":"Epithelial Cell","13":"Fibroblast","14":"Fibroblast","15":"Epithelial Cell","16":"Epithelial Cell","17":"Myeloid Cell","18":"Epithelial Cell","19":"Epithelial Cell","20":"Neuronal Cell","21":"Epithelial Cell","22":"B Cell","23":"Mast Cell","24":"T Cell","25":"pDC","26":"Myeloid Cell","27":"Endothelial","28":"Fibroblast","29":"Myeloid Cell","30":"Erythroid","31":"B Cell","32":"Neuronal Cell","33":"Myeloid Cell"}
adata.obs['cell type'] = adata.obs.leiden.map(cell_type)sc.pl.umap(adata, color = ['cell type'], frameon = False, legend_loc = "on data")plt.tight_layout()plt.savefig(dir+"04-intergration_umap_anno.png", dpi=300)adataadata.uns['scvi_markers'] = markers_scviadata.uns['markers'] = markers# 保存adata.write_h5ad(dir + 'integrated_anno.h5ad')model.save(dir + 'model.model')
细胞注释结果如下:
这里注释出来了一串文献中没有的红细胞。
文献中的Fig1如下:
如果要绘制Fig1,需要对上次笔记(Python图文复现2022|03-多样本整合分析)中的注释再进行进一步的详细注释,上次注释结果:
import scanpy as sc
import scvi
import seaborn as sns
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
# 注意dir改成自己的路径
dir = '/dir/data/GSE171524/'
adata = sc.read_h5ad(dir+'integrated_anno.h5ad')
adata
marker采用来自数据库:PanglaoDB https://panglaodb.se/
首先是B,T/NK细胞,有:
上皮群:
髓系:
基质细胞:
markers = adata.uns['markers']
# 看某一个基因的表达情况
markers[markers.names=="RGS5"]
#, layer = 'scvi_normalized'
# , vmax = 5
sc.pl.umap(adata, color = ["RGS5"], frameon = False, layer = 'scvi_normalized', vmax =2)
plt.savefig(dir+"03-intergration_umap_gene_RGS5.png")
先给出每个cluster编号以及需要填充的空格,在后续的操作中这个地方会填上每个注释结果:
for x in range(0,34):
print(f'"{x}":"", ')
cell_type_sub = {"0":"CD4+ T cells",
"1":"AT1",
"2":"Macrophages",
"3":"Fibroblast",
"4":"Macrophages",
"5":"CD8+ T cells",
"6":"Macrophages",
"7":"AT2",
"8":"Endothelial",
"9":"Fibroblast",
"10":"Macrophages",
"11":"Plasma cells",
"12":"AT2",
"13":"Fibroblast",
"14":"Fibroblast",
"15":"AT2",
"16":"Airway epithelial",
"17":"Macrophages",
"18":"Airway epithelial",
"19":"AT2",
"20":"Neuronal Cell",
"21":"Airway epithelial",
"22":"B Cell",
"23":"Mast Cell",
"24":"Cycling T/NK",
"25":"Plasma cells",
"26":"DCs",
"27":"Endothelial",
"28":"smooth muscle cells",
"29":"Monocytes",
"30":"Erythroid",
"31":"B Cell",
"32":"Neuronal Cell",
"33":"Macrophages"
}
注释后:
adata.obs['cell type_sub'] = adata.obs.leiden.map(cell_type_sub)
sc.pl.umap(adata, color = ['cell type_sub'], frameon = False, legend_loc = "on data")
plt.tight_layout()
plt.savefig(dir+"04-intergration_umap_anno_sub.png", dpi=300)
# 保存
adata.write_h5ad(dir + 'integrated_anno.h5ad')
详细注释结果如下:
首先对细胞进行计数:
# adata = sc.read_h5ad('integrated.h5ad')
adata.obs.Sample.unique().tolist()
# 添加一列细胞的样本来源
def map_condition(x):
if 'cov' in x:
return 'COVID19'
else:
return 'Control'
adata.obs['condition'] = adata.obs.Sample.map(map_condition)
adata.obs
结果图:
# 每个样本中的细胞数统计
num_tot_cells = adata.obs.groupby(['Sample']).count()
num_tot_cells = dict(zip(num_tot_cells.index, num_tot_cells.doublet))
num_tot_cells
# {'C51ctr': 10934, 'C52ctr': 3967, 'C53ctr': 6076, 'C54ctr': 3860, 'C55ctr': 5110, 'C56ctr': 3609, 'C57ctr': 4307, 'L01cov': 2737,'L03cov': 3604, 'L04cov': 3056, 'L04covaddon': 4073, 'L05cov': 2435, 'L06cov': 5657, 'L07cov': 4217, 'L08cov': 3473, 'L09cov': 3075, 'L10cov': 2713, 'L11cov': 2531, 'L12cov': 3259, 'L13cov': 4244, 'L15cov': 3516, 'L16cov': 1600, 'L17cov': 3937, 'L18cov': 2392, 'L19cov': 2135, 'L21cov': 2961, 'L22cov': 5786}
cell_type_counts = adata.obs.groupby(['Sample', 'condition', 'cell type']).count()
cell_type_counts = cell_type_counts[cell_type_counts.sum(axis = 1) > 0].reset_index()
cell_type_counts = cell_type_counts[cell_type_counts.columns[0:4]]
cell_type_counts
# Sample condition cell type doublet
# 0 C51ctr Control B Cell 70
# 1 C51ctr Control Endothelial 1153
# 2 C51ctr Control Epithelial Cell 3711
# 3 C51ctr Control Fibroblast 1497
# 4 C51ctr Control Mast Cell 290
# .. ... ... ... ...
# 246 L22cov COVID19 Mast Cell 7
# 247 L22cov COVID19 Myeloid Cell 1663
# 248 L22cov COVID19 Neuronal Cell 83
# 249 L22cov COVID19 T Cell 1379
# 250 L22cov COVID19 pDC 510
# [251 rows x 4 columns]
# 计算频率
cell_type_counts['total_cells'] = cell_type_counts.Sample.map(num_tot_cells).astype(int)
cell_type_counts['frequency'] = cell_type_counts.doublet / cell_type_counts.total_cells
cell_type_counts
每种细胞类型在COVID19与control两种条件下的频率差异:
绘图:
plt.figure(figsize = (10,4))
ax = sns.boxplot(data = cell_type_counts, x = 'cell type', y = 'frequency', hue = 'condition')
plt.xticks(rotation = 35, rotation_mode = 'anchor', ha = 'right')
plt.tight_layout()
plt.savefig(dir+"05-intergration_Fig1D.png", dpi=300)
结果图如下:
C图如下:
sc.pl.umap(adata, color = ['cell type_sub','condition'], frameon = False, legend_loc = "on data")
plt.savefig(dir+"05-intergration_Fig1B_C.png", dpi=300)
结果图如下:
下期见~
本次绘制图C:
Differential gene expression (log-normalized, scaled; see Methods) of AT1 and AT2 cells from COVID-19 and control lungs. Columns, single cells; rows, expression of top-regulated genes. Left bar, lineage markers for AT1 (purple) and AT2 (pink) cells. Colour-coded top lanes indicate expression strength of signatures (log-normalized; see Methods) and group assignment as indicated on the right. exp., expression
import scanpy as sc
import scvi
import seaborn as sns
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
# 注意dir改成自己的路径
dir = '/dir/data/GSE171524/'
adata = sc.read_h5ad(dir+'integrated_anno.h5ad')
adata
首先提取注释到的AT1与AT2
subset = adata[adata.obs['cell type_sub'].isin(['AT1', 'AT2'])].copy()
subset
总共20741个细胞
差异分析可使用SCVI or diffxpy来做
这里需要安装diffxpy这个包,还比较麻烦。需要安装在scanpy的conda环境中,安装如下:
# 先安装依赖 batchglm 需要安装以前的版本:https://pypi.org/project/batchglm/#history 找到0.7.4版本,下载到本地
tar -zxvf batchglm-0.7.4.tar.gz
cd batchglm-0.7.4
pip install -e .
# 再安装 diffxpy
# 下载包到本地:https://github.com/theislab/diffxpy
unzip diffxpy-master.zip
cd diffxpy-master
pip install -e .
这样就安装成功了:
# two options: SCVI or diffxpy
import diffxpy.api as de
subset.X = subset.X.toarray()
len(subset.var)
#20858
# 低表达过滤
sc.pp.filter_genes(subset, min_cells=100)
len(subset.var)
#13412
# 修改一下细胞注释的列名
subset.obs = subset.obs.rename(columns = {'cell type_sub':'cell_type_sub'})
subset.obs
差异分析:
#if want to test between covid/non covid
# res = de.test.wald(data=subset,
# formula_loc= '~ 1 + condition',
# factor_loc_totest='condition'
# )
res = de.test.wald(data=subset,
formula_loc= '~ 1 + cell_type_sub',
factor_loc_totest='cell_type_sub'
)
dedf = res.summary().sort_values('log2fc', ascending = False).reset_index(drop = True)
dedf
结果如下:
查看谁是实验组,谁是对照组,并调整为 AT1 vs AT2
subset.obs.cell_type_sub.unique()
# ['AT2', 'AT1']
# Categories (2, object): ['AT1', 'AT2']
most_up = dedf.iloc[0].gene
i = np.where(subset.var_names == most_up)[0][0]
a = subset[subset.obs.cell_type_sub == 'AT1'].X[:, i]
b = subset[subset.obs.cell_type_sub == 'AT2'].X[:, i]
print(f"{most_up} expression:")
print(f"AT1: {a.mean()}")
print(f"AT2: {b.mean()}")
# THBS2 expression:
# AT1: 0.0
# AT2: 0.04352555051445961
dedf['log2fc'] = dedf['log2fc']*-1
dedf = dedf.sort_values('log2fc', ascending = False).reset_index(drop = True)
dedf
结果图:
挑选显著差异结果:
dedf = dedf[(dedf.qval < 0.05) & (abs(dedf.log2fc) > .5)]
dedf
结果图如下,有4836个显著差异表达基因:
卡表达值:
dedf = dedf[dedf['mean'] > 0.15]
dedf
还有1095个基因:
选择top50个基因绘制热图:
#top 25 and bottom 25 from sorted df
genes_to_show = dedf[-25:].gene.tolist() + dedf[:25].gene.tolist()
sc.pl.heatmap(subset, genes_to_show, groupby='cell_type_sub', swap_axes=True)
plt.savefig(dir+"06-intergration_heatmap.png")
结果图:
# 加载上一次保存的模型
model = scvi.model.SCVI.load(dir+ 'model.model', adata)
model
# 差异分析
scvi_de = model.differential_expression(
idx1 = [adata.obs['cell type_sub'] == 'AT1'],
idx2 = [adata.obs['cell type_sub'] == 'AT2']
)
scvi_de
# 显著性结果筛选
scvi_de = scvi_de[(scvi_de['is_de_fdr_0.05']) & (abs(scvi_de.lfc_mean) > .5)]
scvi_de = scvi_de.sort_values('lfc_mean')
scvi_de
# 表达筛选
scvi_de = scvi_de[(scvi_de.raw_normalized_mean1 > .5) | (scvi_de.raw_normalized_mean2 > .5)]
scvi_de
最终筛选到952个基因:
#top 25 and bottom 25 from sorted df
genes_to_show = scvi_de[-25:].index.tolist() + scvi_de[:25].index.tolist()
sc.pl.heatmap(subset, genes_to_show, groupby='cell_type_sub', swap_axes=True, layer = 'scvi_normalized',log = True)
plt.savefig(dir+"06-intergration_heatmap_scvi.png")
结果图如下:
结果保存
dedf.to_csv(dir+'dedf.csv')
scvi_de.to_csv(dir+'scvi_de.csv')
下次见~
import scanpy as sc
import scvi
import seaborn as sns
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
# 注意dir改成自己的路径
dir = '/dir/data/GSE171524/'
adata = sc.read_h5ad(dir+'integrated_anno.h5ad')
adata
# 添加一列细胞的样本来源,上次没保存这个
def map_condition(x):
if 'cov' in x:
return 'COVID19'
else:
return 'Control'
adata.obs['condition'] = adata.obs.Sample.map(map_condition)
adata.obs
# 提取注释到的AT1与AT2
subset = adata[adata.obs['cell type_sub'].isin(['AT1', 'AT2'])].copy()
subset
# 差异分析结果
dedf = pd.read_csv(dir + 'dedf.csv', index_col=0)
scvi_de = pd.read_csv(dir + 'scvi_de.csv', index_col=0)
接下来对AT1 vs AT2的差异表达基因进行功能富集分析。
python中使用的是gseapy包进行功能富集分析,需要安装在scanpy的conda环境中,并且这个包需要在联网状态下使用:
# this method requires internet connection
# 安装:conda install -c bioconda gseapy -y
import gseapy as gp
# 查看包中都有哪些功能数据库集合
gp.get_library_name()
enr = gp.enrichr(gene_list= dedf[dedf.log2fc > 0].gene.tolist(),
gene_sets=['KEGG_2021_Human','GO_Biological_Process_2021'],
organism='human', # don't forget to set organism to the one you desired!
outdir=None, # don't write to disk,
background = subset.var_names.tolist()
)
enr.results
结果如下:
subset.obs = subset.obs.rename(columns = {'cell type_sub':'cell_type_sub'})
sc.pl.violin(subset[subset.obs.cell_type_sub == 'AT2'], 'ETV5', groupby='condition')
plt.savefig(dir+"07-intergration_condition_ETV5.png")
结果图如下:
查看基因ETV5在两组中的表达显著性:使用mannwhitneyu进行检验
from scipy import stats
temp = subset[subset.obs.cell_type_sub == 'AT2']
i = np.where(temp.var_names == 'ETV5')[0][0]
a = temp[temp.obs.condition == 'COVID19'].X[:,i].todense()
b = temp[temp.obs.condition == 'Control'].X[:,i].todense()
stats.mannwhitneyu(a, b)
结果如下:MannwhitneyuResult(statistic=array([[17353149.]]), pvalue=array([[2.10908943e-171]]))
首先从文献的补充材料table S4的找到signature集合:our_DATP_sig,总共163个基因
另存为:datp_sig.txt,不要表头
这个基因集是什么来头呢,文中描述如下:
Recent studies have shown that inflammation can induce a cell state that is characterized by failure to fully transition to AT1 cells; this has been termed ‘damage-associated transient progenitors’ (DATPs), ‘alveolar differentiation intermediate’ (ADI), or ‘pre-AT1 transitional cell state’ (PATS)25–27 (hereafter referred to as DATPs)
现在想看DATPs这个基因集在AT1/AT2中不同分组COVID-19和Control的打分情况,是否支持:incomplete transition of AT2 to AT1 cells in COVID-19 lungs(在肺再生过程中,AT2细胞作为AT1细胞的祖细胞)与 除了病毒感染直接破坏肺泡上皮外,COVID-19患者的肺再生过程也受到损害 。
#gene signature, ie, input list of genes from user
with open(dir + 'datp_sig.txt') as f:
datp_sig = [x.strip() for x in list(f)]
sc.tl.score_genes(subset, datp_sig, score_name = 'datp')
subset.obs
现在可以看到数据中多了一列datp的打分:
打分可视化:
sc.pl.violin(subset, 'datp', groupby='condition')
plt.savefig(dir+"08-intergration_datp_sig.png")
结果如下:
显著性检验:
a = subset[subset.obs.condition == 'COVID19'].obs.datp.values
b = subset[subset.obs.condition == 'Control'].obs.datp.values
stats.mannwhitneyu(a, b)
# 结果
# MannwhitneyuResult(statistic=77312171.0, pvalue=0.0)
# 将打分UMAP可视化
sc.pl.umap(subset, color = 'datp', vmax = 0.5)
plt.savefig(dir+"09-intergration_datp_sig_umap.png")
结果图如下:
本次国庆特刊单细胞python图文复现到此更新完毕,关于结果解读,如为什么选择绘制基因ETV5、CAV1在AT1/AT2不同分组COVID19 vs Control中的表达等等,视频中对文献结果的解读更精彩,可前往观看~
https://mp.weixin.qq.com/s/PFG0JXHm8BTA1ZBf-r1Lxw