明明PCA区分非常好，但是差异基因数量很少？ by 生信菜鸟团

本周我们将给大家带来一篇文章，在这篇文章中PCA的效果明明很好，但差异表达基因数量却很少，富有经验的曾老师看出了问题并认为值得深入分析，所以我们这周就来探讨这个问题。

文章背景内容介绍

Multi-level transcriptome sequencingidentifies COL1A1 as a candidate marker in human heart failure progression

https://link.springer.com/article/10.1186/s12916-019-1469-4

GSE135055

https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE135055

背景：
心力衰竭（HF）已被公认为全球流行病，其相关住院率、发病率和死亡率很高。尽管已经取得了许多进展，但其代表性的分子特征在很大程度上仍然未知，尤其是基因在HF进展中的作用。本前瞻性随访研究的目的是揭示与心力衰竭进展相关的潜在生物标志物。
方法：
我们收集了21名HF患者和9名健康供体的左心室心脏组织作为研究队列，并生成了多层次转录组数据。通过使用Masson染色来计算每个样本的纤维化百分比，我们应用lasso回归模型来识别与纤维化和进展相关的基因。
在同一队列中通过免疫组织化学（IHC）染色和使用另一个独立队列（20名HF和9名健康供体）的qRT-PCR进一步验证了这些基因。
酶联免疫吸附试验（ELISA）用于测量验证队列（87名HF患者）中的血浆水平，以预测HF进展。
结果：
基于多级转录组数据，我们检测了研究队列中差异表达的基因[mRNAs、microRNAs和长非编码RNA（lncRNAs）]。后续的功能注释和调节网络揭示了它们在调节细胞外基质中的潜在作用。我们进一步鉴定了几个与纤维化相关的基因。
通过使用移植前的存活时间，COL1A1被确定为HF进展的潜在生物标志物，其上调通过IHC和qRT-PCR得到证实。此外，发现血浆中COL1A1含量>281.7ng/ml与心力衰竭心脏移植一年内的不良生存率有关[风险比（HR）：27.52，95%置信区间（CI）：7.474至101.3，对数秩p值<1.0×10-4]。
结论：
我们的研究结果表明，COL1A1是HF的血浆生物标志物，与HF的进展有关，尤其是用于预测从HF发作到移植的1年生存期。

As shown in Fig. 3a–c, principal component analysis (PCA) was first performed for the expression profiles of mRNAs, lncRNAs, and miRNAs. All three RNA types could largely distinguish HF patients from healthy controls.

Interestingly, lncRNA expression profiles had the best performance in distinguishing HF from healthy controls, reinforcing the importance to further investigate their underlying mechanism in HF.

We identified 126 mRNAs, 16 lncRNAs, and 42 miRNAs that were differentially expressed in HF versus control samples by requiring the absolute log2-transformed fold change (FC) > 1 and adjusted p value < 0.05 by the BH method.

我的思路：

上游分析的影响
差异表达分析的影响
如何评估PCA的效果？PCA效果好，差异表达基因一定多吗？

1.查看该实验的数据上游分析，看看注释到的转录本数量

The fragments per kilobase of transcript per million mapped reads (FPKM) value of 186,363 transcripts was calculated based on StringTie (version 1.3.4) with default parameters [16]. mRNA with low expression were excluded, defined as those with FPKM less than 1 in more than 80% samples. For lncRNAs, we did not remove the low expressed transcripts, while we identified the potential long non-coding transcripts by filtering out short transcripts (default 200 nt) and removing single exon transcripts based on the FEELnc tool (version 0.11-2) [17]. The remaining transcripts were then used to perform the following analysis.

For the pre-processing of miRNA data, we removed miRNAs with a missing value in > 10% of the samples. According to the miRNA sequence database miRBase [20] (Release 22), there were 2620 mature human miRNAs. Our analysis resulted in 604 expressed miRNAs in our samples.

理解这些信息以便我们下面查看GEO提供的表达量矩阵文件

2.看看作者怎么做的PCA和差异表达分析

对于mRNA，作者走的hisat2+stringtie+ballgown流程

参考资料：
使用TPM/FPKM/RPKM进行差异分析真的可以消除系统误差吗？
为什么很少有人用hisat2+stringtie+ballgown发转录组文章？https://www.zhihu.com/question/507628729/answer/2306597069
学徒作业-hisat2+stringtie+ballgown流程http://www.bio-info-trainee.com/7443.html
lncRNA入门到精通

We identified 126 mRNAs, 16 lncRNAs, and 42 miRNAs that were differentially expressed in HF versus control samples by requiring the absolute log2-transformed fold change (FC) > 1 and adjusted p value < 0.05 by the BH method.

需要注意的是作者这里是对RNA而非基因进行的差异表达分析

下载GEO提供的表达矩阵看看

根据第一部分我们得到的上游分析信息，可以发现，两个矩阵都是原始矩阵，没有进行过滤和筛选

The fragments per kilobase of transcript per million mapped reads (FPKM) value of 186,363 transcripts was calculated based on StringTie (version 1.3.4) with default parameters [16]. mRNA with low expression were excluded, defined as those with FPKM less than 1 in more than 80% samples. For lncRNAs, we did not remove the low expressed transcripts, while we identified the potential long non-coding transcripts by filtering out short transcripts (default 200 nt) and removing single exon transcripts based on the FEELnc tool (version 0.11-2) [17]. The remaining transcripts were then used to perform the following analysis.

For the pre-processing of miRNA data, we removed miRNAs with a missing value in > 10% of the samples. According to the miRNA sequence database miRBase [20] (Release 22), there were 2620 mature human miRNAs. Our analysis resulted in 604 expressed miRNAs in our samples.

关于如何解读stringtie定量得到的结果文件，参考资料：

关于stringtie定量基因的时候，最后输出很多MSTRG样式的geneidhttps://www.jianshu.com/p/1697f2d91eb5
MSTRG Tag, what does it refer to?https://github.com/gpertea/stringtie/issues/95
How to deal with MSTRG tag without relevant gene name?https://www.biostars.org/p/282817/

其实如果不找新基因/新转录本没必要用这套流程，可以直接使用我们之前介绍的hisat2+featureCount+DESeq2流程

我们这里使用AnnoProbe包根据gene_name得到RNA的分类信息，如果不是找新基因或新转录本就不需要再看没有gene_name的RNA

①对获取的fpkm矩阵使用limma进行差异表达分析：

rm(list=ls())
data <- read.table("GSE135055_FPKM_Expression_Matrix.txt",sep = '\t',header = TRUE)
library(tidyverse)
data <- tibble(data)

library(AnnoProbe)
ids=annoGene(data$gene_name,'SYMBOL','human')
head(ids)
as.data.frame(tail(sort(table(ids$biotypes))))
# Var1  Freq
# 1             unprocessed_pseudogene   408
# 2 transcribed_unprocessed_pseudogene   432
# 3                              miRNA  1855
# 4                             lncRNA  2239
# 5               processed_pseudogene  2687
# 6                     protein_coding 18513
mRNA_gene_name <- ids$SYMBOL[ids$biotypes == "protein_coding"]

fpkm <- data[,c(11:40)]

# 将FPKM值转换为TPM值
expMatrix <- fpkm
fpkmToTpm <- function(fpkm)
{
  exp(log(fpkm) - log(sum(fpkm)) + log(1e6))
}
tpms <- apply(expMatrix,2,fpkmToTpm) #每列的基因
tpms[1:3,]
colSums(tpms) #都是每一百万（深度一样）

# 筛选出非低表达mRNA
# mRNA
index_mRNA <- data$gene_name %in% mRNA_gene_name
table(index_mRNA)
# FALSE   TRUE 
# 64855 121508
# 非低表达
# mRNA with low expression were excluded, defined as those with FPKM less than 1 in more than 80% samples.
keep <- rowSums(fpkm>1) >= 24  # 30 * 0.8
table(keep)
# FALSE   TRUE 
# 175091  11272 
# 可以发现这一步过滤掉了很多低表达mRNA
keep_final <- keep & index_mRNA
table(keep_final)
# FALSE   TRUE 
# 177092   9271 
# 只剩下了近1w转录本

t_name_mRNA <- data$t_name[keep_final]
tpms_mRNA <- tpms[keep_final,]
rownames(tpms_mRNA) <- t_name_mRNA
head(tpms_mRNA)

# 差异分析
group_list=c(rep('HF',21),rep('Healthy',9))
## 强制限定顺序
group_list <- factor(group_list,levels = c("HF","Healthy"),ordered = F)
#表达矩阵数据校正
table(group_list)
exprSet <- tpms_mRNA
boxplot(exprSet,outline=FALSE, notch=T,col=group_list, las=2)

library(limma) 
exprSet=normalizeBetweenArrays(exprSet)
boxplot(exprSet,outline=FALSE, notch=T,col=group_list, las=2)

#判断数据是否需要转换
range(exprSet)
exprSet <- log2(exprSet+1)
range(exprSet)
#差异分析：
dat <- exprSet
design=model.matrix(~factor( group_list ))
fit=lmFit(dat,design)
fit=eBayes(fit)
options(digits = 4)
# 获取差异表达基因
deg=topTable(fit,coef=2,adjust='BH',number = Inf)
head(deg) 

# 筛选
# We identified
# 126 mRNAs, 16 lncRNAs, and 42 miRNAs that were differentially expressed in HF versus control samples 
# by requiring the absolute log2-transformed fold change (FC)> 1 and adjusted p value < 0.05 by the BH method.

if(T){
  logFC_t=1
  deg$g=ifelse(deg$P.Value>0.05,'stable',
               ifelse( deg$logFC > logFC_t,'UP',
                       ifelse( deg$logFC < -logFC_t,'DOWN','stable') )
  )
  table(deg$g)
  head(deg)
}

table(deg$g)
# DOWN stable     UP 
# 116   9108     47
# 116+47=163
# 作者鉴定出126个差异表达mRNA

我们通过limma对fpkm矩阵按照作者的筛选选标准进行差异表达分析得到的差异基因数量和作者大致一致，所以认为差异基因数量很少主要是因为作者使用转录水平mRNA表达量fpkm矩阵，并进行了严格的低表达过滤,而非进行差异表达分析使用的方法

可以发现在过滤低表达mRNA中大部分转录本都被过滤掉了

# mRNA with low expression were excluded, defined as those with FPKM less than 1 in more than 80% samples.
keep <- rowSums(fpkm>1) >= 24  # 30 * 0.8

在转录组专辑中我们之前使用raw counts进行差异分析，过滤低表达基因一般标准是这样的：

keep <- rowSums(rawcount>0) >= floor(0.5*ncol(rawcount))

②同时，stringtie转录组定量是如何将转录本表达水平映射到基因上的也值得深入探究，因为作者是对转录本进行的差异表达分析而非基因

参考资料：
使用TPM/FPKM/RPKM进行差异分析真的可以消除系统误差吗？
stringTie:转录本组装和定量工具
https://blog.csdn.net/weixin_43569478/article/details/108079257
本质上，stringTie只提供了转录本水平的表达量，定量方式包括TPM和FPKM值两种。为了进行raw count的定量方式，官方提供了 prepED.py脚本，可以计算出raw count的表达量
https://github.com/gpertea/stringtie/blob/master/prepDE.py3

下面是我使用谷歌检索“How does stringtie transcriptome quantification map transcript expression levels to genes”chatGPT插件返回的答案：

StringTie is a software tool used to assemble and quantify transcriptomes from RNA-seq data. It uses a reference genome to map RNA-seq reads and identify transcripts. The transcript assembly process involves grouping and constructing transcripts using a graph-based approach. StringTie also quantifies the abundance of each transcript using read coverage information.To map transcript expression levels to genes, StringTie first assigns each transcript to a gene based on genomic location and strand orientation. It then calculates the abundance of transcripts for each gene by summing up the abundance of all transcripts assigned to that gene. The gene expression level is then obtained by summing up the transcript expression levels across all genes.StringTie also provides the option to merge transcriptomes from multiple samples, which can improve gene annotation and provide more accurate quantification of gene expression levels. Overall, StringTie provides a reliable and efficient method for transcriptome assembly and quantification.

根据使用TPM/FPKM/RPKM进行差异分析真的可以消除系统误差吗？得到的基因表达矩阵和转录本表达矩阵，我们对上面的对基因对应转录本表达量直接加和的说法进行验证：

可以发现确实是直接进行加和

那么作者是对转录本进行差异表达分析的而非我们推文中常见的基因表达量，这肯定也会给差异表达分析结果数量特征带来影响

tips:

感兴趣的同学可以走一遍作者的hisat2+stringtie+ballgown流程，获取基因水平的表达矩阵进行差异分析，看看和转录本差异表达分析结果有什么区别
不认为不同流程进行差异分析对差异表达基因数量有特别大的影响（我们这里limma差异表达基因数量也和作者用ballgown差不多），感兴趣的同学可以获取sra/fastq自己走一遍hisat2+featureCount+DESeq2的流程

那么转录本表达量差异分析和我们常见的基因有什么区别呢？

Differential gene and transcript expression analysis of RNA-seq experiments with TopHat and Cufflinks

https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3334321/

这篇文章虽然作者用的不是stringtie但其从同一数据集获得的差异表达结果中转录本确实更少

Gene-level differential analysis at transcript-level resolution

https://genomebiology.biomedcentral.com/articles/10.1186/s13059-018-1419-z

Abstract:
与RNA测序转录本差异分析相比，基因水平的差异表达分析更稳健，并且在实验上更可行。然而，使用基因计数进行统计分析可以掩盖转录物水平的动态。我们证明了“分析第一，聚合第二”，即将转录物分析得出的p值聚合以获得基因水平的结果，提高了灵敏度和准确性。我们提出的方法也可以应用于从读数的伪比对中获得的转录物兼容性计数，这避免了量化的需要，并且快速、准确且无模型。该方法推广到生物学的各个层面，我们展示了它在基因本体论中的应用。

3.如何评估PCA的效果，PCA效果好差异表达基因一定多吗

As shown in Fig. 3a–c, principal component analysis (PCA) was first performed for the expression profiles of mRNAs, lncRNAs, and miRNAs. All three RNA types could largely distinguish HF patients from healthy controls.

Interestingly, lncRNA expression profiles had the best performance in distinguishing HF from healthy controls, reinforcing the importance to further investigate their underlying mechanism in HF.

在这里小编提出疑问：①作者为什么认为三种RNA类型都可以很好区分实验分组，有没有统计量可以评估PCA的效果？，②为什么认为“lncRNA表达谱的性能最佳”，mRNA、lncRNA、miRNA进行PCA时，PC1和PC2可解释的方差并不相同，也可以进行比较吗？

就这个问题，我咨询了曾老师，这是曾老师的回答：

Q：还有一个问题曾老师，我以前看PCA效果都是直接大致看个图比如有没有重叠，有什么统计量可以评估PCA效果呢？作者为什么认为“lncRNA表达谱的性能最佳”，mRNA、lncRNA、miRNA进行PCA时，PC1和PC2可解释的方差并不相同，也可以进行比较吗？
A：从这里看，好像有两个维度可以比较，第一个维度PC1是最重要的，根据标注的解释度来看miRNA分组是最大的，第二个维度就是两个分组的距离，作者在这里可能只说明第二个维度，也就是两个分组的距离，可能是想说分组的PCA图没有交集，两个圈隔得很远。
不过你说的很对，好像没有什么统计量可以评估PCA，完全就是肉眼在看，（判断分组意义）HC Horizontal clustering 层次聚类可能好一点，或者其他度量方法。左边两个PCA可能确实解释度一般，mRNA两个分组差异体现在斜对角也比较奇怪。
我们一般来说就是看Within-group variance of the samples is smaller than the between-group variance in study.

根据曾老师的提示和交流群小伙伴的帮助下，我得知了一种新的判断分组是否有意义的方法：ANOSIM，放在这里抛砖引玉，希望有想法的小伙伴可以在评论区积极讨论

考虑到这里作者对stringtie定量获得的lncRNA还使用了其他软件进行鉴定、分析，我们在这里仅使用mRNA表达矩阵进行演示

# PCA
rm(list = ls())
library(FactoMineR)
library(factoextra)
load("mRNA.Rdata")
dat <- tpms_mRNA
dat <- as.data.frame(t(dat))
head(dat)
dat$group_list <- group_list
dat_pca <- PCA(dat[,-ncol(dat)], graph = FALSE)#画图仅需数值型数据，去掉最后一列的分组信息
p <- fviz_pca_ind(dat_pca,
                     geom.ind = "point", # 只显示点，不显示文字
                     col.ind = dat$group_list, # 用不同颜色表示分组
                     palette = c("#00AFBB", "#E7B800"),
                     addEllipses = T, # 是否圈起来，少于4个样圈不起来
                     legend.title = "Groups") + theme_bw()
p

差异分析——ANOSIM相似性分析

https://mp.weixin.qq.com/s/KMtmN5Uq-8G0ym_ZzeA6bA

ANOSIM分析（Analysis of Similarities）即相似性分析，主要用于分析高维数据组间相似性，为数据间差异显著性评价提供依据。在一些高维数据分析中，需要使用PCA、PCoA、NMDS等方法进行降维，但这些方法并不显示组间差异的显著性指标，因此可以使用ANOSIM分析解决此问题。
ANOSIM相似性分析是一种非参数检验，用来检验组间（两组或多组）差异是否显著大于组内差异，从而判断分组是否有意义。首先利用Bray-Curtis算法计算两两样品间的距离，然后将所有距离从小到大进行排序，并计算R和P值。R值用于不同组间属否存在差异，P值用于说明是否存在显著差异
ANOSIM was used to compare within- and between-group similarity through a distance measure, to test the null hypothesis that the average rank similarity between samples within a group is the same as the average rank similarity between samples belonging to different groups.

组间差异分析：Anosim

https://zhuanlan.zhihu.com/p/111985434

Anosim分析（Analysis of similarities）是一种基于置换检验和秩和检验的非参数检验方法，用来检验组间的差异是否显著大于组内差异，从而判断分组是否有意义

假如R>0，说明组内距离小于组间距离，也即分组是有效的，这与方差分析中比较组内方差与组间方差来判断的原理是类似的。由上面分析结果可以看到R=0.4613，大于零模型99%分位数0.290，因此p值为0.001，结果是显著的。

# ANOSIM
library(vegan)
anosim=anosim(dat[,-ncol(dat)], group_list, permutations=999)
summary(anosim)
# Call:
#   anosim(x = dat[, -ncol(dat)], grouping = group_list, permutations = 999) 
# Dissimilarity: bray 
# 
# ANOSIM statistic R: 0.4647 
# Significance: 0.001 
# 
# Permutation: free
# Number of permutations: 999
# 
# Upper quantiles of permutations (null model):
#   90%   95% 97.5%   99% 
#   0.113 0.152 0.183 0.212 
# 
# Dissimilarity ranks between and within classes:
#   0%    25%   50%    75% 100%   N
# Between  9 200.00 291.0 369.00  435 189
# HF       3  85.25 170.5 285.75  425 210
# Healthy  1  32.75  73.5 160.00  370  36

这里的permutations=999指的是进行随机置换permutations次，用于计算统计显著性。在Anosim分析中，会将各样品间的距离进行排列组合得到许多可能性，然后比较实际组间差异和随机排列组合的差异，通过计算R值来判断组间差异是否显著大于随机差异，进而确定该分组是否有意义。permutations的值越大，计算精度越高，但计算时间也会相应增加。通常permutations的值设置为999或者9999可以满足分析需要。

总结

造成该研究出现PCA效果好但差异表达基因数量较少现象的原因：

作者严格的低表达量转录本的过滤
作者对转录本进行差异分析而非我们常见的基因（stringtie将转录本表达量映射到基因采取的是直接加和）

同时，我们在这里提出疑问：①作者为什么认为三种RNA类型都可以很好区分实验分组，有没有统计量可以评估PCA的效果？，②为什么认为“lncRNA表达谱的性能最佳”，mRNA、lncRNA、miRNA进行PCA时，PC1和PC2可解释的方差并不相同，也可以进行比较吗？

附上了曾老师的见解并提供了一种新的判断分组意义的方法ANOSIM以供参考，希望能起到抛砖引玉的效果，有想法的小伙伴可以在评论区进行探讨

写在文末

强烈建议你推荐给身边的博士后以及年轻生物学PI，多一点数据认知，让他们的科研上一个台阶：

生物信息学马拉松授课（买一得五），你的生物信息学入门课（5月8号已开课，可以期待一下下个月的~）
144线程640Gb内存服务器共享一年仍然是仅需800
千呼万唤始出来的独享生物信息学云服务器

再怎么强调生物信息学数据处理的计算机基础都不为过，我把它粗略的分成基于R语言的统计可视化，以及基于Linux的NGS数据处理：

ixxmu / mp_duty

明明PCA区分非常好，但是差异基因数量很少？ #3437