ArrayExpress数据集，又是安捷伦芯片？！

ArrayExpress数据集，又是安捷伦芯片？！ by 生信菜鸟团

写在前面：菜鸟入坑不久，除了GEO和TCGA之外，认得的数据库资源寥寥可数，恰巧接到了一个任务，是分析上传到ArrayExpress数据的一个数据集，那就……试试看？

ArrayExpress简介

ArrayExpress < BioStudies < EMBL-EBI

功能基因组学数据收集（ArrayExpress），储存了来自高通量功能基因组学实验的数据，并向研究界提供数据供再利用。根据社区指南，一项研究通常包含元数据，如详细的样本注释、协议、处理后的数据和原始数据。来自高通量测序研究的原始序列读数被中介到欧洲核苷酸档案馆（ENA），并提供链接以从ENA下载序列读数。数据可以通过其专用的提交工具Annotare提交给ArrayExpress集合。

在尝试分析数据集之前，有两篇文献建议大家先读一下，了解这个数据库的基本架构以及用R包下载和处理数据集的protocol.

ArrayExpress--a public database of microarray experiments and gene expression profiles - PubMed
Importing ArrayExpress datasets into R/Bioconductor - PMC

通常，一个数据集提供的数据会包含以下的内容⬇

MAGE-TAB is a tabular MIAME supportive file format and MAGE-TAB documents consist of five different types of files.

(i) A ‘raw’ zip archive contains the raw data files, i.e. the files produced by the microarray image analysis software, such as CEL files for Affymetrix GeneChips or GPR files from GenePix.
(ii) A ‘data matrix’ file contains processed values, as provided by the data submitter, converted into a common tab-delimited text format representing a matrix of numbers.
(iii) The Sample and Data Relationship Format (SDRF) tab-delimited file contains the relationships between samples and arrays, as well as sample properties and experimental factors, as provided by the data submitter.
(iv) The Array Design Format (ADF) tab-delimited file describes the design of an array, i. e. the sequence located at each feature on the array and annotation of the sequences.
(v) The Investigation Description Format (IDF) tab-delimited file contains top-level information about the experiment including title, description, submitter contact details and protocols.

其次，如何将以上的信息匹配到数据分析的过程中呢？

The ArrayExpress package uses the zip archive with either the raw or the processed data to build the assayData component.
The SDRF ﬁle is used to construct the phenoData table.
The ADF ﬁle is used to construct the featureData.
The IDF ﬁle to ﬁll in the experimentData components.

当然，ArrayExpress 官方教程还提供了如何检索以及下载数据的方式
（https://www.ebi.ac.uk/training/online/courses/array-express-discover-functional-genomics-data-quickly-and-easily/files-and-download/)

数据集

今天我们要分析的数据集是这样的：

A complete file that provides all of the mRNAs and miRNAs present in the arrays used in this study, as well as the experimental conditions, is available online at the ArrayExpress database (http://www.ebi.ac.uk/arrayexpress), Array Express accession EMTAB-3564 (for mRNA hybridizations) and E-MTAB—3563 (for miRNA hybridizations).

以mRNA为例，文章提到“mRNA Expression Profiling with Agilent 4 × 44 K Rat Oligoarrays ”

Agilent的芯片之前有分享过，可以借鉴

数据集下载地址在这里：

[E-MTAB-3564 https://www.ebi.ac.uk/biostudies/arrayexpress/studies/E-MTAB-3564]

代码实操

step01 数据下载

有两种方式，可以用函数，但因为我是第一次用这个数据库，所以选择了保守的方式直接下载

从ArrayExpress数据库下载一个芯片做注释_https://blog.csdn.net/u013429737/article/details/120196551)，这篇文章作者有详细的介绍。至于手动的方式就不用多说，进入数据集链接下载就行。

step02 原始数据处理

参考小洁老师的帖子：

# reference:https://www.jianshu.com/p/19ff109b6409

library(ArrayExpress)
library(limma)
library(utils)
library(stringr)

getwd()

读取数据：

dir <- "../E-MTAB-3564/" ##这个文件夹里只有样本的txt文件哦
rawdat <- list.files(dir,pattern = "_\\d.txt") ##仅读取样本信息
pd <-read.delim("../E-MTAB-3564.sdrf.txt") ##读取有临床信息的文件

获取分组信息：

#调整pd与rawdata的顺序一致
raw_order = gsub(".txt","",rawdat)
pd$Source.Name
x1 <- rep(c("medulactrl_","medulaovx_","calvariactrl_","calvariaovx_"),each=3)
x2 <- rep(1:3,each=1,times=4)
pd$group <- paste0(x1,x2)
pd$group
pd = pd[match(raw_order,pd$group),]

group_list <- ifelse(str_detect(pd$group,"calvariactrl"),"calvariactrl",
                    ifelse(str_detect(pd$group,"calvariaovx"),"calvariaovx",
                           ifelse(str_detect(pd$group,"medulactrl"),"medulactrl",
                     "medulaovx")))
table(group_list)

读取原始数据：

raw_dir = "../E-MTAB-3564/"
raw_datas = paste0(raw_dir,dir(raw_dir))
x <- read.maimages(raw_datas,
                   source="agilent", 
                   green.only=TRUE,
                   other.columns = "gIsWellAboveBG")
dim(x)

背景校正和标准化：

y <- backgroundCorrect(x, method="normexp")
y <- normalizeBetweenArrays(y, method="quantile")
class(y)

基因过滤：

Control <- y$genes$ControlType==1L;table(Control)##去除对照探针
NoSymbol <- is.na(y$genes$GeneName);table(NoSymbol)##去除匹配不到genesymbol的探针
IsExpr <- rowSums(y$other$gIsWellAboveBG > 0) >= 6;table(IsExpr)##去除不表达的探针，去除的标准是：至少在一半样本中高于背景，通过y(other)gIsWellAboveBG来判断
Isdup <- duplicated(y$genes$GeneName);table(Isdup)##测到多次的基因，只保留一个探针
yfilt <- y[!Control & !NoSymbol & IsExpr & !Isdup, ]
dim(yfilt)

现在应该拿到熟悉的矩阵了吧

exp = yfilt@.Data[[1]]
colnames(exp) <- str_split(colnames(exp),"/",simplify = T)[,3]
boxplot(exp,las=2)
exp[1:2,1:2]

基因注释：

anno = yfilt$genes
nrow(anno);nrow(exp)
rownames(exp)=rownames(anno)
ids = unique(anno$GeneName)
exp = exp[!duplicated(anno$GeneName),]
rownames(exp) = ids
exp[1:4,1:4]

range(exp)

##保存
getwd()
save(group_list,exp,file = "step1-output.Rdata")

step03 差异分析

多组间的数据该如何做差异分析呢？

rm(list = ls())

load("step1-output.Rdata")

# 差异分析 --------------------------------------------------------------------
###构建表达矩阵，非常重要！！！
library(limma)
design <- model.matrix(~0+factor(group_list))
colnames(design)=levels(factor(group_list))
rownames(design)=colnames(exp)
px <- colnames(design);px
contrast.matrix <- makeContrasts(paste0(px[2],"-",px[1]), 
                                 paste0(px[4],"-",px[3]), 
                                 paste0(px[3],"-",px[1]), 
                                 paste0(px[4],"-",px[2]),
                                 levels=design) ##实验组在前，对照组在后
fit <- lmFit(exp, design)
fit2 <- contrasts.fit(fit, contrast.matrix)
fit2 <- eBayes(fit2)

deg = lapply(1:ncol(design), function(i){
  topTable(fit2,coef = i,number = Inf)
})
names(deg) = colnames(contrast.matrix)

差异分析阈值[Unpaired T test statistical analysis (P≤0.005) with a fold change ≥2.0 was used for mRNA]

logFC_t=2.0
P.Value_t=0.005
library(dplyr)
library(clusterProfiler)
library(org.Rn.eg.db) ##注意物种
for(i in 1:length(deg)){
  change=ifelse(deg[[i]] $P.Value>P.Value_t,'stable', 
                ifelse( deg[[i]]$logFC >logFC_t,'up', 
                        ifelse( deg[[i]]$logFC < -logFC_t,'down','stable') )
  )
  deg[[i]] <- mutate(deg[[i]],change)
  print(table(deg[[i]]$change))
  deg[[i]] <- mutate(deg[[i]],v = -log10(P.Value))

#加ENTREZID列，后面富集分析要用
deg[[i]]$symbol <- rownames(deg[[i]])
s2e <- bitr(unique(deg[[i]]$symbol), fromType = "SYMBOL",
            toType = c( "ENTREZID"),
            OrgDb = org.Rn.eg.db)
deg[[i]] <- inner_join(deg[[i]],s2e,by=c("symbol"="SYMBOL"))
}
save(deg,group_list,file  ="DEG.Rdata")

step04 可视化

PCA、热图和火山图是不可或缺的，想想怎么能把它们放到一起呢？

###主成分分析
dat=as.data.frame(t(exp))
library(FactoMineR)#画主成分分析图需要加载这两个包
library(factoextra) 
# pca的统一操作走起
dat.pca <- PCA(dat, graph = FALSE)
p1 <- fviz_pca_ind(dat.pca,
                   geom.ind = "point", # show points only (nbut not "text")
                   col.ind = group_list, # color by groups
                   #palette = c("#00AFBB", "#E7B800"),
                   addEllipses = TRUE, # Concentration ellipses
                   legend.title = "Groups"
)

#热图 
cg=names(tail(sort(apply(exp,1,sd)),500))
if(!require(pheatmap))install.packages("pheatmap")
library(pheatmap)
n=exp[cg,]
annotation_col=data.frame(group=group_list)
rownames(annotation_col)=colnames(n) 
p2 <- pheatmap(n,
               show_colnames =F,
               show_rownames = F,
               cluster_cols = F,
               annotation_col=annotation_col,
               scale = "row")
require(ggplotify)
p2= as.ggplot(p2)

##火山图
library(tinyarray)
vo = lapply(1:length(deg),function(k){
  draw_volcano(deg[[k]],
               lab = colnames(contrast.matrix)[k],
               pkg = 4,
               logFC_cutoff = logFC_t,
               pvalue_cutoff = P.Value_t,
               symmetry = T)
})

拼图

library(patchwork)
plotlist = c(list(p2,p1),vo)
layout(matrix(c(1,1,2,2,3,3,3,3), 2, 4,byrow = TRUE))
patchwork <- wrap_plots(plotlist) 
ggsave(patchwork,filename = "../output/patchwork.pdf",width = 15,height = 15)

那么，你感兴趣的到底是哪两个组的差异分析结果呢？anyway，我们都搞定了，任君挑选~~~

ixxmu / mp_duty