芯片数据处理提示：package rta10cdf not installed？

芯片数据处理提示：package rta10cdf not installed？ by 生信菜鸟团

今天主要处理一下下面这篇文献中提到的多个火山图：

Nat Commun 2023 Jul 24;14(1):4455. PMID: 37488113，《An osteoinductive and biodegradable intramedullary implant accelerates bone healing and mitigates complications of bone transport in male rats》。

火山图如下：

文章主要大概

骨运输是一种由手术驱动的治疗大骨缺损的程序。然而，具有挑战性的并发症包括长时间实结、对接部位骨不连和针束感染。在这里，作者通过混合组织工程构建技术开发了一种骨诱导和可生物降解的髓内植入物，使骨形态发生蛋白-2能够持续传递作为辅助治疗。
在雄性大鼠骨运输模型中，从植入物中洗脱的骨形态发生蛋白-2加速骨形成和重塑，导致早期骨融合，如影像学、力学测试、组织学分析和 microarray 分析所示。

这里就提到了使用 microarray 转录组技术分析揭示早期骨融合过程，来看一下具体的实验设计：

取无骨移植体(BLK)或加入2 μg BMP-2 (IMI + B2)的骨移植体进行骨移植4 d，取无骨移植体或骨移植体的骨缺损对照组(CTRL)在截骨当日进行急性骨移植。

差异分析分组有：

BLK Distal vs CTRL Distal
IMI+B2 Proximal vs BLK Proximal
BLK Distal vs lMl+B2 Proxima
IMI+B2 Distal vs BLK Distal
IMI+B2 Distal vs lMI+B2 Proximal

文章中对应的数据集为：The microarray data are available in the GEO database under the accession numbers GEO: GSE200518

这个数据没有提供处理好的表达矩阵，使用 affy 包处理原始芯片CELL文件提示缺少 rta10cdf 包但是网上是搜不到这个包的，提到一个帖子：https://support.bioconductor.org/p/97990/，可以使用 oligo 包进行数据预处理。

eset.rma <- rma(AffyBatch_data)
'getOption("repos")' replaces Bioconductor standard repositories, see 'help("repositories", package = "BiocManager")' for
details.
Replacement repositories:
    CRAN: http://mirror.lzu.edu.cn/CRAN/
Error in getCdfInfo(object) : 
  Could not obtain CDF environment, problems encountered:
Specified environment does not contain RTA-1_0
Library - package rta10cdf not installed
Bioconductor - rta10cdf not available

使用 oligo 包进行数据预处理

## load packages
rm(list=ls())
options(timeout = 100000)  # 修改下载时间为10万秒
options(scipen = 20)       #不以科学计数法表示
library(GEOquery)
library(stringr)


## download
gse <- getGEO("GSE200518", destdir = "data/", getGPL = F)
gse

class(gse)
str(gse)

# 提取临床信息
pheno <- pData(phenoData(gse[[1]]))
colnames(pheno)
pheno_select <- pheno[, c("geo_accession","title")]
temp <- str_split(pheno_select$title, pattern = " ", n = 3, simplify = T)
pheno_select$group <- paste0(temp[,1], "_", temp[,2])
write.table(pheno_select, file = "data/GSE200518.group.xls", quote = F, sep = "\t", row.names = F)


# 查看注释平台gpl
gpl <- getGEO(filename="data/GPL22388_family.soft.gz")
class(gpl)
gpl_anno <- gpl@dataTable@table
colnames(gpl_anno)
head(gpl_anno)

id2name <- gpl_anno[,c("probeset_id" ,"gene_assignment","swissprot","SPOT_ID")]
id2name$gene <- str_split(id2name$gene_assignment,'//',simplify = T)[,2]
table(id2name$SPOT_ID)
# 去除基因前后的空格
id2name$gene <- str_trim(id2name$gene, side = "both")
# 过滤掉空的探针
id2name <- filter(id2name, SPOT_ID!="---")
id2name <- filter(id2name, SPOT_ID!="Unassigned")
id2name <- filter(id2name, gene!="")

# 选择
id2name_1 <- id2name[, c("probeset_id","gene")]
colnames(id2name_1) <- c("ID","GENE_SYMBOL")
head(id2name_1)


###################################################
# 提取表达矩阵
# http://math.usu.edu/jrstevens/stat5570/1.4.Preprocess_4up.pdf
## 使用oligo包进行数据预处理
BiocManager::install("oligo")
library(oligo)

setwd("data/GSE200518_RAW/")
celFiles <- list.celfiles(listGzipped=T)
celFiles

affy <- read.celfiles(celFiles)
eset <- rma(affy)
eset

# 获取表达数据并输出到表格
exprs <- exprs(eset)
exprs[1:5,1:5]

# 处理一下样本名
colnames(exprs) <- str_split(colnames(exprs), pattern = "_", n = 2, simplify = T)[,1]
exprs <- exprs %>% 
  as.data.frame() %>% 
  rownames_to_column(var = "ID")

# 合并
exp_symbol <- merge(id2name_1, exprs, by.x="ID", by.y="ID")

# 多对一取均值
library(limma)
exp_symbol2 <- avereps(exp_symbol[,-c(1,2)],ID = exp_symbol$GENE_SYMBOL) %>% 
  as.data.frame()

saveRDS(exp_symbol2, file = "../GSE200518.exp_symbol.rds")

exp_symbol2 <- exp_symbol2 %>% 
  rownames_to_column(var="Gene")
write.table(exp_symbol2, file = "../GSE200518.exp_symbol.xls", row.names = F, sep = "\t", quote = F )

接下来使用limma进行差异表达分析

文中筛选阈值为：All genes with adjusted p-value < 0.05 and fold change > 1.2 were used for the analysis

差异分组：

BLK Distal vs CTRL Distal
IMI+B2 Proximal vs BLK Proximal
BLK Distal vs lMl+B2 Proxima
IMI+B2 Distal vs BLK Distal
IMI+B2 Distal vs lMI+B2 Proximal

使用 BLK Distal vs CTRL Distal 作为示例，总共得到 324个上调，321个下调基因。

# load package
library(dplyr)
library(limma)
packageVersion("limma")


## 读取表达矩阵
df <- read.delim(file = "data/GSE200518.exp_symbol.xls", row.names = 1, header=TRUE,check.names =F)
head(df)


## 读取差异分组信息
gp <- read.delim(file = "data/GSE200518.group.xls", header=T,check.names =F)
gp

## BLK Distal vs CTRL Distal
gp <- filter(gp, group=="Control_Distal" | group=="Blank_Distal" )

## 提取分组信息对应的样本
df <- df[, gp$geo_accession]

# 去掉在所有样本中表达全为0的基因行
df <- df[rowSums(df)!=0, ]

## make the design matrix
## BLK Distal vs CTRL Distal
group_list <- factor(gp[,3], levels = c("Control_Distal","Blank_Distal")) # 对照组在前，处理组在后
design <- model.matrix(~group_list)
fit <- lmFit(df, design)
fit <- eBayes(fit, trend=TRUE)
result <- topTable(fit, coef=2, n=Inf, sort.by="P", adjust.method = "BH")


## 筛选显著差异表达基因
result$regulated <- "None"
result$regulated[ result$logFC>log2(1.2) & result$P.Value<0.05 ] <- "Up"
result$regulated[ result$logFC < -log2(1.2) & result$P.Value<0.05 ] <- "Down"

table(result$regulated)

out_dt <- result %>% rownames_to_column(var = "GeneID")
write.table(out_dt, file = "data/GSE200518.Limma.Diff.xls", quote = F, sep = "\t", row.names = F)

热图绘制

作者在热图上展示了一些他关注的marker基因，比如：

The regenerate site in BLK showed elevated expression of：osteoblast/osteocyte markers (Alpl, Ibsp, Dmp1) and chondrocyte markers (Acan, Col2a1, Col10a1), suggesting that DO promotes active endochondral ossification during the early phase of DO (Fig. 8c–h).
Eluting BMP-2 upregulated the expression of osteogenic markers (Alpl, Dmp1, Ibsp) and downregulated skeletal muscle markers (Mb, Myog, Myoz1) (Fig. 8c–h), suggesting that BMP-2 promoted bone regeneration but prevented skeletal muscle differentiation at the docking site. T

rm(list = ls())
library(ggplot2)
library(tidyverse)
 
# 读取差异分析结果
data <- read.table(file = "data/GSE200518.Limma.Diff.xls", header = T, sep = "\t")
head(data)


# 绘制火山图
gene_select <- c("Fzd2", "Bmp8a", "Alpl", "Acan", "Mmp13", "Bmp1", "Col2a1", "Dmp1", 
                 "Col10a1", "Ibsp", "Bmp6", "Thbs2", "Phex", "Mmp9", "Thbs1", "Tgfb3", "Bmp3", "Fzd5", 
                 "Sfrp5", "Serpina3n", "Cd5l")

data_lable <- data[match(gene_select, data$GeneID), ]

p <- ggplot(data=data, aes(x=logFC, y=-log10(P.Value),color=regulated)) + 
  geom_point(size=1) + # 绘制点图，size为点的大小
  theme_set(theme_set(theme_bw(base_size=12))) +  # 设置绘图主题，选择了ggplot2自带的theme_bw
  xlab("log2(Fold change)") +  # 添加x轴标题
  ylab("-log10(PValue)") +     # 添加x轴标题
  labs(colour = "State") +     # 修改图例的标题
  scale_colour_manual(values = c(Down='darkblue',None='grey',Up='red')) +   # 修改点的颜色
  geom_vline(xintercept=c(-(log2(1.2)),log2(1.2)),lty=2,col="black",lwd=0.6) +  # 添加两条FC阈值竖线
  geom_hline(yintercept = -log10(0.05),lty=2,col="black",lwd=0.6)           # 添加PValue阈值横线
p

p1 <- p + 
  #geom_point(size = 3, shape = 1, data = label) +
  ggrepel::geom_text_repel(data=data_lable, aes(label = GeneID), color="black" ,max.overlaps = 20,
                           box.padding = unit(0.5, "lines"),point.padding = unit(0.8, "lines"),segment.color = "black")
p1

跟原文比，我们这个结果还是有点区别的，包括芯片标准化方法，差异分析算法，绘图横坐标等不一样。

结果如下：

下次见~

ixxmu / mp_duty