原来在R中也能处理fastqc by YuYuFiSH

做过RNA-Seq上游的小伙伴肯定知道

主要五个有步骤

准备参考基因组文件下载及GTF文件下载
质控（fastqc, multiqc）
数据过滤 (trim_galore)
序列比对 (star)
表达定量 (featureCounts)

而质量控制（QC）是其中关键的步骤之一。常用软件FastQC可以为每一个样品生成一个html报告和一个‘zip’ 文件，zip解压之后生成fastqc_data.txt和summary.txt的文件，里面包含了测序样品的质量信息

MultiQC软件在其基础上，把所有样本的质量信息综合起来，形成html报告，方便大家快速查看和评估

但以上两个软件均需要在linux环境下执行。当后续工作需要自定义报告或者美化结果图时，在R语言下会更加容易实现，如R包fastqcr，但目前只仅限于 FastQC 文件，并且对于多个样本结果的整合可视化还是比较困难的。

因此，2020年Christopher M Ward等人推出了R包ngsReports，可以将FastQC结果和常见的NGS log文件输入到R语言中。还可以用交互方式轻松准确地比较各个处理阶段的输出，生成简单、可定制的报告

文献：

ngsReports: a Bioconductor package for managing FastQC reports and other NGS related log files | Bioinformatics | Oxford Academic (oup.com)

说明书：

An Introduction To ngsReports (bioconductor.org)
ngsReports: Load FastqQC reports and other NGS related files (bioconductor.org)

报告解读：

测序数据质量控制之FastQC

1. R包安装

#BiocManager::install("ngsReports")
library(ngsReports)

2. 数据载入

这里直接载入R包自带的6个fastqc数据作例子

转换成 FastqcDataList 格式

# Get the files included with the package
fileDir <- system.file("extdata", package = "ngsReports")
files <- list.files(fileDir, pattern = "fastqc.zip$", full.names = TRUE)
files
# Load the FASTQC data as a FastqcDataList object
fdl <- FastqcDataList(files)
# FastqcDataList for 6 files.

> files
[1] "D:/R-4.1.1/library/ngsReports/extdata/ATTG_R1_fastqc.zip"
[2] "D:/R-4.1.1/library/ngsReports/extdata/ATTG_R2_fastqc.zip"
[3] "D:/R-4.1.1/library/ngsReports/extdata/CCGC_R1_fastqc.zip"
[4] "D:/R-4.1.1/library/ngsReports/extdata/CCGC_R2_fastqc.zip"
[5] "D:/R-4.1.1/library/ngsReports/extdata/GACC_R1_fastqc.zip"
[6] "D:/R-4.1.1/library/ngsReports/extdata/GACC_R2_fastqc.zip"

3. 十二大主题报告

getModule(fdl[[1]], "Summary")
# # A tibble: 12 x 3
# Filename      Status Category                    
# <chr>         <chr>  <chr>                       
# 1 ATTG_R1.fastq PASS   Basic Statistics  基本信息          
# 2 ATTG_R1.fastq FAIL   Per base sequence quality  各个位置测序质量
# 3 ATTG_R1.fastq WARN   Per tile sequence quality 每个tile的测序质量  
# 4 ATTG_R1.fastq PASS   Per sequence quality scores 序列质量分数
# 5 ATTG_R1.fastq FAIL   Per base sequence content  序列碱基分布
# 6 ATTG_R1.fastq FAIL   Per sequence GC content   序列平均GC分布  
# 7 ATTG_R1.fastq PASS   Per base N content   N含量分布        
# 8 ATTG_R1.fastq PASS   Sequence Length Distribution 序列长度分布
# 9 ATTG_R1.fastq FAIL   Sequence Duplication Levels 重复序列的频率
# 10 ATTG_R1.fastq FAIL  Overrepresented sequences 重复序列  
# 11 ATTG_R1.fastq FAIL  Adapter Content 接头序列            
# 12 ATTG_R1.fastq FAIL  Kmer Content 计算每个重复段序列出现的次数

4. 生成html文件

# Copy these files to tempdir() to avoid overwriting
# any files in the package directory
tempdir()
file.copy(files, tempdir(), overwrite = TRUE)
writeHtmlReport(fastqcDir = tempdir())

主要功能

上面报告中每一条结果对应的函数

plotReadTotals(fdl)
plotSummary(fdl)

plotBaseQuals(fdl)
plotSeqQuals(fdl)
plotSeqContent(fdl)
plotGcContent(fdl)
plotSeqLengthDistn(fdl)
plotDupLevels(fdl)
plotAdapterContent(fdl)
plotOverrep(fdl)

5.shiny交互报告

#remotes::install_github("UofABioinformaticsHub/shinyNgsreports")
library(ngsReports)
library(shinyNgsreports)

fastqcShiny(fdl)

5. 定制你的结果

以 plotSummary 作例子，可以看帮助文档得到它的参数基本格式

格式如下：

# ?plotSummary
## S4 method for signature 'FastqcDataList'
plotSummary(
x,
usePlotly = FALSE,
labels,
pwfCols, # 颜色
cluster = FALSE,
dendrogram = FALSE,
...,
gridlineWidth = 0.2,
gridlineCol = "grey20"
)

其实，大概猜到，完全可以加上ggplot2进行修饰

This uses the standard ggplot2 syntax to create a three colour plot. The output of this function canbe further modified using the standard ggplot2 methods if required.

可以通过调整颜色进一步美化

这个时候就要看大家ggplot2的功底了！

# 1.展示颜色
getColours(pwf)
# PASS      WARN      FAIL       MAX 
# "#00CC00" "#E6E633" "#CC3333" "#FFFFFF"

# 2. 透明度
pwf2 <- setAlpha(pwf, 0.7)

# 3.重新定义颜色
# 需要rgb格式
# setColours(object, PASS, WARN, FAIL, MAX)
pwf2=setColours(pwf,rgb(0.6,0.86,0.63),
                rgb(0.99,0.68,0.38),
                rgb(0.83,0.24,0.31),
                rgb(1,1,1))
# 4.重新画图
plotSummary(fdl,
            pwfCols=pwf2,# 颜色
            gridlineWidth=0.8, # 线条宽度
            gridlineCol = "black")+ # 线条颜色
  theme(axis.text= element_text(face = 'bold'))+ # 坐标轴字体
  labs(x="",y="") # 坐标轴标签

授人以鱼不如授人以渔

大家完全可以美化其他的结果图

6. 导入log日志

除了处理fastqc文件，还可以导入多个文件输出的log日志，不过每次只支持一种工具

如下：

Adapter removal and trimming

AdapterRemoval
cutadapt
trimmomatic

Mapping and alignment

bowtie
bowtie2
hisat2
samtools flagstat
STAR
picard MarkDuplicates

Transcript/gene quantification

featureCounts

Genome assembly

BUSCO
Quast

例如看基因比对结果

fls <- c("bowtiePE.txt", "bowtieSE.txt")
bowtieLogs <- system.file("extdata", fls, package = "ngsReports")
df <- importNgsLogs(bowtieLogs, type = "bowtie")

df %>%
  dplyr::select("Filename", starts_with("Reads")) %>%
  pander(
    split.tables = Inf,
    style = "rmarkdown", 
    big.mark = ",",
    caption = "Select columns as an example of output from bowtie."
  )

ixxmu / mp_duty

原来在R中也能处理fastqc #2194