Quality control of code

cathreenj commented 5 years ago

Hi, I reanalysed my Illumina miseq data and found some discrepancies between the old and my new results. I will post my code and the R output here in the hope of receiving some feedback on the accuracy of my coding... I would very much appreciate it if someone could tell me if what I did is correct. Thank you very much.

This section demonstrates the “full stack” of amplicon bioinformatics: construction of the sample-by-sequence

feature table from the raw reads, assignment of taxonomy, and creation of a phylogenetic tree relating the sample sequences.

First we load the necessary packages.

library("knitr") library("BiocStyle") .cran_packages <- c("ggplot2", "gridExtra") .bioc_packages <- c("dada2", "phyloseq", "DECIPHER", "phangorn") .inst <- .cran_packages %in% installed.packages() if(any(!.inst)) {

install.packages(.cran_packages[!.inst])

} .inst <- .bioc_packages %in% installed.packages() if(any(!.inst)) {

source("http://bioconductor.org/biocLite.R")

biocLite(.bioc_packages[!.inst], ask = F)

}
Load packages into session, and print package version

sapply(c(.cran_packages, .bioc_packages), require, character.only = TRUE) ggplot2 gridExtra dada2 phyloseq DECIPHER phangorn TRUE TRUE TRUE TRUE TRUE TRUE set.seed(100) getwd() [1] "/home/aut/AG_Loibner/Daten/Seq_data_mcra/Map9/Reads_retrimmed/Original_reads" setwd("/home/aut/AG_Loibner/Daten/Seq_data_mcra/Map9/Reads_retrimmed/Original_reads/") getwd() [1] "/home/aut/AG_Loibner/Daten/Seq_data_mcra/Map9/Reads_retrimmed/Original_reads" miseq_path <- "./MCRA_Lehen.fq/" # CHANGE to the directory containing the fastq files after unzipping. list.files(miseq_path) [1] "M1-T1_R1.fastq" "M1-T1_R2.fastq" "M2-T1_R1.fastq" "M2-T1_R2.fastq" "M4-T2_R1.fastq" "M4-T2_R2.fastq" "M5-T2_R1.fastq" "M5-T2_R2.fastq" [9] "M6-T2_R1.fastq" "M6-T2_R2.fastq"

Inspect read quality, sort reads, Trim and filter

Sort ensures forward/reverse reads are in same order

fnFs <- sort(list.files(miseq_path, pattern="_R1.fastq")) fnRs <- sort(list.files(miseq_path, pattern="_R2.fastq"))

Extract sample names, assuming filenames have format: SAMPLENAME_XXX.fastq

sampleNames <- sapply(strsplit(fnFs, "_"), [, 1) fnFs <- file.path(miseq_path, fnFs) fnRs <- file.path(miseq_path, fnRs) fnFs[1:3] [1] "./MCRA_Lehen.fq//M1-T1_R1.fastq" "./MCRA_Lehen.fq//M2-T1_R1.fastq" "./MCRA_Lehen.fq//M4-T2_R1.fastq" fnRs[1:3] [1] "./MCRA_Lehen.fq//M1-T1_R2.fastq" "./MCRA_Lehen.fq//M2-T1_R2.fastq" "./MCRA_Lehen.fq//M4-T2_R2.fastq"

Most Illumina sequencing data shows a trend of decreasing average quality towards the end of sequencing reads.

Plot quality of the first two forward reads:

plotQualityProfile(fnFs[1:5]) plotQualityProfile(fnRs[1:5])

pdf("qualityProfile.pdf")

toPlot<-c() for (i in 1:length(fnFs)){

toPlot<-c(toPlot,fnFs[i],fnRs[i])

} head(toPlot) [1] "./MCRA_Lehen.fq//M1-T1_R1.fastq" "./MCRA_Lehen.fq//M1-T1_R2.fastq" "./MCRA_Lehen.fq//M2-T1_R1.fastq" "./MCRA_Lehen.fq//M2-T1_R2.fastq" [5] "./MCRA_Lehen.fq//M4-T2_R1.fastq" "./MCRA_Lehen.fq//M4-T2_R2.fastq" plotQualityProfile(fnFs[1:length(fnFs)]) plotQualityProfile(fnFs[1:length(fnRs)])

ggsave("qualityPlot.svg",

plotQualityProfile(toPlot)+

geom_hline(yintercept=30)+

geom_hline(yintercept=20)+

geom_vline(xintercept=100)+

geom_vline(xintercept=245)

) Saving 8.72 x 7 in image

Save plot, add horizontal line

Based on the quality analysis we will trimm at position 245 for the forward strand and 100 for the reverse strand

Place filtered files in filtered/ subdirectory

filt_path<-file.path(".","filtered") if(!file_test("-d", filt_path)) dir.create(filt_path) filtFs <- file.path(filt_path, basename(fnFs)) filtRs <- file.path(filt_path, basename(fnRs))

out <- filterAndTrim(fnFs, filtFs, fnRs, filtRs,

trimLeft=c(58,64), maxN=0, maxEE=c(2,6), minQ=2,

compress=FALSE, verbose=T, multithread=TRUE) # trimLeft used to remove forward and reverse primer sequences print(out) reads.in reads.out M1-T1_R1.fastq 14367 5012 M2-T1_R1.fastq 20036 4193 M4-T2_R1.fastq 19142 2519 M5-T2_R1.fastq 16947 1839 M6-T2_R1.fastq 18387 4534

derepFs <- derepFastq(filtFs) Not all sequences were the same length. Not all sequences were the same length. Not all sequences were the same length. Not all sequences were the same length. Not all sequences were the same length. derepRs <- derepFastq(filtRs) Not all sequences were the same length. Not all sequences were the same length. Not all sequences were the same length. Not all sequences were the same length. Not all sequences were the same length. sam.names <- sapply(strsplit(basename(filtFs), "_"), [, 1) names(derepFs) <- sam.names names(derepRs) <- sam.names

ddF<-dada(derepFs[1:5], err=NULL, selfConsist=TRUE) Initializing error rates to maximum possible estimate. Sample 1 - 5012 reads in 2340 unique sequences. Sample 2 - 4193 reads in 2066 unique sequences. Sample 3 - 2519 reads in 1255 unique sequences. Sample 4 - 1839 reads in 1118 unique sequences. Sample 5 - 4534 reads in 1791 unique sequences. selfConsist step 2 selfConsist step 3 selfConsist step 4 Convergence after 4 rounds. ddR<-dada(derepRs[1:5], err=NULL, selfConsist=TRUE) Initializing error rates to maximum possible estimate. Sample 1 - 5012 reads in 3096 unique sequences. Sample 2 - 4193 reads in 2622 unique sequences. Sample 3 - 2519 reads in 1607 unique sequences. Sample 4 - 1839 reads in 1292 unique sequences. Sample 5 - 4534 reads in 2558 unique sequences. selfConsist step 2 selfConsist step 3 selfConsist step 4 selfConsist step 5 Convergence after 5 rounds.

dadaFs <- dada(derepFs, err=ddF[[1]]$err_out, pool=TRUE) 5 samples were pooled: 18097 reads in 7175 unique sequences. dadaRs <- dada(derepRs, err=ddR[[1]]$err_out, pool=TRUE) 5 samples were pooled: 18097 reads in 9851 unique sequences.

mergers <- mergePairs(dadaFs, derepFs, dadaRs, derepRs)

################################################################################ #

Mapping

# ################################################################################ seqtab.all <- makeSequenceTable(mergers[!grepl("Mock", names(mergers))]) The sequences being tabled vary in length. seqtab <- removeBimeraDenovo(seqtab.all) As of the 1.4 release, the default method changed to consensus (from pooled). seqtab.nochim <- removeBimeraDenovo(seqtab, method="consensus", multithread=TRUE, verbose=TRUE) Identified 0 bimeras out of 32 input sequences.

Here we need to set the taxonomy to the mcrA fasta file

We have to pass a file formatted in the following way to assignTaxonomy

>Level1;Level2;Level3;Level4;Level5;Level6;

ACCTAGAAAGTCGTAGATCGAAGTTGAAGCATCGCCCGATGATCGTCTGAAGCTGTAGCATGAGTCGATTTTCACATTCAGGGATACCATAGGATAC

>Level1;Level2;Level3;Level4;Level5;

CGCTAGAAAGTCGTAGAAGGCTCGGAGGTTTGAAGCATCGCCCGATGGGATCTCGTTGCTGTAGCATGAGTACGGACATTCAGGGATCATAGGATAC

We got the list of genes from FunGen, selecting the 10000 best genes (from 10484)

Yang, Sizhong; Liebner, Susanne; Alawi, Mashal; EbenhÃ¶h, Oliver; Wagner, Dirk (2014): Supplement to: Taxonomic database and cutoff value for processing mcrA gene 454 pyrosequencing data by MOTHUR. Deutsches GeoForschungsZentrum GFZ. http://doi.org/10.5880/GFZ.4.5.2014.001

Sequenzdatenbank umformatieren von >Accno. Seq to >taxonomy +Seq

getwd()

ref_fasta <- "/home/aut/AG_Loibner/Daten/Seq_data_mcra/ref_db_mcra/db_mcra_genus.fasta" taxtab <- assignTaxonomy(seqtab, refFasta = ref_fasta, tryRC=T, multithread=T)

taxtabCath <- assignTaxonomy(seqtab, refFasta = ref_fastaCath,tryRC=T,multithread=T)

Add species. Mapping must 100%

taxtab<-addSpecies(taxtab,"./mcrASequenceDBAssignSpecies.fa",verbose=T)

help("addSpecies")

Trick to add species without 100% mapping

taxtabWithSpec <- assignTaxonomy(seqtab, refFasta="./mcrASequenceDBAssignTaxonomyWithSpecies.fa",tryRC=T,multithread=T,verbose=T)

colnames(taxtab) <- c("Kingdom", "Phylum", "Class", "Order", "Family", "Genus")

Prepare phyloseq elements

library(DECIPHER)

To construct a phylogenetic tree:

seqs <- getSequences(seqtab)

names(seqs) <- seqs # This propagates to the tip labels of the tree

alignment <- AlignSeqs(DNAStringSet(seqs), anchor=NA)

#

Construct phylogenetic tree

phang.align <- phyDat(as(alignment, "matrix"), type="DNA")

dm <- dist.ml(phang.align)

treeNJ <- NJ(dm) # Note, tip order != sequence order

fit = pml(treeNJ, data=phang.align)

#

negative edges length changed to 0!

#

fitGTR <- update(fit, k=4, inv=0.2)

fitGTR <- optim.pml(fitGTR, model="GTR", optInv=TRUE, optGamma=TRUE,

rearrangement = "stochastic", control = pml.control(trace = 0))

detach("package:phangorn", unload=TRUE)

Combine data into a phyloseq object

mimarks_path <- "./SampleInformation.csv" samdf<-read.csv(mimarks_path, header=TRUE) Error in file(file, "rt") : cannot open the connection In addition: Warning message: In file(file, "rt") : cannot open file './SampleInformation.csv': No such file or directory samdf$SampleID<-samdf$sample_id # taxa.print <- taxtab # Removing sequence rownames for display only rownames(taxa.print) <- NULL

head(taxa.print)

head(seqtab.nochim)

samdf <- read.table("/home/aut/AG_Loibner/Daten/Seq_data_mcra/meta.data.txt", header = TRUE) print(samdf) SampleID Type Depth.m. Site Month Year 1 M1-T1 Reservoir 1200 LEH-002 November 2016 2 M2-T1 Reservoir 1200 LEH-002 November 2016 3 M4-T2 Reservoir 1350 LEH-002 November 2016 4 M5-T2 Reservoir 1350 LEH-002 November 2016 5 M6-T2 Reservoir 1350 LEH-002 November 2016 rownames(samdf) <- rownames(seqtab.nochim) ps <- phyloseq(tax_table(taxtab), sample_data(samdf),

otu_table(seqtab, taxa_are_rows = FALSE)) library("phyloseq") packageVersion("phyloseq") [1] ‘1.22.3’

remotes::install_github("cpauvert/psadd")

library("psadd") plot_krona(ps,variable="SampleID",output="test2plot") Writing test2plot.html... Warning message: In dir.create(output) : 'test2plot' already exists

ps2 = filter_taxa(ps, function(x) mean(x) > 0.1, TRUE) ps2 phyloseq-class experiment-level object otu_table() OTU Table: [ 32 taxa and 5 samples ] sample_data() Sample Data: [ 5 samples by 6 sample variables ] tax_table() Taxonomy Table: [ 32 taxa by 6 taxonomic ranks ] plot_krona(ps2,variable="SampleID",output="ps2plot") Writing ps2plot.html... Warning message: In dir.create(output) : 'ps2plot' already exists ps3 = transform_sample_counts(ps2, function(x) x / sum(x) ) ps3 phyloseq-class experiment-level object otu_table() OTU Table: [ 32 taxa and 5 samples ] sample_data() Sample Data: [ 5 samples by 6 sample variables ] tax_table() Taxonomy Table: [ 32 taxa by 6 taxonomic ranks ]

benjjneb commented 5 years ago

On a brief glance, I would be a bit concerned about the small number of ASVs ("OTUs") that you are getting at the end of this. Can you tell us what amplicon region you are sequencing? What is your primer set? I suspect that you have cut your reads too short (truncLen too small) and are losing almost all your reads during merging because they no longer overlap.

Also, the figures didn't seem to post in your message, could you possibly upload in particular the quality profiles?

cathreenj commented 5 years ago

Hi benjjneb, thanks a lot for taking the time. First things first: I sequenced the mcra gene, coding for the Alpha subunit of the Methyl Com Reductase. The Primers were 1046f and 1435r. I used a corresponding reference database. Regarding the truncLen, as far as I know I didn't use the truncLen parameter... I know from the raw sequence summary statistics, that a substantial proportion of my reads only had a length of around 20 bp... so I guess, that there was a problem with the sequencing of my samples. I added my quality plots in this comment.

Thank you again

Cathrine

qualityPlot

benjjneb commented 5 years ago

I sequenced the mcra gene, coding for the Alpha subunit of the Methyl Com Reductase. The Primers were 1046f and 1435r. I used a corresponding reference database.

So the amplicons will be ~400 nts in length? Do you have a prior expectation of how tight the length distro is for mcrA (I'm not familiar with these locus).

Regarding the truncLen, as far as I know I didn't use the truncLen parameter...

I see. That is probably OK, although if the amplicon is consistently shorter than 410 nts I would probably remove a bit of the end of the reverse reads given the quality drop there, something like truncLen=c(0, 200).

I know from the raw sequence summary statistics, that a substantial proportion of my reads only had a length of around 20 bp... so I guess, that there was a problem with the sequencing of my samples.

That could explain some of the loss of reads. You could also filter those out with minLen=c(50, 50) e.g. in the filtering step. The key question is how many reads you've kept at the end of the workflow. What is the output of sample_sums(ps)?

cathreenj commented 5 years ago

Hi Benjjneb, sorry for the delay... Yes the amplicons have ~400bps, I couldn't find any info on the expected length distribution. We don't have a bioanalyzer here, so I couldn't check myself. I tried to truncate the last few bases on the reverse reads, but it didn't really make a difference. Many of my reads are very short...

out_tr <- filterAndTrim(fnFs_tr, filtFs.tr, fnRs_tr, filtRs.tr,

minLen = c(50,50), truncLen = c(0, 200), maxN=0, maxEE=c(2,5), minQ=2,

compress=FALSE, verbose=T, multithread=TRUE) # On Windows set multithread=FALSE print(out_tr) reads.in reads.out M1-T1_R1_trimmed.fastq 13691 4482 M2-T1_R1_trimmed.fastq 19217 3503 M4-T2_R1_trimmed.fastq 18028 1722 M5-T2_R1_trimmed.fastq 16042 1014 M6-T2_R1_trimmed.fastq 17724 3968 M7-R28_R1_trimmed.fastq 18974 1662 M9-R30_R1_trimmed.fastq 9444 3807

so the output of sample_sums is:

sample_sums(ps_tr) M1-T1 M2-T1 M4-T2 M5-T2 M6-T2 4134 3322 1700 1005 3838 Thank you for your time, Cathrine

benjjneb commented 5 years ago

Many of my reads are very short...

If that's the case, then it might just be unavoidable that you are losing most of your reads in the filtering step, just not a great sequencing run.

It does seem that after filtering, >90% of the reads are making it through to the end of the pipeline, which is a good sign. So, given the apparent issues with data quality, you are probably OK to use the denoised data, and just have to accept that the library sizes aren't that big after quality control.

cathreenj commented 5 years ago

Ok! That's good news! Thank you very much fpr your help!

spholmes / F1000_workflow

Quality control of code #31

This section demonstrates the “full stack” of amplicon bioinformatics: construction of the sample-by-sequence

feature table from the raw reads, assignment of taxonomy, and creation of a phylogenetic tree relating the sample sequences.

First we load the necessary packages.

Load packages into session, and print package version

Inspect read quality, sort reads, Trim and filter

Sort ensures forward/reverse reads are in same order

Extract sample names, assuming filenames have format: SAMPLENAME_XXX.fastq

Most Illumina sequencing data shows a trend of decreasing average quality towards the end of sequencing reads.

Plot quality of the first two forward reads:

pdf("qualityProfile.pdf")

Save plot, add horizontal line

Based on the quality analysis we will trimm at position 245 for the forward strand and 100 for the reverse strand

Place filtered files in filtered/ subdirectory

Mapping

Here we need to set the taxonomy to the mcrA fasta file

We have to pass a file formatted in the following way to assignTaxonomy

>Level1;Level2;Level3;Level4;Level5;Level6;

ACCTAGAAAGTCGTAGATCGAAGTTGAAGCATCGCCCGATGATCGTCTGAAGCTGTAGCATGAGTCGATTTTCACATTCAGGGATACCATAGGATAC

>Level1;Level2;Level3;Level4;Level5;

CGCTAGAAAGTCGTAGAAGGCTCGGAGGTTTGAAGCATCGCCCGATGGGATCTCGTTGCTGTAGCATGAGTACGGACATTCAGGGATCATAGGATAC

We got the list of genes from FunGen, selecting the 10000 best genes (from 10484)

Yang, Sizhong; Liebner, Susanne; Alawi, Mashal; EbenhÃ¶h, Oliver; Wagner, Dirk (2014): Supplement to: Taxonomic database and cutoff value for processing mcrA gene 454 pyrosequencing data by MOTHUR. Deutsches GeoForschungsZentrum GFZ. http://doi.org/10.5880/GFZ.4.5.2014.001

Sequenzdatenbank umformatieren von >Accno. Seq to >taxonomy +Seq

getwd()

taxtabCath <- assignTaxonomy(seqtab, refFasta = ref_fastaCath,tryRC=T,multithread=T)

Add species. Mapping must 100%

taxtab<-addSpecies(taxtab,"./mcrASequenceDBAssignSpecies.fa",verbose=T)

Trick to add species without 100% mapping

taxtabWithSpec <- assignTaxonomy(seqtab, refFasta="./mcrASequenceDBAssignTaxonomyWithSpecies.fa",tryRC=T,multithread=T,verbose=T)

Prepare phyloseq elements

To construct a phylogenetic tree:

seqs <- getSequences(seqtab)

names(seqs) <- seqs # This propagates to the tip labels of the tree

alignment <- AlignSeqs(DNAStringSet(seqs), anchor=NA)

Construct phylogenetic tree

phang.align <- phyDat(as(alignment, "matrix"), type="DNA")

dm <- dist.ml(phang.align)

treeNJ <- NJ(dm) # Note, tip order != sequence order

fit = pml(treeNJ, data=phang.align)

negative edges length changed to 0!

fitGTR <- update(fit, k=4, inv=0.2)

fitGTR <- optim.pml(fitGTR, model="GTR", optInv=TRUE, optGamma=TRUE,

rearrangement = "stochastic", control = pml.control(trace = 0))

detach("package:phangorn", unload=TRUE)

Combine data into a phyloseq object

head(taxa.print)

head(seqtab.nochim)

remotes::install_github("cpauvert/psadd")