Closed spriyansh closed 3 years ago
Hi @spriyansh, I suspect you are using wrong ps object?
ps2 = prune_taxa(keepTaxa, ps0)
ps2 = prune_taxa(keepTaxa, ps1) # Is this what you want?
Hi @spriyansh, I suspect you are using wrong ps object?
ps2 = prune_taxa(keepTaxa, ps0) ps2 = prune_taxa(keepTaxa, ps1) # Is this what you want?
Thank you for the response! I checked the output for keepTaxa = rownames(prevdf1)[(prevdf1$Prevalence >= prevalenceThreshold)]
and it was character(0)
.
In the previous steps, when we create the data frame for the total and average prevalences, I found that the average prevalence for all the phyla is equivalent to 1. This is the reason I am not getting any pruned taxa.
Command:
temp_df <- plyr::ddply(prevdf, "Phylum", function(df1){cbind(mean(df1$Prevalence),sum(df1$Prevalence))})
Output
Phylum mean(df1$Prevalence) sum(df1$Prevalence)
1 Acidobacteriota 1.000713 1404
2 Actinobacteriota 1.025891 3051
3 Armatimonadota 1.000000 179
4 Bacteroidota 1.000000 181
5 Bdellovibrionota 1.000000 13
6 Chloroflexi 1.022583 2581
7 Crenarchaeota 1.000000 174
8 Cyanobacteria 1.000000 159
9 Dadabacteria 1.000000 2
10 Deinococcota 1.000000 9
I don't think so mean(df1$Prevalence)
should be 1 for all the phyla. Can you tell where I could have possibly gone wrong in the pipeline of Callahan et al.?
@spriyansh Without seeing your workflow and your data it's quite hard to debug. If you don't mind sharing your original ps
object and how you process it up to the point where you create keepTaxa
, then I can help take a look at it.
@spriyansh Without seeing your workflow and your data, it's quite hard to debug. If you don't mind sharing your original
ps
object and how you process it up to the point where you createkeepTaxa
, then I can help take a look at it.
@ycl6 Yes, here is the workflow
# Loading the library
library(phyloseq)
library(tidyverse)
######################
### Making Object ######
######################
# Reading taxonomic annotations
asv_tax <- read.csv("InputRaw/asv/ASV_tax.tsv", sep = "\t")
# Column to rownames
asv_tax <- asv_tax %>% tibble::column_to_rownames("ASV")
# Conversion to matrix
asv_tax <- as.matrix(asv_tax)
# Reading Sequence Variants
asv_abundance <- read.csv("InputRaw/asv/ASV_abundance.tsv", sep = "\t")
# Column to rownames
asv_abundance <- asv_abundance %>% tibble::column_to_rownames("ASV")
# Conversion to matrix
asv_abundance <- as.matrix(asv_abundance)
# Reading Metadata
metadata <- read.csv("InputRaw/asv/ASV_meta.tsv", sep = "\t")
# Adding X to s_number
metadata$s_number <- sub("^", "X", metadata$s_number)
# Column to rownames
metadata <- metadata %>% tibble::column_to_rownames("s_number")
# Individual Objects
ASV <- otu_table(asv_abundance, taxa_are_rows = T)
Taxa <- tax_table(asv_tax)
Samples <- sample_data(metadata)
# Creating Object
ASV_PS <- phyloseq(ASV, Taxa, Samples)
# Summary
ASV_PS
# Saving to RDS
saveRDS(ASV_PS, "InputRDS/ASV_PS.RDS")
save(ASV_PS, file = "InputRDS/ASV_PS.RData")
###########################
#### PhyloSeq Analysis #######
###########################
# Viewing the Available ranks
rank_names(ASV_PS)
# Creating a table for number of features for each phyla
table(tax_table(ASV_PS)[, "Phylum"], exclude = NULL)
# Removing the NAs and uncharacterized Phylums
ps0 <- subset_taxa(ASV_PS, !is.na(Phylum) & !Phylum %in% c("", "uncharacterized"))
# Summary Before and After
ASV_PS
ps0
# Compute prevalence of each feature, store as data.frame
prevdf = apply(X = otu_table(ps0),
MARGIN = ifelse(taxa_are_rows(ps0), yes = 1, no = 2),
FUN = function(x){sum(x > 0)})
# Add taxonomy and total read counts to this data.frame
prevdf = data.frame(Prevalence = prevdf,
TotalAbundance = taxa_sums(ps0),
tax_table(ps0))
# Viewing
head(prevdf)
# Computing average and total prevalence
temp_df <- plyr::ddply(prevdf, "Phylum", function(df1){cbind(mean(df1$Prevalence),sum(df1$Prevalence))})
# Vector to be removed
filterPhyla <- temp_df[temp_df$`2` < 50,]$Phylum
# Filter entries with unidentified Phylum.
ps1 = subset_taxa(ps0, !Phylum %in% filterPhyla)
# Summary
ASV_PS
ps0
ps1
# Subset to the remaining phyla
prevdf1 = subset(prevdf, Phylum %in% get_taxa_unique(ps1, "Phylum"))
# Define prevalence threshold as 5% of total samples
prevalenceThreshold = 0.05 * nsamples(ps0)
prevalenceThreshold
## [1] 3.3
### Here is the error ###
# Execute prevalence filter, using `prune_taxa()` function
keepTaxa = rownames(prevdf1)[(prevdf1$Prevalence >= prevalenceThreshold)]
ps2 = prune_taxa(keepTaxa, ps0)
The following files are attached as zip,
Hi @spriyansh, I have a look at your data, unfortunately it does appear that's how the composition is in your dataset.
Almost ~99% of the ASVs are observed in a single sample, and therefore the mean (and median) will be very close to 1.
> table(prevdf$Prevalence)
1 2 3
17220 153 29
You can plot the distribution of Taxa Prevalence as below:
ggplot(prevdf, aes(Prevalence)) + geom_histogram(bins = 50) + theme_bw() +
ggtitle("Histogram of Taxa Prevalence")
You can take a look at the OTU table and you'll observe that many of the rows have zero values in all but one of the samples.
> head(otu_table(ps0))
OTU Table: [6 taxa and 66 samples]
taxa are rows
X10 X11 X12 X13 X14 X15 X16 X17 X18 X19 X1 X20 X21 X22 X23 X24 X25 X26 X27 X28 X29 X2 X30 X32 X33 X34 X35 X36 X37 X38 X39 X3 X40
ASV_1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
ASV_2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 609
ASV_3 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
ASV_4 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
ASV_5 0 0 114 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 290 0 0 0 0 0 109 0 0 0 0 0 0 0
ASV_6 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 359 0 0 0 0 0 0 0 0 0 0 0 0
X41 X42 X43 X44 X45 X46 X47 X48 X49 X4 X50 X51 X52 X53 X54 X55 X56 X58 X59 X60 X61 X62 X63 X64 X65 X66 X67 X68 X69 X6 X7 X8 X9
ASV_1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 660 0 0
ASV_2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
ASV_3 0 0 0 0 0 0 0 560 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
ASV_4 0 0 0 0 0 0 527 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
ASV_5 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
ASV_6 135 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
> head(prevdf)
Prevalence TotalAbundance Kingdom Phylum Class Order Family Genus
ASV_1 1 660 Bacteria Actinobacteriota Acidimicrobiia <NA> <NA> <NA>
ASV_2 1 609 Bacteria Acidobacteriota Acidobacteriae Bryobacterales Bryobacteraceae Bryobacter
ASV_3 1 560 Bacteria Planctomycetota Planctomycetes Isosphaerales Isosphaeraceae Tundrisphaera
ASV_4 1 527 Bacteria Planctomycetota Planctomycetes Isosphaerales Isosphaeraceae Candidatus Nostocoida
ASV_5 3 513 Bacteria Chloroflexi AD3 <NA> <NA> <NA>
ASV_6 2 494 Bacteria Actinobacteriota Actinobacteria Corynebacteriales Mycobacteriaceae Mycobacterium
Thanks, @ycl6. I'll have a look at it. One more thing I did the same Phyloseq analysis on the OTUs extracted using Mothur MiSeq SOP, but that works perfectly fine (Mothur OTU_PS.zip). I mean, I understand both the dada2 and Mothur generate sequence variants at different similarities. However, I am still sceptical that most dada derived ASVs are observed in a single sample compared to their OTU counterparts from Mothur (at 3% dissimilarity).
Following is the dada2 pipeline that I used to generate ASVs. Would you mind looking at it? I also saved the r-objects after every step and can share them as well if you need one.
#Loading library
library(dada2)
# Setting seed
set.seed(20233083)
# filepath
path <- file.path("dadaInputFilteres")
# Reading filenames
fns <- sort(list.files(path, full.names = TRUE))
filtFs <- fns[grepl("R1", fns)]
filtRs <- fns[grepl("R2", fns)]
# Dereplication
derepFs <- derepFastq(filtFs)
derepRs <- derepFastq(filtRs)
# Sample Names
sam.names <- sapply(strsplit(basename(filtFs), "_"),'[', 1)
# Adding Sample information to dereplicated objects
names(derepFs) <- sam.names
names(derepRs) <- sam.names
# Error estimation
errorsF <- learnErrors(derepFs)
errorsR <- learnErrors(derepRs)
# Dada algorithm
dadaFs <- dada(derepFs, err=errorsF, pool=F, verbose=TRUE)
dadaRs <- dada(derepRs, err=errorsR, pool=F, verbose=TRUE)
# Meging
mergers <- mergePairs(dadaFs, derepFs, dadaRs, derepRs)
# ASV
seqtab.all <- makeSequenceTable(mergers[!grepl("Mock", names(mergers))])
# Chimera removed
# This is my ASV.tsv after t()
seqtab <- removeBimeraDenovo(seqtab.all)
# Silva Taxonomy
asv_silva_tax <- assignTaxonomy(seqtab,"silva_nr99_v138_train_set.fa.gz", tryRC=TRUE, multithread=TRUE)
@spriyansh Just one thing stands out, have you perform any trimming to your raw fastq files prior to derepFastq
. This would include (1) removing PCR primers using tools such as cutadapt
and (2) use filterAndTrim
to perform additional trimming of end-bases. You should inspect read quality profiles using plotQualityProfile
and choose an appropriate truncLen
settings, as well as other settings.
For example:
# Define dada2's filtered outfile names
filtFs = file.path(filt, basename(fnFs)) # fnFs: R1 with PCR primer removed
filtRs = file.path(filt, basename(fnRs)) # fnRs: R2 with PCR primer removed
# Run filterAndTrim
out = filterAndTrim(fnFs, filtFs, fnRs, filtRs,
truncLen = c(260,200), minLen = 200, maxN = 0, truncQ = 2, maxEE = c(2,5),
rm.phix = TRUE, compress = TRUE, verbose = TRUE, multithread = TRUE)
out = as.data.frame(out)
rownames(out) = sample.names
@ycl6 I didn't use cutadapt
, but yes, I did the trimming using dada as follows,
#Loading library
library(dada2)
# Setting seed
set.seed(20233083)
# Setting filename
raw_files <- file.path("unFiltered")
filtered_path <- file.path("filtered")
# Reading filename
fns <- sort(list.files(raw_files, full.names = TRUE))
fnFs <- fns[grepl("R1", fns)]
fnRs <- fns[grepl("R2", fns)]
# Sorting file-names
fns <- sort(list.files(raw_files, full.names = TRUE))
# Setting path for filtered reads
filtFs <- file.path(filtered_path, basename(fnFs))
filtRs <- file.path(filtered_path, basename(fnRs))
# Trimming
for(i in seq_along(fnFs)) {
fastqPairedFilter(c(fnFs[[i]], fnRs[[i]]),
c(filtFs[[i]], filtRs[[i]]),
trimLeft=0, truncLen=250,
maxN=0, maxEE=2, truncQ=2,
compress=TRUE)
}
Here is the plot of quality scores, of the trimmed reads, multiqc_report.zip
@spriyansh Based on the multiqc plot (Overrepresented sequences) it shows you have PCR primer sequences in the reads. You have to remove primer sequences from your paired end reads before you can proceed with the dada2 workflow. If you are familiar with cutadapt
, you can execute it in command line to process your files, if not, you can refer to the Identify Primers and Remove Primers sections in dada2 to adapt to your experiment, i.e. changing the primer sequences
@ycl6 Thank you for your response. I'll do the needful.
Cheers!
I am fairly new to phyolseq. I extracted the exact sequence variants using the dada pipeline published by Callahan et al.. I was following the phyloseq pipeline given over there for prevalence filtering, but I encountered an error saying,
Error in validObject(.Object) : invalid class “otu_table” object: OTU abundance data must have non-zero dimensions.
I made the phyloseq object without any error; it looks like,
R command: