Error: OTU abundance data must have non-zero dimensions.

spriyansh commented 3 years ago

I am fairly new to phyolseq. I extracted the exact sequence variants using the dada pipeline published by Callahan et al.. I was following the phyloseq pipeline given over there for prevalence filtering, but I encountered an error saying,

Error in validObject(.Object) : invalid class “otu_table” object: OTU abundance data must have non-zero dimensions.

I made the phyloseq object without any error; it looks like,

phyloseq-class experiment-level object
otu_table()   OTU Table:         [ 17197 taxa and 66 samples ]
sample_data() Sample Data:       [ 66 samples by 21 sample variables ]
tax_table()   Taxonomy Table:    [ 17197 taxa by 6 taxonomic ranks ]
phy_tree()    Phylogenetic Tree: [ 17197 tips and 17195 internal nodes ]

R command:

#  Define prevalence threshold as 5% of total samples
prevalenceThreshold = 0.05 * nsamples(ps1)
prevalenceThreshold

## Output [1] 3.3

# Execute prevalence filter, using `prune_taxa()` function
keepTaxa = rownames(prevdf1)[(prevdf1$Prevalence >= prevalenceThreshold)]
ps2 = prune_taxa(keepTaxa, ps0)

## Error in validObject(.Object) : invalid class “otu_table” object: 
 OTU abundance data must have non-zero dimensions.

ycl6 commented 3 years ago

Hi @spriyansh, I suspect you are using wrong ps object?

ps2 = prune_taxa(keepTaxa, ps0)

ps2 = prune_taxa(keepTaxa, ps1) # Is this what you want?

spriyansh commented 3 years ago

Hi @spriyansh, I suspect you are using wrong ps object?
ps2 = prune_taxa(keepTaxa, ps0)

ps2 = prune_taxa(keepTaxa, ps1) # Is this what you want?

Thank you for the response! I checked the output for keepTaxa = rownames(prevdf1)[(prevdf1$Prevalence >= prevalenceThreshold)] and it was character(0).

In the previous steps, when we create the data frame for the total and average prevalences, I found that the average prevalence for all the phyla is equivalent to 1. This is the reason I am not getting any pruned taxa.

Command:

temp_df <- plyr::ddply(prevdf, "Phylum", function(df1){cbind(mean(df1$Prevalence),sum(df1$Prevalence))})

Output

              Phylum     mean(df1$Prevalence)    sum(df1$Prevalence)
1    Acidobacteriota     1.000713                          1404
2   Actinobacteriota     1.025891                          3051
3     Armatimonadota     1.000000                          179
4       Bacteroidota     1.000000                          181
5   Bdellovibrionota     1.000000                          13
6        Chloroflexi     1.022583                          2581
7      Crenarchaeota     1.000000                          174
8      Cyanobacteria     1.000000                          159
9       Dadabacteria     1.000000                          2
10      Deinococcota     1.000000                          9

I don't think so mean(df1$Prevalence) should be 1 for all the phyla. Can you tell where I could have possibly gone wrong in the pipeline of Callahan et al.?

ycl6 commented 3 years ago

@spriyansh Without seeing your workflow and your data it's quite hard to debug. If you don't mind sharing your original ps object and how you process it up to the point where you create keepTaxa, then I can help take a look at it.

spriyansh commented 3 years ago

@spriyansh Without seeing your workflow and your data, it's quite hard to debug. If you don't mind sharing your original ps object and how you process it up to the point where you create keepTaxa, then I can help take a look at it.

@ycl6 Yes, here is the workflow

# Loading the library
library(phyloseq)
library(tidyverse)

######################
### Making Object ######
######################

# Reading taxonomic annotations
asv_tax <- read.csv("InputRaw/asv/ASV_tax.tsv", sep = "\t")

# Column to rownames
asv_tax <- asv_tax %>% tibble::column_to_rownames("ASV")

# Conversion to matrix
asv_tax <-  as.matrix(asv_tax)

# Reading Sequence Variants
asv_abundance <- read.csv("InputRaw/asv/ASV_abundance.tsv", sep = "\t")

# Column to rownames
asv_abundance <- asv_abundance %>% tibble::column_to_rownames("ASV")

# Conversion to matrix
asv_abundance <-  as.matrix(asv_abundance)

# Reading Metadata
metadata <- read.csv("InputRaw/asv/ASV_meta.tsv", sep = "\t")

# Adding X to s_number
metadata$s_number <- sub("^", "X", metadata$s_number)

# Column to rownames
metadata  <- metadata %>%  tibble::column_to_rownames("s_number") 

# Individual Objects
ASV <- otu_table(asv_abundance, taxa_are_rows = T)
Taxa <- tax_table(asv_tax)
Samples <- sample_data(metadata)

# Creating Object
ASV_PS <- phyloseq(ASV, Taxa, Samples)

# Summary
ASV_PS

# Saving to RDS
saveRDS(ASV_PS, "InputRDS/ASV_PS.RDS")
save(ASV_PS, file = "InputRDS/ASV_PS.RData")

###########################
#### PhyloSeq Analysis #######
###########################

# Viewing the Available ranks
rank_names(ASV_PS) 

# Creating a table for number of features for each phyla
table(tax_table(ASV_PS)[, "Phylum"], exclude = NULL)

# Removing the NAs and uncharacterized Phylums
ps0 <- subset_taxa(ASV_PS, !is.na(Phylum) & !Phylum %in% c("", "uncharacterized"))

# Summary Before and After
ASV_PS
ps0

# Compute prevalence of each feature, store as data.frame
prevdf = apply(X = otu_table(ps0),
               MARGIN = ifelse(taxa_are_rows(ps0), yes = 1, no = 2),
               FUN = function(x){sum(x > 0)})

# Add taxonomy and total read counts to this data.frame
prevdf = data.frame(Prevalence = prevdf,
                    TotalAbundance = taxa_sums(ps0),
                    tax_table(ps0))

# Viewing
head(prevdf)

# Computing average and total prevalence
temp_df <- plyr::ddply(prevdf, "Phylum", function(df1){cbind(mean(df1$Prevalence),sum(df1$Prevalence))})

# Vector to be removed
filterPhyla <- temp_df[temp_df$`2` < 50,]$Phylum

# Filter entries with unidentified Phylum.
ps1 = subset_taxa(ps0, !Phylum %in% filterPhyla)

# Summary
ASV_PS
ps0
ps1

# Subset to the remaining phyla
prevdf1 = subset(prevdf, Phylum %in% get_taxa_unique(ps1, "Phylum"))

#  Define prevalence threshold as 5% of total samples
prevalenceThreshold = 0.05 * nsamples(ps0)
prevalenceThreshold

## [1] 3.3

### Here is the error ###
# Execute prevalence filter, using `prune_taxa()` function
keepTaxa = rownames(prevdf1)[(prevdf1$Prevalence >= prevalenceThreshold)]
ps2 = prune_taxa(keepTaxa, ps0)

The following files are attached as zip,

Taxonomic Annotations in TSV format
ASV abundance in TSV format
Metadata in TSV format
Phyloseq Object Data.zip

ycl6 commented 3 years ago

Hi @spriyansh, I have a look at your data, unfortunately it does appear that's how the composition is in your dataset.

Almost ~99% of the ASVs are observed in a single sample, and therefore the mean (and median) will be very close to 1.

> table(prevdf$Prevalence)

    1     2     3 
17220   153    29

You can plot the distribution of Taxa Prevalence as below:

ggplot(prevdf, aes(Prevalence)) + geom_histogram(bins = 50) + theme_bw() + 
ggtitle("Histogram of Taxa Prevalence")

You can take a look at the OTU table and you'll observe that many of the rows have zero values in all but one of the samples.

> head(otu_table(ps0))
OTU Table:          [6 taxa and 66 samples]
                     taxa are rows
      X10 X11 X12 X13 X14 X15 X16 X17 X18 X19 X1 X20 X21 X22 X23 X24 X25 X26 X27 X28 X29 X2 X30 X32 X33 X34 X35 X36 X37 X38 X39 X3 X40
ASV_1   0   0   0   0   0   0   0   0   0   0  0   0   0   0   0   0   0   0   0   0   0  0   0   0   0   0   0   0   0   0   0  0   0
ASV_2   0   0   0   0   0   0   0   0   0   0  0   0   0   0   0   0   0   0   0   0   0  0   0   0   0   0   0   0   0   0   0  0 609
ASV_3   0   0   0   0   0   0   0   0   0   0  0   0   0   0   0   0   0   0   0   0   0  0   0   0   0   0   0   0   0   0   0  0   0
ASV_4   0   0   0   0   0   0   0   0   0   0  0   0   0   0   0   0   0   0   0   0   0  0   0   0   0   0   0   0   0   0   0  0   0
ASV_5   0   0 114   0   0   0   0   0   0   0  0   0   0   0   0   0   0   0   0 290   0  0   0   0   0 109   0   0   0   0   0  0   0
ASV_6   0   0   0   0   0   0   0   0   0   0  0   0   0   0   0   0   0   0   0   0 359  0   0   0   0   0   0   0   0   0   0  0   0
      X41 X42 X43 X44 X45 X46 X47 X48 X49 X4 X50 X51 X52 X53 X54 X55 X56 X58 X59 X60 X61 X62 X63 X64 X65 X66 X67 X68 X69 X6  X7 X8 X9
ASV_1   0   0   0   0   0   0   0   0   0  0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0  0 660  0  0
ASV_2   0   0   0   0   0   0   0   0   0  0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0  0   0  0  0
ASV_3   0   0   0   0   0   0   0 560   0  0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0  0   0  0  0
ASV_4   0   0   0   0   0   0 527   0   0  0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0  0   0  0  0
ASV_5   0   0   0   0   0   0   0   0   0  0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0  0   0  0  0
ASV_6 135   0   0   0   0   0   0   0   0  0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0  0   0  0  0

> head(prevdf)
      Prevalence TotalAbundance  Kingdom           Phylum          Class             Order           Family                 Genus
ASV_1          1            660 Bacteria Actinobacteriota Acidimicrobiia              <NA>             <NA>                  <NA>
ASV_2          1            609 Bacteria  Acidobacteriota Acidobacteriae    Bryobacterales  Bryobacteraceae            Bryobacter
ASV_3          1            560 Bacteria  Planctomycetota Planctomycetes     Isosphaerales   Isosphaeraceae         Tundrisphaera
ASV_4          1            527 Bacteria  Planctomycetota Planctomycetes     Isosphaerales   Isosphaeraceae Candidatus Nostocoida
ASV_5          3            513 Bacteria      Chloroflexi            AD3              <NA>             <NA>                  <NA>
ASV_6          2            494 Bacteria Actinobacteriota Actinobacteria Corynebacteriales Mycobacteriaceae         Mycobacterium

spriyansh commented 3 years ago

Thanks, @ycl6. I'll have a look at it. One more thing I did the same Phyloseq analysis on the OTUs extracted using Mothur MiSeq SOP, but that works perfectly fine (Mothur OTU_PS.zip). I mean, I understand both the dada2 and Mothur generate sequence variants at different similarities. However, I am still sceptical that most dada derived ASVs are observed in a single sample compared to their OTU counterparts from Mothur (at 3% dissimilarity).

Following is the dada2 pipeline that I used to generate ASVs. Would you mind looking at it? I also saved the r-objects after every step and can share them as well if you need one.

#Loading library
library(dada2)

# Setting seed
set.seed(20233083)

# filepath
path <- file.path("dadaInputFilteres")

# Reading filenames
fns <- sort(list.files(path, full.names = TRUE))
filtFs <- fns[grepl("R1", fns)]
filtRs <- fns[grepl("R2", fns)]

# Dereplication
derepFs <- derepFastq(filtFs)
derepRs <- derepFastq(filtRs)

# Sample Names
sam.names <- sapply(strsplit(basename(filtFs), "_"),'[', 1)

# Adding Sample information to dereplicated objects
names(derepFs) <- sam.names
names(derepRs) <- sam.names

# Error estimation
errorsF <- learnErrors(derepFs)
errorsR <- learnErrors(derepRs)

# Dada algorithm
dadaFs <- dada(derepFs, err=errorsF, pool=F, verbose=TRUE)
dadaRs <- dada(derepRs, err=errorsR, pool=F, verbose=TRUE)

# Meging
mergers <- mergePairs(dadaFs, derepFs, dadaRs, derepRs)

# ASV
seqtab.all <- makeSequenceTable(mergers[!grepl("Mock", names(mergers))])

# Chimera removed
# This is my ASV.tsv after t()
seqtab <- removeBimeraDenovo(seqtab.all)

# Silva Taxonomy
asv_silva_tax <- assignTaxonomy(seqtab,"silva_nr99_v138_train_set.fa.gz", tryRC=TRUE, multithread=TRUE)

ycl6 commented 3 years ago

@spriyansh Just one thing stands out, have you perform any trimming to your raw fastq files prior to derepFastq. This would include (1) removing PCR primers using tools such as cutadapt and (2) use filterAndTrim to perform additional trimming of end-bases. You should inspect read quality profiles using plotQualityProfile and choose an appropriate truncLen settings, as well as other settings.

For example:

# Define dada2's filtered outfile names
filtFs = file.path(filt, basename(fnFs)) # fnFs: R1 with PCR primer removed
filtRs = file.path(filt, basename(fnRs)) # fnRs: R2 with PCR primer removed

# Run filterAndTrim
out = filterAndTrim(fnFs, filtFs, fnRs, filtRs,
        truncLen = c(260,200), minLen = 200, maxN = 0, truncQ = 2, maxEE = c(2,5),
        rm.phix = TRUE, compress = TRUE, verbose = TRUE, multithread = TRUE)

out = as.data.frame(out)
rownames(out) = sample.names

spriyansh commented 3 years ago

@ycl6 I didn't use cutadapt, but yes, I did the trimming using dada as follows,

#Loading library
library(dada2)

# Setting seed
set.seed(20233083)

# Setting filename
raw_files <- file.path("unFiltered")
filtered_path <- file.path("filtered")

# Reading filename
fns <- sort(list.files(raw_files, full.names = TRUE))
fnFs <- fns[grepl("R1", fns)]
fnRs <- fns[grepl("R2", fns)]

# Sorting file-names
fns <- sort(list.files(raw_files, full.names = TRUE))

# Setting path for filtered reads
filtFs <- file.path(filtered_path, basename(fnFs))
filtRs <- file.path(filtered_path, basename(fnRs))

# Trimming
for(i in seq_along(fnFs)) {
  fastqPairedFilter(c(fnFs[[i]], fnRs[[i]]),
                    c(filtFs[[i]], filtRs[[i]]),
                    trimLeft=0, truncLen=250,
                    maxN=0, maxEE=2, truncQ=2,
                    compress=TRUE)
}

Here is the plot of quality scores, of the trimmed reads, multiqc_report.zip

ycl6 commented 3 years ago

@spriyansh Based on the multiqc plot (Overrepresented sequences) it shows you have PCR primer sequences in the reads. You have to remove primer sequences from your paired end reads before you can proceed with the dada2 workflow. If you are familiar with cutadapt, you can execute it in command line to process your files, if not, you can refer to the Identify Primers and Remove Primers sections in dada2 to adapt to your experiment, i.e. changing the primer sequences

spriyansh commented 3 years ago

@ycl6 Thank you for your response. I'll do the needful.

Cheers!

joey711 / phyloseq

Error: OTU abundance data must have non-zero dimensions. #1483