nf-core / ampliseq

Amplicon sequencing analysis workflow using DADA2 and QIIME2
https://nf-co.re/ampliseq
MIT License
182 stars 115 forks source link

Phyloseq object creation will fail if any samples have all reads removed by the tax filtering step #662

Closed agrier-wcm closed 10 months ago

agrier-wcm commented 10 months ago

Description of the bug

Phyloseq object creation will fail if any samples have all reads removed by the tax filtering step. The metadata and the OTU table will not agree about the set of sample IDs.

Command used and terminal output

No response

Relevant files

No response

System information

No response

d4straub commented 10 months ago

Thanks for the report, that indeed requires fixing.

d4straub commented 10 months ago

I tested: nextflow run nf-core/ampliseq -r 2.7.1 -profile test_failed,singularity --outdir result_test_failed_2-7-1 that is loosing a sample during early QC. This sample is missing in the ASV table but phyloseq works just fine. nextflow run nf-core/ampliseq -r 2.7.1 -profile test_failed,singularity --outdir result_test_failed_2-7-1 --min_frequency 135 --exclude_taxa Moraxellaceae --diversity_rarefaction_depth 100 --skip_diversity_indices --skip_ancom to make it also loose ASV so that an additional sample will loose all counts during taxonomic filtering (QIIME2_FILTERTAXA), and consequentely only 3 samples are left in the ASV table. But the pipeline is still fine and the phyloseq object is created.

@agrier-wcm I'll need more information here, I cannot reproduce the problem.

agrier-wcm commented 10 months ago

Thanks for promptly looking into this @d4straub. I have a relatively small dataset that works fine with 2.6.1, but I get this error with 2.7.1:

Error executing process > 'NFCORE_AMPLISEQ:AMPLISEQ:PHYLOSEQ_WORKFLOW:PHYLOSEQ (dada2)'

Caused by:
  Process `NFCORE_AMPLISEQ:AMPLISEQ:PHYLOSEQ_WORKFLOW:PHYLOSEQ (dada2)` terminated with an error exit status (1)

Command executed:

  #!/usr/bin/env Rscript

  suppressPackageStartupMessages(library(phyloseq))

  otu_df  <- read.table("reformat_filtered-table.tsv", sep="\t", header=TRUE, row.names=1)
  tax_df  <- read.table("ASV_tax_species.silva.tsv", sep="\t", header=TRUE, row.names=1)
  otu_mat <- as.matrix(otu_df)
  tax_mat <- as.matrix(tax_df)

  OTU     <- otu_table(otu_mat, taxa_are_rows=TRUE)
  TAX     <- tax_table(tax_mat)
  phy_obj <- phyloseq(OTU, TAX)

  if (file.exists("Metadata.tsv")) {
      sam_df  <- read.table("Metadata.tsv", sep="\t", header=TRUE, row.names=1)
      SAM     <- sample_data(sam_df)
      phy_obj <- merge_phyloseq(phy_obj, SAM)
  }

  if (file.exists("")) {
      TREE    <- read_tree("")
      phy_obj <- merge_phyloseq(phy_obj, TREE)
  }

  saveRDS(phy_obj, file = paste0("dada2", "_phyloseq.rds"))

  # Version information
  writeLines(c("\"NFCORE_AMPLISEQ:AMPLISEQ:PHYLOSEQ_WORKFLOW:PHYLOSEQ\":",
      paste0("    R: ", paste0(R.Version()[c("major","minor")], collapse = ".")),
      paste0("    phyloseq: ", packageVersion("phyloseq"))),
      "versions.yml"
  )

Command exit status:
  1

Command output:
  (empty)

Command error:
  Error in validObject(.Object) : invalid class “phyloseq” object: 
   Component sample names do not match.
   Try sample_names()
  Calls: merge_phyloseq ... do.call -> new -> initialize -> initialize -> validObject
  Execution halted

It does have a sample that loses all reads to the tax filtering step, so I assumed that was the problem because I thought for phyloseq object creation the set of samples in the Metadata file would have to match the set in the OTU table. Maybe there's something else going on. I'm testing a couple other small datasets now with 2.7.1 to see if I can reproduce the error with another dataset. Here is the command I was using, for reference:

nextflow run ampliseq -profile singularity --input ./SampleSheet.tsv --FW_primer GTGYCAGCMGCCGCGGTAA --RV_primer CCGYCAATTYMTTTRAGTTT --metadata ./Metadata.tsv --outdir ./results --email my.email@me --dada_ref_taxonomy silva --ignore_empty_input_files --ignore_failed_trimming --min_frequency 10 --retain_untrimmed --trunclenf 240 --trunclenr 160 --metadata_category_barplot "TumorLocation,SampleTissueType" --tax_agglom_max 7 --max_memory 32.GB

Does your test also use silva as the taxonomic database? I will have more information shortly from the tests that I am running, but I'm not exactly sure what to interrogate if the problem is not the tax filter thing. I will update this evening.

agrier-wcm commented 10 months ago

Well, it ended up being something stupid. In the Metadata file, some samples had a space at the end of the sample names, which they did not have in the SampleSheet. Somewhat interestingly, this did not cause any problems in 2.6.1, even for the comparisons/group assignments. Sorry to waste your time @d4straub , but thank you for looking into it. Closing.

d4straub commented 10 months ago

I see, happens to the best of us ;) Thanks for returning feedback.