ampliseq fails during taxonomy assignation when processing ITS sequences

eperezv commented 9 months ago

Description of the bug

Hello, I was processing my ITS data using ampliseq and I found a problem. The log can be found below.

I think the problem is directly related to the option

--cut_its

I have checked many things and I think the problem is caused by ITSX, which produces a few very short ASVs (<50 nt) that crash R during taxonomic assignation. I don't know why ITSx is producing that very short ASVs (I think it shouldn't, or there should be a way of filtering them out). I have tried to remove short ASVs with the --min_len_asv but it seems it works at other step and not right before taxonomic assignation.

My solution for now run ampliseq, wait for it to crash. Then, manually remove short ASVs and re-run ampliseq.

I also have a feature suggestion, which is allowing the user to choose the taxonomic group in ITSx. I modified the code manually because I am only interested in amplicons related to fungi.

This is the log:

Feb-11 18:49:48.111 [Task submitter] DEBUG nextflow.processor.TaskProcessor - Handling unexpected condition for
  task: name=NFCORE_AMPLISEQ:AMPLISEQ:DADA2_TAXONOMY_WF:DADA2_TAXONOMY (ASV_ITS_seqs.ITS2.fasta,assignTaxonomy.fna); work-dir=/home/user/ampliseq/work>
  error [nextflow.exception.ProcessUnrecoverableException]: Process requirement exceeds available memory -- req: 128 GB; avail: 125.6 GB
Feb-11 18:49:48.120 [Task submitter] ERROR nextflow.processor.TaskProcessor - Error executing process > 'NFCORE_AMPLISEQ:AMPLISEQ:DADA2_TAXONOMY_WF:DADA2_TAXONOMY (ASV_ITS_seqs.ITS2.fasta,assignTaxonomy.>

Caused by:
  Process requirement exceeds available memory -- req: 128 GB; avail: 125.6 GB

Command executed:

  #!/usr/bin/env Rscript
      suppressPackageStartupMessages(library(dada2))
      set.seed(100) # Initialize random number generator for reproducibility

      taxlevels <- c("Kingdom", "Phylum", "Class", "Order", "Family", "Genus", "Species")

      seq <- getSequences("ASV_ITS_seqs.ITS2.fasta", collapse = TRUE, silence = FALSE)
      taxa <- assignTaxonomy(seq, "assignTaxonomy.fna", taxLevels = taxlevels, minBoot = 50,tryRC = FALSE, multithread = 16, verbose=TRUE, outputBootstraps = TRUE)

      # (1) Make a data frame, add ASV_ID from seq
      tx <- data.frame(ASV_ID = names(seq), taxa, sequence = row.names(taxa$tax), row.names = names(seq))

      # (2) Set confidence to the bootstrap for the most specific taxon
      # extract columns with taxonomic values
      tax <- tx[ , grepl( "tax." , names( tx ) ) ]
      # find first occurrence of NA
      res <- max.col(is.na(tax), ties = "first")
      # correct if no NA is present in column to NA
      if(any(res == 1)) is.na(res) <- (res == 1) & !is.na(tax[[1]])
      # find index of last entry before NA, which is the bootstrap value
      res <- res-1
      # if NA choose last entry
      res[is.na(res)] <- ncol(tax)
      # extract bootstrap values
      boot <- tx[ , grepl( "boot." , names( tx ) ) ]
      boot$last_tax <- res
      valid_boot <- apply(boot,1,function(x) x[x[length(x)]][1]/100 )
      # replace missing bootstrap values (NA) with 0
      valid_boot[is.na(valid_boot)] <- 0
      # add bootstrap values to column confidence
      tx$confidence <- valid_boot

      # (3) Reorder columns before writing to file
      expected_order <- c("ASV_ID",paste0("tax.",taxlevels),"confidence","sequence")
      expected_order <- intersect(expected_order,colnames(tx))
      taxa_export <- subset(tx, select = expected_order)
      colnames(taxa_export) <- sub("tax.", "", colnames(taxa_export))
      rownames(taxa_export) <- names(seq)

      write.table(taxa_export, file = "ASV_ITS_tax.unite-fungi_9_0.tsv", sep = "\t", row.names = FALSE, col.names = TRUE, quote = FALSE, na = '')

      # Save a version with rownames for addSpecies
      taxa_export <- cbind( ASV_ID = tx$ASV_ID, taxa$tax, confidence = tx$confidence)
      saveRDS(taxa_export, "ASV_tax.rds")

      write.table('assignTaxonomy       minBoot = 50,tryRC = FALSE
  taxlevels     c("Kingdom", "Phylum", "Class", "Order", "Family", "Genus", "Species")
  seed  100', file = "assignTaxonomy.args.txt", row.names = FALSE, col.names = FALSE, quote = FALSE, na = '')
      writeLines(c("\"NFCORE_AMPLISEQ:AMPLISEQ:DADA2_TAXONOMY_WF:DADA2_TAXONOMY\":", paste0("    R: ", paste0(R.Version()[c("major","minor")], collapse = ".")),paste0("    dada2: ", packageVersion("dada2>

Command exit status:
  -

Command output:
  UNITE fungal taxonomic reference detected.
  Finished processing reference fasta.

Command error:
  address (nil), cause 'unknown'

   *** caught bus error ***
  address (nil), cause 'unknown'

   *** caught bus error ***
  address (nil), cause 'unknown'

   *** caught bus error ***
  address (nil), cause 'unknown'

   *** caught bus error ***
  address (nil), cause 'unknown'

   *** caught bus error ***
  address (nil), cause 'unknown'

   *** caught bus error ***
  address (nil), cause 'unknown'

   *** caught bus error ***
  address (nil), cause 'unknown'

   *** caught bus error ***
  address (nil), cause 'unknown'

   *** caught bus error ***
  address (nil), cause 'unknown'

   *** caught bus error ***
  address (nil), cause 'unknown'

   *** caught bus error ***
  address (nil), cause 'unknown'

   *** caught segfault ***
  Lost warning messages
  Error: no more error handlers available (recursive errors?); invoking 'abort' restart
  address 0x7fcb27105e90, cause 'memory not mapped'
  An irrecoverable exception occurred. R is aborting now ...
  Error in C_assign_taxonomy2(seqs, rc(seqs), refs, ref.to.genus, tax.mat.int,  :
    bad value

    bad value
  *** stack smashing detected ***: <unknown> terminated

   *** caught segfault ***
  address 0x7feb27105e30, cause 'memory not mapped'
  Error in C_assign_taxonomy2(seqs, rc(seqs), refs, ref.to.genus, tax.mat.int,  :
    bad value

    bad value
  /home/user/ampliseq/work/fa/350c6921628c38a6dad20522b1cc10/.command.run: line 158:    56 Aborted                 (core dumped) /usr/bin/env Rscript >

Work dir:
  /home/user/ampliseq/work/b7/1fb9c2b7ec29b3d2002376c50c8f8f

Tip: you can replicate the issue by changing to the process work dir and entering the command `bash .command.run`

Command used and terminal output

No response

Relevant files

No response

System information

No response

d4straub commented 9 months ago

It looks to me the error hints to another problem:

Caused by:
  Process requirement exceeds available memory -- req: 128 GB; avail: 125.6 GB

Could you use "--max_memory 120.GB" ?

eperezv commented 9 months ago

I saw it, yes. I didn't try "--max_memory 120.GB" yet. But I thought it made no sense to require 128 GB for that task with a few thousand ASVs. I monitored RAM use when crashing and it was not even using 50 GB RAM. After removing a few (ca 100) ASVs < 50nt, it works perfectly.

eperezv commented 9 months ago

I have tried with the "--max_memory 120.GB" option. Same error (but it tried several times, that was the difference).

d4straub commented 9 months ago

Sorry for the late reply, wasnt working for a week.

Some info might be helpful for troubleshooting:

exact command that started ampliseq, if using any configs with parameter, please supply those
pipeline version
nextflow version

As far as I understood ITSx is generating sub-sequences of ASVs, therefore filtering ASVs by length before cutting the ITS region might not help here. You could tests whether --its_partial 50 (50 is an example here!) helps (but that might have unwanted side-effects, such as allowing more sequences to pass).

One step further would be to supply ITSx (i.e. process ITSX_CUTASV) with the exact parameters that you want it to have. Append to your command to run the pipeline -c itsx.config where itsx.config contains:

process {
    withName: ITSX_CUTASV {
        ext.args = '-t all --preserve T --date F --positions F --graphical F --minlen 50'
    }
}

(that contents are by default when using --cut_its full, plus --minlen 50 which should omit any short sequences and therefore might solve your issue).

Please let me know how these approaches work for you in order to further improve pipeline reliability.

d4straub commented 9 months ago

About:

I also have a feature suggestion, which is allowing the user to choose the taxonomic group in ITSx. I modified the code manually because I am only interested in amplicons related to fungi.

Forgot to mention that modifying the pipeline code invalidates all my troubleshooting and I cannot support personal code-altered copies of this pipeline. I hope that isnst the proble here though (If yes I made a mistake to sink time into this...). To make that right, use a config as above to modify -t all to the value you are interested in. Using configs is fine, altering the code makes me not able to help.

eperezv commented 9 months ago

Forgot to mention that modifying the pipeline code invalidates all my troubleshooting and I cannot support personal code-altered copies of this pipeline. I hope that isnst the proble here though (If yes I made a mistake to sink time into this...). To make that right, use a config as above to modify -t all to the value you are interested in. Using configs is fine, altering the code makes me not able to help.

The problem was already there before any code modification. I edited it later while trying to understand the pipeline and the issue

eperezv commented 8 months ago

Hi, I have finally tried what you proposed:

process {
    withName: ITSX_CUTASV {
        ext.args = '-t all --preserve T --date F --positions F --graphical F --minlen 50'
    }
}

Same error, I see reads shorter than 50 nt, which are not removed before taxonomic assignation

d4straub commented 8 months ago

Hi there, that is a pity that this wasnt the solution. Again questions that you didnt answer yet:

Some info might be helpful for troubleshooting: exact command that started ampliseq, if using any configs with parameter, please supply those pipeline version nextflow version

Other than that I am out of ideas, maybe @jtangrot has more ideas?

jtangrot commented 8 months ago

Hi! Unfortunately, the --minlen option in ITSx will not have any effect in this case, it only applies when concatenating regions. To filter the output the option I know of is to use --partial (--its_partial in ampliseq). However, as Daniel mentioned, that will also allow partial matches to the region, which might not be desirable. I'm a bit surprised that ITSx outputs that short ITS2 regions, but if that is the case I wonder why the taxonomy assignment fails with short sequences. I guess that could in theory also happen with other ASVs?

d4straub commented 8 months ago

I wonder why the taxonomy assignment fails with short sequences. I guess that could in theory also happen with other ASVs?

Yes, as far as I remember, it happened already before that short ASVs were stalling the DADA2 classifier, but it was possible to circumvent the issue by filtering for ASV length. This solution seems not applicable in this case.

d4straub commented 8 months ago

So how can we solve this? Would there be a filtering needed between the ITSx output and taxonomic classification? If so, could that be directly in the process or should we add a separate process?

d4straub commented 7 months ago

After a little research, I found an github issue for DADA2 where indeed the threshold of 50bp is mentioned: https://github.com/benjjneb/dada2/issues/601. In https://github.com/benjjneb/dada2/issues/326 is a similar issue for the reference taxonomy, which is set to min 20bp.

d4straub commented 7 months ago

I have added a length filter after ITSX that removes any sequences below 50bp. The threshold of 50 can be changed with a config file, if desired. This change is in the dev branch and will be in the next release.

eperezv commented 7 months ago

Hi, I'm sorry that I was busy and couldn't be active on the discussion. I have tried the latest version and it completely solves the issue. Thanks a lot!

nf-core / ampliseq