nf-core / eager

A fully reproducible and state-of-the-art ancient DNA analysis pipeline
https://nf-co.re/eager
MIT License
141 stars 80 forks source link

PMDTools error exit status (141) #567

Closed narning1992 closed 3 years ago

narning1992 commented 3 years ago

Hey eager dev team, first off I wanted to say I really love eager it makes the whole dna in silico extraction so much easier especially for someone new to the field like myself. I downloaded a few test libraries to run it on and actually a lot of them worked. However, for some I'm experiencing the following error:

[c9/27e998] NOTE: Process pmdtools (ERR3728226_raw_1.fastq.combined.fq.mapped_rmdup) terminated with an error exit status (141) -- Error is ignored

This would normally not be a problem as the error is ignored, but I am trying to use the pmdtools output for genotyping as I only want ancient genomes. However the genotyping is never started as I can see from the stdout and by the folder in outputs not being created. I have looked into the working directory of c9/27e998... to look at the log files but .command.log , .command.out., .command.trace and command.err are empty. Only the .exitcode has the number 141 in it. If I run bash .command.run it runs through without errors and another .exitcode is created with 141 in it. So maybe the problem is just that the downstream genotyping is not started bc it assumes pmdtools failed due to the exitcode.

I am using the following command for eager (I made a few adjustments as I'm running this on an hpc environment with multiple instances of eager running in parallel): nextflow run ./eager_analysis/eager \ -profile singularity \ --paired_end \ --reads./reads/ERR3728226_raw*1.fastq.gz \ --fasta ./ref_genomes/Treponema_maltophilum.fa \ --outdir ./outputs/ \ --w ./work_dir/ \ --max_memory 100.GB \ --max_cpus 5 \ --run_pmdtools \ --run_genotyping \ --genotyping_tool 'ug' \ --gatk_ug_jar ./GenomeAnalysisTK.jar \ --gatk_ploidy 1 \ --gatk_ug_out_mode 'EMIT_ALL_SITES' \ --gatk_ug_genotype_model 'SNP' \ --genotyping_source 'pmd' \ --with-singularity ./nfcore-eager-2.1.0.img \ --max_time '12.h'

I use the following reads for the analysis: https://www.ncbi.nlm.nih.gov/sra/?term=ERR3728226

As I am not too familiar with the nextflow pipeline this could be a very trivial error. I apologise if that is the case. Sorry also for the verbose issue just wanted to be as precise as possible. I hope this is all the info you need.

Thank you! Nick

jfy133 commented 3 years ago

Hi Nick,

Firstly, I just want to say that your issue was not at all verbose and is actually the perfect level of detail: I am able to run exactly what you tried running so I can exactly replicate the issue, so this is perfect - thank you!

A few notes:

1) You should be careful with your --reads, you must put these in quotes otherwise bash will evaluate the wildcard and not nextflow. This means you will run the forward and reverse reads separately, rather than merge. 2) To specify the read pair, you need to wrap the F/R indicators with {}, so for example I ran with --reads ERR3728226_{1,2}.fastq.gz 2) I would actually recommend using -r dev (you can cite the hash commit if needed for the version) rather than 2.1.0, there are a few bugs in 2.1.0 that have been resolved in -r dev and is generally more robust. Note there are some structural changes, so check your command before you re-run data. 3) For

pmdtools output for genotyping as I only want ancient genomes.

This maybe a misunderstanding (or my misunderstanding from your phrasing); PMDtools only gives you reads that likely have true damage - however not all aDNA reads will necessarily have damage, only a certain frequency of them will do. Depending on your overall coverage and level of damage, you may be unecessarily removing most of your data making genotyping very difficult (and thus downstream analysis). Assuming you get sufficient coverage, and overall you see expected damage profiles across the sum of all reads (from DamageProfiler section of MultiQC), you can still consider the genome to be 'ancient' (a good reference for additional authentication criteria: https://doi.org/10.1146/annurev-genom-091416-035526). Apologies if you know this already and you specifically are interested on genotyping only reads with damage for a specific analysis, but I've come across this misunderstanding before so want everything to be clear!)

Back to your reported error:

You are right, as you selected your genotyping source to be 'pmd'. the genotyping step will not proceed unless PMDtools successfully runs.

141 is called when a pipe is shut 'early' by a preceeding command in the pipe-chain. This is due to the way PMDTools takes input from samtools view, and maximum number of reads is specified (--pmdtools_max_reads) an early closure is reported. By default in nf-core/eager 2.1.0 has a max of 10k reads, and you have three times that in your mapping. This is a known issue: https://github.com/pontussk/PMDtools/issues/7.

A quick work around is to just use all reads e.g. by setting --pmdtools_max_reads 50000 (or something insane like 1million), something I just tested this on `-r 2.1.0) and it ran through.

That said, you have reminded me a better work around was posted here which I will implement in -r dev now!

narning1992 commented 3 years ago

Hi James, thank you so much for your quick reply. That was super insightful! I ran the code with your recommendations and it worked like a charm. Didn't know that bit about using the pmdtools output. Ill just use the "raw" default flag then instead.

Thanks again! Nick

jfy133 commented 3 years ago

Regardless, after much pain with Bash, got a more streamlined procedure to work allowing continuation of the pipeline even if 141 is reported by PMDtools. See #568

Will close this issue once that is merged into -r dev.