nf-core / eager

A fully reproducible and state-of-the-art ancient DNA analysis pipeline
https://nf-co.re/eager
MIT License

Resume after bwa aln not working #906

Closed - jcgrenier closed this issue 2 years ago

jcgrenier commented 2 years ago

Hello! I'm running eager v2.4.4 on a set of paired-end samples, but I've had some issues with DamageProfiler, a step that runs after mapping. I modified my config file so that DamageProfiler could be launched with more memory. However, when I try to restart the pipeline with the -resume option, it starts again at the mapping step. The aligned files don't seem to have been cached. Is there a way to fix this?
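For reference, the override I added for DamageProfiler looks roughly like this (not my exact snippet, and the values are just illustrative):

process {
  withName: 'damageprofiler' {
    memory = 16.GB    // illustrative value; raised until the task stopped failing
  }
}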

Thanks a lot for your help! Jean-Christophe

jfy133 commented 2 years ago

Hi @jcgrenier, could you send the command you are using? We need to be able to replicate the error to identify where the problem is (unfortunately, finding the causes of resume problems can be tricky).

jcgrenier commented 2 years ago

Sure, here it is!

#!/bin/bash
module load nextflow singularity

# Run offline, submit jobs via SLURM, and reuse cached Singularity images
export NXF_OFFLINE='TRUE'
export NXF_EXECUTOR=slurm
export NXF_SINGULARITY_CACHEDIR="$HOME/singularity_containers"   # note: ~ is not expanded inside quotes

nextflow run nf-core/eager -profile singularity -c ../paired_end_nf-core_eager.config -resume insane_darwin

My custom config file:

params.input = "M*R{1,2}*001.fastq.gz"
params.fasta = '~/GRCh37/Sequence/WholeGenomeFasta/genome.fa'
params.bwa_index = '~/GRCh37/Sequence/BWAIndex/version0.7.x'
params.fasta_index = '~/GRCh37/Sequence/WholeGenomeFasta/genome.fa.fai'
params.seq_dict = '~/GRCh37/Sequence/WholeGenomeFasta/genome.dict'

params.run_genotyping = true
params.genotyping_tool = 'hc'
params.genotyping_source = 'raw'
params.gatk_ploidy = 2

// NXF_OPTS / NXF_JVM_ARGS are shell environment variables, not Nextflow
// config settings - export them in the launch script instead, e.g.:
//   export NXF_OPTS='-Xms512M -Xmx10G'
//   export NXF_JVM_ARGS='-Xmx10G'

params {
  config_profile_name = 'NF-core eager Paired-end profile'
  config_profile_description = 'Pipeline to run paired-end ancient DNA samples on a SLURM system.'

  max_cpus = 12
  max_memory = 40.GB
  max_time = 24.h
}

process {
  // executor and clusterOptions belong in the process scope, not params
  executor = 'slurm'
  clusterOptions = '--account=ctb'

  withName: 'bwa' {
    cpus = 48
    memory = 40.GB
    time = 72.h
  }
  withName: 'damageprofiler' {
    cpus = 1
  }
}

Thanks for looking into it!

jfy133 commented 2 years ago

Hi @jcgrenier, I'm unable to replicate this locally with our normal (mini) test data, so I will need to find something bigger and see if I can replicate it in a more realistic scenario. Unfortunately this will take a bit more time, as I am organising and running a summer school over the next two weeks.

In the meantime could you send me a rough estimate of how large your FASTQ files are?

Also, could you see what happens if you don't specify a run name when -resume-ing?
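i.e. drop the name and let Nextflow pick up the last run - something like this (from memory, nextflow log with no arguments lists your previous runs and their names/session IDs):

nextflow log    # shows previous runs, their run names, and session IDs
nextflow run nf-core/eager -profile singularity -c ../paired_end_nf-core_eager.config -resume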

jcgrenier commented 2 years ago

Hello @jfy133, thanks for trying it with your test data. The FASTQ files in my dataset vary from 300 MB to 13 GB (and it's paired-end data), so you can imagine that the problematic ones, for which DamageProfiler was not able to run to completion, are the bigger ones. I also tried resuming without the run name, and it gave the same result. Yesterday though, I was able to resume a test run which had fewer samples in it, so I'm not really sure why the first run wasn't cached. Would there be a way to set it up manually? Thanks!

jfy133 commented 2 years ago

Hmm ok. That's very strange if it does work sometimes (that shouldn't really happen - unless something is messing with the work/ dir...)

What do you mean by "manually set it up"?

If you can, could you try again with just the large samples (so reduce the number of samples, but keep the 'problematic' ones) - just to ensure this isn't some stochastic thing?

I just want to make sure it is a systematic thing first.
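If you want to dig into it yourself in the meantime: resume works by hashing each task's inputs and script, so if a task's hash differs between the original and the resumed run, that task (and everything downstream of it) reruns. You can print the hashes per run with nextflow log - roughly like this (field names from memory, see nextflow log -h):

nextflow log insane_darwin -f hash,name,status
nextflow log <name_of_resumed_run> -f hash,name,status

and compare the bwa task hashes between the two runs.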

jcgrenier commented 2 years ago

Sure, I can do some more tests! What I meant was: is it possible to start the pipeline partway through manually, knowing that some files already exist - in this case, my BAM files?
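(For context: I saw the eager docs mention a TSV input format with a BAM column - would pointing that at my existing BAMs be the way to skip mapping? A hypothetical row, with column names from memory, so please check the usage docs:

Sample_Name	Library_ID	Lane	Colour_Chemistry	SeqType	Organism	Strandedness	UDG_Treatment	R1	R2	BAM
Sample1	Lib1	1	4	PE	Human	double	none	NA	NA	/path/to/Sample1.bam
)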

Thanks!

jcgrenier commented 2 years ago

Hello @jfy133 ,

I did another test run and now it works! I don't know why it failed the first time. Some samples had crashed at this step due to the walltime, and I manually increased it when retrying, so maybe that error is why the resume didn't work. Thanks for looking into it for me!

JC
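PS - a cleaner alternative to bumping the walltime by hand might be letting Nextflow retry with scaled resources, something along these lines (illustrative values, and eager's base config may already retry some failures on its own):

process {
  withName: 'damageprofiler' {
    // scale walltime and memory with each retry attempt
    time   = { 24.h * task.attempt }
    memory = { 16.GB * task.attempt }
    errorStrategy = 'retry'
    maxRetries = 2
  }
}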

jfy133 commented 2 years ago

OK! Thanks for the update. Let's close the issue for now, but we can reopen it if it happens again!