nf-core / ampliseq

Amplicon sequencing analysis workflow using DADA2 and QIIME2
https://nf-co.re/ampliseq
MIT License

Pipeline fails when run with a lot of cores #763

Open kdivilov opened 3 months ago

kdivilov commented 3 months ago

Description of the bug

I'm analyzing a 16S dataset with ~1,200 samples spread across 3 runs (unfortunately this dataset is not public yet, so I can't provide a reproducible example). I've found a bug in the ampliseq pipeline: if I run the pipeline with 120 cores it fails (specifically at the diversity step), but if I run it with 20 cores it finishes without any errors. I believe the issue is that, with more cores available, the workflow gets ahead of itself and starts a QIIME 2 module before a prerequisite QIIME 2 module has finished.

Command used and terminal output

nextflow run nf-core/ampliseq -r 2.10.0 -profile singularity \
--input samplesheet.tsv \
--metadata metadata.tsv \
--outdir nfcore_ampliseq_GTDB \
--min_read_counts 1000 \
--ignore_empty_input_files \
--ignore_failed_trimming \
--ignore_failed_filtering \
--skip_cutadapt \
--trunclenf 200 \
--trunclenr 150 \
--vsearch_cluster \
--filter_ssu "bac" \
--exclude_taxa "mitochondria,chloroplast,archaea" \
--metadata_category_barplot "condition" \
--tax_agglom_max 7 \
--picrust \
--ancombc \
--dada_ref_taxonomy gtdb=R09-RS220 \
--dada_taxonomy_rc

Relevant files

No response

System information

Nextflow v23.10.1 nf-core/ampliseq v2.10.0 singularity v4.1.3 slurm v23.02.1

d4straub commented 3 months ago

Hi there, thanks for the report. A few details are missing: (1) How do you specify the number of cores? I don't see that in the command. (2) What is the actual error message? The .nextflow.log of the failed run, for example, would include it.

I believe the issue is that the workflow gets ahead of itself due to the number of cores available and starts a qiime2 module before another requisite qiime2 module finishes.

That should be impossible; if it were to happen, it would indeed be a bug. You could also use a more up-to-date Nextflow version in the future, but I somehow doubt that this is the cause.
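
For context, a minimal Nextflow DSL2 sketch (hypothetical processes, not taken from ampliseq) of why that ordering is enforced: a downstream process only starts once its input channel has received the upstream output, regardless of how many cores are free.

// Hypothetical illustration: STEP_B cannot start before STEP_A has
// emitted a.txt, because Nextflow schedules tasks by dataflow
// dependencies, not by available core count.
process STEP_A {
    output:
    path 'a.txt'

    script:
    "echo done > a.txt"
}

process STEP_B {
    input:
    path x

    script:
    "cat $x"
}

workflow {
    STEP_B(STEP_A())
}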

d4straub commented 3 months ago

Closing due to lack of information.

kdivilov commented 3 months ago

Sorry for the delay. I have attached the log file. I updated Nextflow to v24.04.3 for this run. I specified the number of cores using the `-c` option of Slurm's sbatch.

nf_log.txt
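
For reference, a hypothetical sketch of such a submission script (the actual script was not posted); only the -c value differed between the failing and succeeding runs:

#!/bin/bash
#SBATCH -c 120   # cores available to the job; the run failed with 120 and succeeded with 20

nextflow run nf-core/ampliseq -r 2.10.0 -profile singularity ...   # full command as given above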

d4straub commented 3 months ago

Thanks! The error message in that log file is:

Jul-19 16:20:51.757 [TaskFinalizer-5] ERROR nextflow.processor.TaskProcessor - Error executing process > 'NFCORE_AMPLISEQ:AMPLISEQ:QIIME2_DIVERSITY:QIIME2_DIVERSITY_BETA (unweighted_unifrac_distance_matrix - HatcheryGut_Stayton_vs_HatcheryGut_Marion)'

Caused by:
  Process `NFCORE_AMPLISEQ:AMPLISEQ:QIIME2_DIVERSITY:QIIME2_DIVERSITY_BETA (unweighted_unifrac_distance_matrix - HatcheryGut_Stayton_vs_HatcheryGut_Marion)` terminated with an error exit status (1)

Command executed:

  export XDG_CONFIG_HOME="./xdgconfig"
  export MPLCONFIGDIR="./mplconfigdir"
  export NUMBA_CACHE_DIR="./numbacache"

  qiime diversity beta-group-significance \
      --i-distance-matrix unweighted_unifrac_distance_matrix.qza \
      --m-metadata-file metadata.tsv \
      --m-metadata-column "HatcheryGut_Stayton_vs_HatcheryGut_Marion" \
      --o-visualization unweighted_unifrac_distance_matrix-HatcheryGut_Stayton_vs_HatcheryGut_Marion.qzv \
      --p-pairwise
  qiime tools export \
      --input-path unweighted_unifrac_distance_matrix-HatcheryGut_Stayton_vs_HatcheryGut_Marion.qzv \
      --output-path beta_diversity/unweighted_unifrac_distance_matrix-HatcheryGut_Stayton_vs_HatcheryGut_Marion

  cat <<-END_VERSIONS > versions.yml
  "NFCORE_AMPLISEQ:AMPLISEQ:QIIME2_DIVERSITY:QIIME2_DIVERSITY_BETA":
      qiime2: $( qiime --version | sed '1!d;s/.* //' )
  END_VERSIONS

Command exit status:
  1

Command output:
  (empty)

Command error:
  QIIME is caching your current deployment for improved performance. This may take a few moments and should only happen once per deployment.
  Plugin error from diversity:

    [Errno 2] No such file or directory: '/tmp/qiime2/divilovk/processes/40-1721431222.49@divilovk/9b75043b-6c1f-454e-8320-ce1123afdd55.4883255971269328549/9b75043b-6c1f-454e-8320-ce1123afdd55' -> '/tmp/qiime2/divilovk/data/9b75043b-6c1f-454e-8320-ce1123afdd55'

  Debug info has been saved to /tmp/qiime2-q2cli-err-y03qaw_e.log

Work dir:
  /nfs6/core/scratch/divilovk/couch/microbiome/GTDB/work/ff/fcf14e1f706db6f3665c1098af479f

Tip: you can try to figure out what's wrong by changing to the process work dir and showing the script file named `.command.sh`
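
Following that tip, a typical debugging session would look like this (using the work dir from the log above):

cd /nfs6/core/scratch/divilovk/couch/microbiome/GTDB/work/ff/fcf14e1f706db6f3665c1098af479f
cat .command.sh     # the exact script Nextflow executed for this task
cat .command.err    # stderr, including the QIIME 2 traceback
bash .command.run   # re-run the task in place with the same staging and environment
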
d4straub commented 3 months ago

I am not sure what causes this error. It seems pretty likely to me that it's a problem with the tmp dir (maybe it was full, or it was coincidentally deleted at that specific point in time). My hypothesis is that it was a coincidence that the job didn't finish when you specified 120 CPUs but succeeded with 20 CPUs, because QIIME2_DIVERSITY_BETA runs with only 2 cores by default anyway. Even if you had modified the cpus setting, my alternative hypothesis is that by doing so the job was simply started the second time on a different node that had a working tmp dir. To test that, you would need to run the pipeline repeatedly with the high core count at different points in time, and potentially contact your sysadmin about the tmp dirs to make sure none of them fills all available disk space. Let me know if, after several attempts (use -resume, but avoid caching by deleting the work dir of QIIME2_DIVERSITY_BETA every time, or by modifying the config as sketched below), the pipeline still fails with the high core number.
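
For the config route, a minimal sketch of a custom config (hypothetical file name custom.config, passed to the pipeline with -c custom.config) that forces QIIME2_DIVERSITY_BETA to re-run on every -resume instead of being restored from the cache:

process {
    withName: 'QIIME2_DIVERSITY_BETA' {
        cache = false   // always re-execute this process, even under -resume
    }
}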

A possible next step: there is a discussion in the QIIME 2 forum with some troubleshooting that might be related, see here.

kdivilov commented 3 months ago

Changing the tmp dir to one that has 130 TB of free space produces the same error, except that /tmp in the error message now points to the new tmp dir.
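
For reference, the exact way the tmp dir was redirected is not stated here; with the singularity profile, one common approach is to bind a large scratch area over the container's /tmp in a custom config, for example:

singularity {
    runOptions = '--bind /path/to/large/scratch:/tmp'   // hypothetical host path
}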

d4straub commented 1 week ago

Did you figure out the issue or did you find a fix?

kdivilov commented 1 week ago

No, sorry.