Open kdivilov opened 3 months ago
Hi there, thanks for the report. A few details are missing: (1) How do you specify the number of cores? I don't see that in the command. (2) What is the actual error message? For example, the .nextflow.log of the failed run would include it.
I believe the issue is that the workflow gets ahead of itself due to the number of cores available and starts a qiime2 module before another requisite qiime2 module finishes.
That should be impossible; if it were to happen, it would indeed be a bug. You could also try a more up-to-date Nextflow version in the future. But I somehow doubt that this is the cause.
Closing due to lack of information.
Sorry for the delay. I have attached the log file. I updated Nextflow to v24.04.3 for this run. I specified the number of cores using the `-c` option in Slurm's sbatch.
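For context, the submission looks roughly like this (a sketch; the pipeline parameters are elided and the profile is an assumption based on the system info below):

```shell
#!/bin/bash
#SBATCH -c 120   # cores allocated to the job; Nextflow sees 120 CPUs on the node

nextflow run nf-core/ampliseq -r 2.10.0 -profile singularity ...
```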
Thanks! The error message in that log file is:
Jul-19 16:20:51.757 [TaskFinalizer-5] ERROR nextflow.processor.TaskProcessor - Error executing process > 'NFCORE_AMPLISEQ:AMPLISEQ:QIIME2_DIVERSITY:QIIME2_DIVERSITY_BETA (unweighted_unifrac_distance_matrix - HatcheryGut_Stayton_vs_HatcheryGut_Marion)'
Caused by:
Process `NFCORE_AMPLISEQ:AMPLISEQ:QIIME2_DIVERSITY:QIIME2_DIVERSITY_BETA (unweighted_unifrac_distance_matrix - HatcheryGut_Stayton_vs_HatcheryGut_Marion)` terminated with an error exit status (1)
Command executed:
export XDG_CONFIG_HOME="./xdgconfig"
export MPLCONFIGDIR="./mplconfigdir"
export NUMBA_CACHE_DIR="./numbacache"
qiime diversity beta-group-significance \
--i-distance-matrix unweighted_unifrac_distance_matrix.qza \
--m-metadata-file metadata.tsv \
--m-metadata-column "HatcheryGut_Stayton_vs_HatcheryGut_Marion" \
--o-visualization unweighted_unifrac_distance_matrix-HatcheryGut_Stayton_vs_HatcheryGut_Marion.qzv \
--p-pairwise
qiime tools export \
--input-path unweighted_unifrac_distance_matrix-HatcheryGut_Stayton_vs_HatcheryGut_Marion.qzv \
--output-path beta_diversity/unweighted_unifrac_distance_matrix-HatcheryGut_Stayton_vs_HatcheryGut_Marion
cat <<-END_VERSIONS > versions.yml
"NFCORE_AMPLISEQ:AMPLISEQ:QIIME2_DIVERSITY:QIIME2_DIVERSITY_BETA":
qiime2: $( qiime --version | sed '1!d;s/.* //' )
END_VERSIONS
Command exit status:
1
Command output:
(empty)
Command error:
QIIME is caching your current deployment for improved performance. This may take a few moments and should only happen once per deployment.
Plugin error from diversity:
[Errno 2] No such file or directory: '/tmp/qiime2/divilovk/processes/40-1721431222.49@divilovk/9b75043b-6c1f-454e-8320-ce1123afdd55.4883255971269328549/9b75043b-6c1f-454e-8320-ce1123afdd55' -> '/tmp/qiime2/divilovk/data/9b75043b-6c1f-454e-8320-ce1123afdd55'
Debug info has been saved to /tmp/qiime2-q2cli-err-y03qaw_e.log
Work dir:
/nfs6/core/scratch/divilovk/couch/microbiome/GTDB/work/ff/fcf14e1f706db6f3665c1098af479f
Tip: you can try to figure out what's wrong by changing to the process work dir and showing the script file named `.command.sh`
I am not sure what causes this error. It seems pretty likely to me that it's a problem with the tmp dir (maybe it was full, or was coincidentally deleted at that specific time point). My hypothesis is that it was a coincidence that the job didn't finish when you specified 120 CPUs but succeeded with 20 CPUs (QIIME2_DIVERSITY_BETA runs with only 2 cores by default). Even if you had modified the cpus, my alternative hypothesis is that the second run simply started on a different node that had a working tmp dir. To test that hypothesis, you would need to run the pipeline repeatedly with the high core count at different time points, and potentially contact your sysadmin about the tmp dir to make sure none of the nodes fills all available disk space. Let me know if, after several attempts (use -resume, but avoid caching by deleting the work dir of QIIME2_DIVERSITY_BETA every time, or by modifying the config), the pipeline still fails with the high core number.
A next step might be this: there is a discussion in the QIIME2 forum that might be related and includes some troubleshooting, see here.
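To skip the cached result on -resume without deleting work dirs by hand, a custom config along these lines could be passed with `-c custom.config` (a sketch; `cache` and the `env` scope are standard Nextflow, the process name is taken from the log above, and the scratch path is a placeholder):

```groovy
// custom.config — pass to the run with: nextflow run ... -c custom.config
process {
    withName: 'QIIME2_DIVERSITY_BETA' {
        cache = false                  // always re-execute this process on -resume
    }
}

env {
    TMPDIR = '/path/to/scratch/tmp'    // placeholder: redirect temp files away from node-local /tmp
}
```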
Changing the tmp dir to one that has 130 TB of free space produces the same error, except that the path in the error message now points to the new tmp dir instead of /tmp.
Did you figure out the issue or did you find a fix?
No, sorry.
Description of the bug
I'm analyzing a 16S dataset with ~1,200 samples spread across 3 runs (unfortunately this dataset is not public yet so I can't provide a reproducible example). I've found a bug in the ampliseq pipeline where if I run the pipeline with 120 cores it will fail (specifically at the diversity step) but if I run it with 20 cores it will finish without any errors. I believe the issue is that the workflow gets ahead of itself due to the number of cores available and starts a qiime2 module before another requisite qiime2 module finishes.
Command used and terminal output
Relevant files
No response
System information
Nextflow v23.10.1
nf-core/ampliseq v2.10.0
Singularity v4.1.3
Slurm v23.02.1