Nextflow version 23.10.0 is incompatible with QIIME2 (pipeline fails), solution: prepend 'NXF_VER=23.04.4' when running the pipeline

bioinfogaby commented 1 year ago

Description of the bug

FYI, ASV_seqs.fasta is present in the work directory (b12ddbfb05b35bef6eb415cb9f1ef0). I have no clue about the origin of the error.

ERROR ~ Error executing process > 'NFCORE_AMPLISEQ:AMPLISEQ:QIIME2_TAXONOMY:QIIME2_INSEQ (ASV_seqs.fasta)'

Caused by:
  Process `NFCORE_AMPLISEQ:AMPLISEQ:QIIME2_TAXONOMY:QIIME2_INSEQ (ASV_seqs.fasta)` terminated with an error exit status (151)

Command executed:

  qiime tools import \
      --input-path "ASV_seqs.fasta" \
      --type 'FeatureData[Sequence]' \
      --output-path rep-seqs.qza

  cat <<-END_VERSIONS > versions.yml
  "NFCORE_AMPLISEQ:AMPLISEQ:QIIME2_TAXONOMY:QIIME2_INSEQ":
      qiime2: $( qiime --version | sed '1!d;s/.* //' )
  END_VERSIONS

Command exit status:
  151

Command output:
  (empty)

Command error:
  Matplotlib created a temporary config/cache directory at /tmp/matplotlib-38xyr0wz because the default path (/home/qiime2/matplotlib) is not a writable directory; it is highly recommended to set the MPLCONFIGDIR environment variable to a writable directory, in particular to speed up the import of Matplotlib and to better support multiprocessing.
  Traceback (most recent call last):
    File "/opt/conda/envs/qiime2-2023.7/lib/python3.8/site-packages/q2cli/builtin/tools.py", line 266, in import_data
      artifact = qiime2.sdk.Artifact.import_data(type, input_path,
    File "/opt/conda/envs/qiime2-2023.7/lib/python3.8/site-packages/qiime2/sdk/result.py", line 299, in import_data
      pm = qiime2.sdk.PluginManager()
    File "/opt/conda/envs/qiime2-2023.7/lib/python3.8/site-packages/qiime2/sdk/plugin_manager.py", line 67, in __new__
      self._init(add_plugins=add_plugins)
    File "/opt/conda/envs/qiime2-2023.7/lib/python3.8/site-packages/qiime2/sdk/plugin_manager.py", line 105, in _init
      plugin = entry_point.load()
    File "/opt/conda/envs/qiime2-2023.7/lib/python3.8/site-packages/pkg_resources/__init__.py", line 2518, in load
      return self.resolve()
    File "/opt/conda/envs/qiime2-2023.7/lib/python3.8/site-packages/pkg_resources/__init__.py", line 2524, in resolve
      module = __import__(self.module_name, fromlist=['__name__'], level=0)
    File "/opt/conda/envs/qiime2-2023.7/lib/python3.8/site-packages/q2_diversity/__init__.py", line 11, in <module>
      from ._beta import (beta, beta_phylogenetic, bioenv,
    File "/opt/conda/envs/qiime2-2023.7/lib/python3.8/site-packages/q2_diversity/_beta/__init__.py", line 13, in <module>
      from ._beta_rarefaction import beta_rarefaction
    File "/opt/conda/envs/qiime2-2023.7/lib/python3.8/site-packages/q2_diversity/_beta/_beta_rarefaction.py", line 23, in <module>
      from .._ordination import pcoa
    File "/opt/conda/envs/qiime2-2023.7/lib/python3.8/site-packages/q2_diversity/_ordination.py", line 20, in <module>
      import umap as up
    File "/opt/conda/envs/qiime2-2023.7/lib/python3.8/site-packages/umap/__init__.py", line 2, in <module>
      from .umap_ import UMAP
    File "/opt/conda/envs/qiime2-2023.7/lib/python3.8/site-packages/umap/umap_.py", line 41, in <module>
      from umap.layouts import (
    File "/opt/conda/envs/qiime2-2023.7/lib/python3.8/site-packages/umap/layouts.py", line 40, in <module>
      def rdist(x, y):
    File "/opt/conda/envs/qiime2-2023.7/lib/python3.8/site-packages/numba/core/decorators.py", line 234, in wrapper
      disp.enable_caching()
    File "/opt/conda/envs/qiime2-2023.7/lib/python3.8/site-packages/numba/core/dispatcher.py", line 863, in enable_caching
      self._cache = FunctionCache(self.py_func)
    File "/opt/conda/envs/qiime2-2023.7/lib/python3.8/site-packages/numba/core/caching.py", line 601, in __init__
      self._impl = self._impl_class(py_func)
    File "/opt/conda/envs/qiime2-2023.7/lib/python3.8/site-packages/numba/core/caching.py", line 337, in __init__
      raise RuntimeError("cannot cache function %r: no locator available "
  RuntimeError: cannot cache function 'rdist': no locator available for file '/opt/conda/envs/qiime2-2023.7/lib/python3.8/site-packages/umap/layouts.py'

  An unexpected error has occurred:

    cannot cache function 'rdist': no locator available for file '/opt/conda/envs/qiime2-2023.7/lib/python3.8/site-packages/umap/layouts.py'

  See above for debug info.

Work dir:
  /mnt/DATA/bioinfo/projects/AmpliseqDir/work/e5/b12ddbfb05b35bef6eb415cb9f1ef0

Tip: you can try to figure out what's wrong by changing to the process work dir and showing the script file named `.command.sh`

 -- Check '.nextflow.log' file for details

Command used and terminal output

$ nextflow run nf-core/ampliseq -r '2.7.0' -profile singularity --input ' /mnt/DATA/bioinfo/projects/AmpliseqDir/sampleSheet_16S.tsv' --FW_primer 'CCTACGGGNGGCWGCAG' --RV_primer 'GACTACHVGGGTATCTAATCC' --metadata ' /mnt/DATA/bioinfo/projects/AmpliseqDir/metadata.tsv' --outdir ampliseq_results_16S_2.7.0 --max_cpus 4 --max_memory 15GB

Relevant files

No response

System information

Nextflow version: 23.10.0 build 5889
Hardware: Desktop
Executor: local
Container engine: Singularity
OS: Linux
Version of nf-core/ampliseq: 2.7.0

d4straub commented 1 year ago

Thanks for the report! As it happens, just now I tested that because I also encountered that error. I found the following:

#works:
NXF_VER=21.10.3 nextflow run nf-core/ampliseq -r 2.3.0 -profile test,singularity --outdir results_test_2-3-0 -resume
NXF_VER=23.04.0 nextflow run nf-core/ampliseq -r 2.7.0 -profile test,singularity --outdir results_test_2-7-0 -resume
NXF_VER=23.04.4 nextflow run nf-core/ampliseq -r 2.7.0 -profile test,singularity --outdir results_test_2-7-0 -resume
#fails:
NXF_VER=23.10.0 nextflow run nf-core/ampliseq -r 2.7.0 -profile test,singularity --outdir results_test_2-7-0 -resume

Conclusion: Nextflow version 23.10.0 is somehow incompatible. Could you use one of the working versions listed above and report back whether that solves your issue? For example, prepend NXF_VER=23.04.4 to your command when starting the pipeline.

Edit: The error I encountered was in process NFCORE_AMPLISEQ:AMPLISEQ:QIIME2_PREPTAX:QIIME2_EXTRACT (GTGYCAGCMGCCGCGGTAA-GGACTACNVGGGTWTCTAAT), but it ended with the same RuntimeError: cannot cache function 'rdist': no locator available for file '/opt/conda/envs/qiime2-2023.7/lib/python3.8/site-packages/umap/layouts.py', so I conclude the problem is QIIME2 again [it just causes trouble from time to time...].

bioinfogaby commented 1 year ago

Yeap, that worked. Thanks!

muffato commented 1 year ago

Nextflow 23.10 adds the --no-home option when using Singularity. Maybe this tool wanted to cache data under the home directory?

d4straub commented 1 year ago

Hm, quite possible. How to test that?

- docker instead of singularity should work
- make the tool use another cache

Correct?

MatthiasZepper commented 1 year ago

> Nextflow 23.10 adds the --no-home option when using Singularity. Maybe this tool wanted to cache data under the home directory?

That is exactly what the error message states, no?

Matplotlib created a temporary config/cache directory at /tmp/matplotlib-38xyr0wz 
because the default path (/home/qiime2/matplotlib) is not a writable directory; 
it is highly recommended to set the MPLCONFIGDIR environment variable to a writable directory, 
in particular to speed up the import of Matplotlib and to better support multiprocessing.

In previous versions, /home was mounted inside the container and writeable, so /home/qiime2/matplotlib could be used for caching.
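
One way to check this on a given system is to test whether the home directory is writable from inside the container. The following is a minimal sketch, assuming a POSIX shell; the variable names `target` and `home_status` are only illustrative, and the snippet is only meaningful when it runs inside the container (for example as the first lines of a process script):

```shell
# Directory to probe; defaults to $HOME, or "/" if HOME is unset (illustrative choice).
target="${HOME:-/}"

# -d: the path exists and is a directory; -w: the current user may write to it.
if [ -d "$target" ] && [ -w "$target" ]; then
    home_status="writable"
else
    home_status="not writable"
fi
echo "home directory ${target}: ${home_status}"
```

If this reports "not writable" under Nextflow 23.10.0 but "writable" under 23.04.4, that would support the --no-home explanation.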

d4straub commented 1 year ago

I just wanted to confirm the problem on my laptop (to later run with docker; I have installed singularity & docker) with NXF_VER=23.10.0 nextflow run nf-core/ampliseq -r 2.7.0 -profile test,singularity --outdir results_test_2-7-0 and it succeeded. So it seems to differ between systems.

So I tried all 3 systems that I have available at the moment:

Failing system (Workstation):

Nextflow version: 23.10.0.5889
Java version: openjdk 17.0.9 2023-10-17 LTS, OpenJDK Runtime Environment (Red_Hat-17.0.9.0.9-1) (build 17.0.9+9-LTS)
Operating system: Linux 5.14.0-284.30.1.el9_2.x86_64 x86_64
Bash version: GNU bash, version 5.1.8(1)-release (x86_64-redhat-linux-gnu)
Singularity: singularity version 3.8.7-1.el9

Succeeding system (Laptop):

Nextflow version: 23.10.0.5889
Java version: openjdk 11.0.20.1 2023-08-24, OpenJDK Runtime Environment (build 11.0.20.1+1-post-Ubuntu-0ubuntu120.04)
Operating system: Linux 5.15.0-87-generic x86_64
Bash version: GNU bash, version 5.0.17(1)-release (x86_64-pc-linux-gnu)
Singularity: singularity-ce version 3.10.5-focal

Succeeding system (HPC):

Nextflow version: 23.10.0.5889
Java version: openjdk 17.0.5 2022-10-18 LTS, OpenJDK Runtime Environment (Red_Hat-17.0.5.0.8-2.el8_6) (build 17.0.5+8-LTS)
Operating system: Linux 4.18.0-372.32.1.el8_6.x86_64 x86_64
Bash version: GNU bash, version 4.4.20(1)-release (x86_64-redhat-linux-gnu)
Singularity: singularity version 3.8.7-1.el8
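
For anyone who wants to compare their own system against the three above, the same fields can be gathered with a few standard version commands. This is only a sketch: the helper name `report` is made up for illustration, and tools that are not installed are reported as "not found" rather than causing an error.

```shell
# Print the first line of a tool's version output, or "not found" if the tool is missing.
report() {
    label="$1"; shift
    if command -v "$1" >/dev/null 2>&1; then
        # Some tools (e.g. java) print their version on stderr, hence 2>&1.
        printf '%s: %s\n' "$label" "$("$@" 2>&1 | head -n 1)"
    else
        printf '%s: not found\n' "$label"
    fi
}

report "Nextflow version" nextflow -v
report "Java version" java -version
report "Operating system" uname -srm
report "Bash version" bash --version
report "Singularity" singularity --version
```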

muffato commented 1 year ago

> That is exactly what the error message states, no?
>
> Matplotlib created a temporary config/cache directory at /tmp/matplotlib-38xyr0wz
> because the default path (/home/qiime2/matplotlib) is not a writable directory;
> it is highly recommended to set the MPLCONFIGDIR environment variable to a writable directory,
> in particular to speed up the import of Matplotlib and to better support multiprocessing.
>
> In previous versions, /home was mounted inside the container and writeable, so /home/qiime2/matplotlib could be used for caching.

Great. In that case, I would imagine two workarounds:

1. In the module, do something like export MPLCONFIGDIR=$PWD to keep the cache local to the work directory, which is already mounted read+write.
2. Add an optional input channel to the module for a cache directory, which Nextflow would properly stage and mount, and use it in export MPLCONFIGDIR (and default to $PWD).
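
The two workarounds can be combined in the shell part of a module roughly as follows. This is a hedged sketch, not the module's actual code: `CACHE_DIR` is a hypothetical variable standing in for the optional staged cache directory, and the fallback is the current work directory.

```shell
# CACHE_DIR is hypothetical: it stands in for an optional, staged cache directory.
# If it is unset or empty, fall back to the current (work) directory.
cache_dir="${CACHE_DIR:-$PWD}"
mkdir -p "$cache_dir"

# Point Matplotlib at the chosen directory instead of the (possibly read-only) home.
export MPLCONFIGDIR="$cache_dir"
echo "Matplotlib config/cache directory: $MPLCONFIGDIR"
```

With `CACHE_DIR` unset this behaves like the first workaround; setting it to a directory that is reused between runs corresponds to the second.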

d4straub commented 1 year ago

Thanks @muffato & @MatthiasZepper for your interest and suggestions! I added export MPLCONFIGDIR="${PWD}/HOME" to all processes that use QIIME2, and the problematic MPLCONFIGDIR warning is indeed gone, but the process still fails. It seems to me that Matplotlib wasn't the root of the problem.

Here is the complete error message for NXF_VER=23.10.0 nextflow run d4straub/ampliseq -r fix-NXF_VER=23.10.0 -profile test,singularity --outdir results:

ERROR ~ Error executing process > 'NFCORE_AMPLISEQ:AMPLISEQ:QIIME2_PREPTAX:QIIME2_EXTRACT (GTGYCAGCMGCCGCGGTAA-GGACTACNVGGGTWTCTAAT)'

Caused by:
  Process `NFCORE_AMPLISEQ:AMPLISEQ:QIIME2_PREPTAX:QIIME2_EXTRACT (GTGYCAGCMGCCGCGGTAA-GGACTACNVGGGTWTCTAAT)` terminated with an error exit status (1)

Command executed:

  export XDG_CONFIG_HOME="${PWD}/HOME"
  export MPLCONFIGDIR="${PWD}/HOME"

     ### Import
     qiime tools import \
         --type 'FeatureData[Sequence]' \
         --input-path greengenes85.fna \
         --output-path ref-seq.qza
     qiime tools import \
         --type 'FeatureData[Taxonomy]' \
         --input-format HeaderlessTSVTaxonomyFormat \
         --input-path greengenes85.tax \
         --output-path ref-taxonomy.qza
     #Extract sequences based on primers
     qiime feature-classifier extract-reads \
         --i-sequences ref-seq.qza \
         --p-f-primer GTGYCAGCMGCCGCGGTAA \
         --p-r-primer GGACTACNVGGGTWTCTAAT \
         --o-reads GTGYCAGCMGCCGCGGTAA-GGACTACNVGGGTWTCTAAT-ref-seq.qza \
         --quiet

     cat <<-END_VERSIONS > versions.yml
     "NFCORE_AMPLISEQ:AMPLISEQ:QIIME2_PREPTAX:QIIME2_EXTRACT":
         qiime2: $( qiime --version | sed '1!d;s/.* //' )
     END_VERSIONS

Command exit status:
  1

Command output:
  (empty)

Command error:
  Traceback (most recent call last):
    File "/opt/conda/envs/qiime2-2023.7/lib/python3.8/site-packages/q2cli/builtin/tools.py", line 266, in import_data
      artifact = qiime2.sdk.Artifact.import_data(type, input_path,
    File "/opt/conda/envs/qiime2-2023.7/lib/python3.8/site-packages/qiime2/sdk/result.py", line 299, in import_data
      pm = qiime2.sdk.PluginManager()
    File "/opt/conda/envs/qiime2-2023.7/lib/python3.8/site-packages/qiime2/sdk/plugin_manager.py", line 67, in __new__
      self._init(add_plugins=add_plugins)
    File "/opt/conda/envs/qiime2-2023.7/lib/python3.8/site-packages/qiime2/sdk/plugin_manager.py", line 105, in _init
      plugin = entry_point.load()
    File "/opt/conda/envs/qiime2-2023.7/lib/python3.8/site-packages/pkg_resources/__init__.py", line 2518, in load
      return self.resolve()
    File "/opt/conda/envs/qiime2-2023.7/lib/python3.8/site-packages/pkg_resources/__init__.py", line 2524, in resolve
      module = __import__(self.module_name, fromlist=['__name__'], level=0)
    File "/opt/conda/envs/qiime2-2023.7/lib/python3.8/site-packages/q2_diversity/__init__.py", line 11, in <module>
      from ._beta import (beta, beta_phylogenetic, bioenv,
    File "/opt/conda/envs/qiime2-2023.7/lib/python3.8/site-packages/q2_diversity/_beta/__init__.py", line 13, in <module>
      from ._beta_rarefaction import beta_rarefaction
    File "/opt/conda/envs/qiime2-2023.7/lib/python3.8/site-packages/q2_diversity/_beta/_beta_rarefaction.py", line 23, in <module>
      from .._ordination import pcoa
    File "/opt/conda/envs/qiime2-2023.7/lib/python3.8/site-packages/q2_diversity/_ordination.py", line 20, in <module>
      import umap as up
    File "/opt/conda/envs/qiime2-2023.7/lib/python3.8/site-packages/umap/__init__.py", line 2, in <module>
      from .umap_ import UMAP
    File "/opt/conda/envs/qiime2-2023.7/lib/python3.8/site-packages/umap/umap_.py", line 41, in <module>
      from umap.layouts import (
    File "/opt/conda/envs/qiime2-2023.7/lib/python3.8/site-packages/umap/layouts.py", line 40, in <module>
      def rdist(x, y):
    File "/opt/conda/envs/qiime2-2023.7/lib/python3.8/site-packages/numba/core/decorators.py", line 234, in wrapper
      disp.enable_caching()
    File "/opt/conda/envs/qiime2-2023.7/lib/python3.8/site-packages/numba/core/dispatcher.py", line 863, in enable_caching
      self._cache = FunctionCache(self.py_func)
    File "/opt/conda/envs/qiime2-2023.7/lib/python3.8/site-packages/numba/core/caching.py", line 601, in __init__
      self._impl = self._impl_class(py_func)
    File "/opt/conda/envs/qiime2-2023.7/lib/python3.8/site-packages/numba/core/caching.py", line 337, in __init__
      raise RuntimeError("cannot cache function %r: no locator available "
  RuntimeError: cannot cache function 'rdist': no locator available for file '/opt/conda/envs/qiime2-2023.7/lib/python3.8/site-packages/umap/layouts.py'

  An unexpected error has occurred:

    cannot cache function 'rdist': no locator available for file '/opt/conda/envs/qiime2-2023.7/lib/python3.8/site-packages/umap/layouts.py'

  See above for debug info.

Work dir:
  /home/bcgsd01/test_ampliseq/work/08/8a446d3afbec175518f6be0768a7d6

Tip: when you have fixed the problem you can continue the execution adding the option `-resume` to the run command line

 -- Check '.nextflow.log' file for details

MatthiasZepper commented 1 year ago

I think /tmp is the better choice nonetheless. What happens with this setting?

export MPLCONFIGDIR="/tmp/mplconfigdir"
export NUMBA_CACHE_DIR="/tmp/numbacache"

d4straub commented 1 year ago

Great, thanks, that did work indeed! Why would /tmp be better? Is that common practice in nf-core? (I am not aware of an example, are you?) I am just a bit worried about cloud environments (I cannot test those, except for the AWS test).

d4straub commented 1 year ago

I checked for examples in pipelines and nf-core/modules and found only very few examples:

nf-core/modules:

nf-core/marsseq:

export TMPDIR=/tmp

nf-core/rnafusion

export TMPDIR=/tmp

So nf-core modules seem to put tmp and home into the work dir (.), while the two local modules in pipelines use /tmp. This pipeline uses export XDG_CONFIG_HOME="\${PWD}/HOME", e.g. here.

Not sure what to conclude here, there are so many different solutions :)

muffato commented 1 year ago

Another example for your list, @d4straub: ./tmp in this module. The reason for choosing ./tmp instead of letting the tool use /tmp is that when the job is killed on an HPC, it would leave leftover files on /tmp, which take up space and may get in the way of other users / processes. By keeping the files local to ./tmp, everything stays within the user's own work directory, which the user can easily clean up.

In your case, and I know nothing about the tool so I'm just guessing, if it's a "cache", then presumably the tool will not clean it up at the end, since the purpose of a cache is to keep files around for the next run. Using /tmp and running the pipeline on an HPC would likely mean that every run hits a different compute node, so the cluster may accumulate caches on all compute nodes over time! That is not the purpose of /tmp :)

So either consider it a purely temporary necessity and use ./tmp, or make it a proper pipeline parameter that the user can set to wherever they want and reuse between runs. (By the way, would it even make sense to share the cache between pipeline runs?)

d4straub commented 1 year ago

Thanks for that great explanation! I'll test ./tmp to make sure it'll work fine.

I looked into where the older export code in ampliseq comes from and found that it was added in https://github.com/nf-core/ampliseq/pull/163. That PR changed export HOME=./HOME to export HOME="\${PWD}/HOME" here, because @skrakau suggested "Better not use relative paths". Since there are now some examples with relative paths and I found none with absolute paths, I prefer the relative paths. Or are you aware of any problems regarding relative paths, Sabrina?

MatthiasZepper commented 1 year ago

I think the nf-core #modules channel would be the appropriate place to get some more input on the issue.

I am certainly not an expert in this matter, but @muffato's explanation strikes me as strange. /tmp is the default path within the Linux file system structure for temporary files. Its sole purpose is to offer applications a dedicated place to store temporary files. These files are generally deleted whenever the system is restarted and may be deleted at any time by utilities such as tmpwatch. On HPC systems, /tmp is usually configured to mount a specific scratch file system that is better suited than the ordinary distributed file systems for quick caching and writes, and it is cleaned up when needed.

The reason why you do not find many modules or pipelines with explicit configuration is that many tools just use /tmp for caching files, and the respective container technologies mount the host's /tmp there. All the respective config happens on the profile level, and Nextflow has corresponding config options for the different executors (e.g. for AWS) and container technologies.

In summary, I advocate using /tmp, since it is the path that is specifically meant for this purpose and also configured accordingly on the different executors.

d4straub commented 1 year ago

Thanks a lot for your input! However, I did get the feedback that /tmp is not a good choice on, for example, our HPC, because it uses a scratch system that is not necessarily connected to /tmp. Therefore I was advised to use ./tmp or similar (which will then use the scratch system as intended). I did check the size of the folders that I want to redirect and it's just a couple of MB at most. So I'm going to use ./<folder> for now.
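
As a rough illustration of the ./<folder> approach, the two variables discussed earlier in this thread can be pointed at folders inside the work directory and their size checked afterwards. The folder names below are placeholders chosen for this sketch, not necessarily the ones used in the pipeline.

```shell
# Folder names are illustrative; any path under the work directory would do.
export MPLCONFIGDIR="./mplconfigdir"
export NUMBA_CACHE_DIR="./numbacache"
mkdir -p "$MPLCONFIGDIR" "$NUMBA_CACHE_DIR"

# ... the QIIME2 commands of the process would run here ...

# Report how much space the redirected folders take up.
du -sh "$MPLCONFIGDIR" "$NUMBA_CACHE_DIR"
```

Because the paths are relative, the folders end up wherever the task runs, including a scratch directory if the executor is configured to use one.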

muffato commented 1 year ago

Fair comment, @MatthiasZepper. You're absolutely right regarding what is best practice on an HPC, and what you're describing is exactly how things work on ours. One of the problems we're seeing is that tmpwatch is not running soon enough and we're often running out of space on /tmp (which causes us other problems like nodes having less RAM, etc). But I think it's reasonable to say it's our issue and I probably over-reached in my previous comment.

I was recently wondering if I could force something like export TMPDIR=$PWD/tmp at the beginning of each Nextflow job, rather than having to change every module, but I didn't know how. Thanks to your links, I've found process.beforeScript which seems to do exactly what I was looking for.
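
A minimal sketch of that idea, assuming a custom config file passed with -c: the file name custom.config is arbitrary, and the run command is left commented out. Depending on the executor and container engine, beforeScript may run outside the container, so whether the exported variable reaches the tool should be tested on the target system.

```shell
# Write a small custom config; the file name custom.config is arbitrary.
# The quoted heredoc delimiter keeps the shell from expanding $PWD and $TMPDIR here,
# so they are expanded later, at the start of each task.
cat > custom.config <<'EOF'
process {
    beforeScript = 'export TMPDIR="$PWD/tmp"; mkdir -p "$TMPDIR"'
}
EOF

# Launch with the extra config (commented out so the snippet has no side effects):
# nextflow run nf-core/ampliseq -r 2.7.0 -profile test,singularity --outdir results -c custom.config
```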

Coming back to the original error, we're looking for a place to let the tool record some cache files that it won't delete at the end. My view is still that there are two options. Either it is treated as a proper cache and made a proper Nextflow parameter that the user can select, that will be staged, and that can be reused between processes / runs. Or it is treated as temporary files that can be trashed after the run. In the latter case, I wouldn't support making it use /tmp (or $TMPDIR) without deleting the files directly afterwards. I think the module should clean up after itself. And actually, it should probably do that even if you end up using a local directory under the job directory.
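
One possible way for a module script to clean up after itself is an EXIT trap. This is a sketch under assumptions, not the pipeline's implementation: the folder name tool_cache is hypothetical, and the empty file stands in for whatever the tool would write.

```shell
# Run the task body in a subshell so the EXIT trap fires as soon as the body finishes.
(
    # Hypothetical scratch folder inside the job directory.
    cache_dir="$PWD/tool_cache"
    mkdir -p "$cache_dir"
    # Remove the folder when the subshell exits, whether the commands succeed or fail.
    trap 'rm -rf "$cache_dir"' EXIT

    export MPLCONFIGDIR="$cache_dir"
    export NUMBA_CACHE_DIR="$cache_dir"
    # ... the tool invocation would go here ...
    : > "$cache_dir/example.cache"   # stand-in for files the tool would write
)
```

The trap does not run if the job is killed with SIGKILL, which is one more argument for keeping the folder under the job directory, where leftovers are removed together with the work directory.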

d4straub commented 1 year ago

Thanks to all of you here, the fix is in dev branch now, will be in the next release. I close here but feel free to open another issue if you encounter any other problems.
