nf-core / mag

Assembly and binning of metagenomes
https://nf-co.re/mag
MIT License

Nf-core/mag does not recognize the cache when starting the pipeline #684

Closed pdy1084 closed 3 weeks ago

pdy1084 commented 4 weeks ago

Description of the bug

Hello,

I have been running the nf-core/mag pipeline for some time without any problem, but recently I wanted to test one of the most recent stable releases (3.0.3). The problem is that I cannot get nf-core/mag to recognize the previous cache. Some steps completed successfully in the previous runs, for example the assembly with MEGAHIT, but none of the former outputs are detected and the pipeline starts from scratch. (I know that this specific run failed due to MEGAHIT's memory allocation, but the main issue is that MEGAHIT should not be running at all.) In the last runs I started, the pipeline starts from the beginning and wants to run the assembly again, as you can see in the execution traces below (also in the attached image). I filtered the execution traces with `grep NFCORE_MAG:MAG:MEGAHIT execution*txt`:

execution_trace_2024-01-31_14-57-13.txt:7       43/54ba62       3801928 NFCORE_MAG:MAG:MEGAHIT (W1)     FAILED  1       2024-01-31 14:57:27.054  23h 34m 44s     23h 34m 42s     1936.7% 101.2 GB        113 GB  1.1 TB  605.2 GB
execution_trace_2024-02-01_16-31-09.txt:6       9e/82eb6b       1592758 NFCORE_MAG:MAG:MEGAHIT (W1)     COMPLETED       0       2024-02-01 16:31:31.119  1d 23m 50s      1d 23m 48s      1953.9% 101.2 GB        112.9 GB        1.1 TB  633 GB
execution_trace_2024-03-15_15-20-11.txt:6       9e/82eb6b       1592758 NFCORE_MAG:MAG:MEGAHIT (W1)     CACHED  0       2024-02-01 16:31:31.119  1d 23m 50s      1d 23m 48s      1953.9% 101.2 GB        112.9 GB        1.1 TB  633 GB
execution_trace_2024-03-18_14-29-41.txt:6       9e/82eb6b       1592758 NFCORE_MAG:MAG:MEGAHIT (W1)     CACHED  0       2024-02-01 16:31:31.119  1d 23m 50s      1d 23m 48s      1953.9% 101.2 GB        112.9 GB        1.1 TB  633 GB
execution_trace_2024-08-28_17-26-05.txt:6       9e/82eb6b       1592758 NFCORE_MAG:MAG:MEGAHIT (W1)     CACHED  0       2024-02-01 16:31:31.119  1d 23m 50s      1d 23m 48s      1953.9% 101.2 GB        112.9 GB        1.1 TB  633 GB
execution_trace_2024-08-28_17-27-28.txt:6       9e/82eb6b       1592758 NFCORE_MAG:MAG:MEGAHIT (W1)     CACHED  0       2024-02-01 16:31:31.119  1d 23m 50s      1d 23m 48s      1953.9% 101.2 GB        112.9 GB        1.1 TB  633 GB
execution_trace_2024-10-01_17-24-46.txt:7       31/ec952c       3000072 NFCORE_MAG:MAG:MEGAHIT (W1)     FAILED  247     2024-10-01 17:25:17.016  9h 40m 56s      9h 40m 56s      -       -       -       -       -
execution_trace_2024-10-02_19-21-52.txt:7       63/9a9367       3268086 NFCORE_MAG:MAG:MEGAHIT (W1)     FAILED  250     2024-10-02 19:22:15.439  16m 2s  16m 2s  -       -       -       -       -
execution_trace_2024-10-02_19-21-52.txt:14      9f/a9fffd       3704177 NFCORE_MAG:MAG:MEGAHIT (W1)     FAILED  250     2024-10-02 19:38:18.152  14m 23s 14m 22s -       -       -       -       -
execution_trace_2024-10-02_19-21-52.txt:15      b1/495d7e       4063325 NFCORE_MAG:MAG:MEGAHIT (W1)     FAILED  250     2024-10-02 19:52:40.778  14m 11s 14m 11s -       -       -       -       -
execution_trace_2024-10-02_19-21-52.txt:16      cf/33f48a       224172  NFCORE_MAG:MAG:MEGAHIT (W1)     FAILED  250     2024-10-02 20:06:52.362  14m 16s 14m 15s -       -       -       -       -
execution_trace_2024-10-04_11-45-20.txt:7       7a/40325c       769771  NFCORE_MAG:MAG:MEGAHIT (W1)     FAILED  247     2024-10-04 11:45:37.900  15h 11m 18s     15h 11m 18s     -       -       -       -       -

In the nf-core/mag command I added the -resume flag, and I also tried to point the cache directory at the run folder where MEGAHIT was cached (execution_trace_2024-08-28_17-27-28.txt:6), using `export NXF_SINGULARITY_CACHEDIR=/path/to/sample/W1/work/9e/82eb6bc3b971b545bb014c51400678`. However, this does not seem to work either: the new execution trace shows that MEGAHIT still failed and that nothing was picked up from the cache.

So how can I assign the right cache dir, given that the environment variable NXF_SINGULARITY_CACHEDIR does not seem to work for me? In addition, I have around 20 more samples for which I would need to do the same. Is there a way to automate assigning the proper directory for multiple cached runs, choosing the cache corresponding to the most complete previous run?

Overall, how can I point each nf-core/mag run at the most complete cache folder for each specific sample?
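For context, the kind of automation I have in mind would look roughly like this. This is only a sketch: the paths are the same illustrative placeholders as in my command above, and it assumes each sample's original run was launched from its own `/path/to/sample/W<i>/output` directory (where Nextflow keeps its `.nextflow/` history and `work/` directory, i.e. the -resume cache):

```shell
#!/usr/bin/env bash
# Sketch only: resume nf-core/mag per sample by launching from each sample's
# original run directory, where .nextflow/ and work/ (the -resume cache) live.
# All paths are illustrative placeholders.
plan=""
for i in $(seq 1 20); do
    rundir="/path/to/sample/W${i}/output"   # directory the original run was launched from
    cmd="nextflow run nf-core/mag -r 3.0.3"
    cmd="${cmd} --input /path/to/sample/W${i}/input_files/samplesheet_test.csv"
    cmd="${cmd} --outdir /path/to/sample/W${i}/results"
    cmd="${cmd} -profile singularity --multiqc_title multiqc_sample_W${i}"
    cmd="${cmd} -c /path/to/sample/configs.conf"
    cmd="${cmd} -params-file /path/to/sample/nf-params_mags.json -resume"
    plan="${plan}cd ${rundir} && ${cmd}
"
    # To actually launch rather than just plan, uncomment:
    # ( cd "${rundir}" && ${cmd} )
done
printf '%s' "$plan"   # dry run: print the planned per-sample commands
```

If a sample directory contains several past runs, `nextflow log` (run from that directory) lists them with their run names and session IDs, and passing one of those to `-resume` resumes that specific earlier run rather than the latest one.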

Thank you very much.

Command used and terminal output

-----> Run command:

I am running the command from this directory: /path/to/sample/W1/output

echo "running nextflow nfcore/mag, resuming"
# Run nf-core
nextflow run nf-core/mag -r 3.0.3 --input /path/to/sample/W${1}/input_files/samplesheet_test.csv \
    --outdir /path/to/sample/W${1}/results \
    -profile singularity \
    --multiqc_title multiqc_sample_W${1} \
    -c /path/to/sample/configs.conf \
    -params-file /path/to/sample/nf-params_mags.json \
    -resume

-----> Error (Megahit is not cached):

executor >  local (13)
[44/3f9539] NFC…G:FASTQC_RAW (W1_run0_raw) | 1 of 1 ✔
[-        ] NFCORE_MAG:MAG:CAT_FASTQ       -
[-        ] NFCORE_MAG:MAG:NANOPLOT_RAW    -
[-        ] NFC…_MAG:MAG:PORECHOP_PORECHOP -
[-        ] NFCORE_MAG:MAG:NANOLYSE        -
[-        ] NFCORE_MAG:MAG:FILTLONG        -
[-        ] NFC…_MAG:MAG:NANOPLOT_FILTERED -
[bb/9b213b] NFC… (p_compressed+h+v.tar.gz) | 1 of 1 ✔
[70/8bcbdf] NFC…CENTRIFUGE_CENTRIFUGE (W1) | 1 of 1 ✔
[96/4577f4] NFC…AG:CENTRIFUGE_KREPORT (W1) | 1 of 1 ✔
[14/d01661] NFC…MAG:KRAKEN2_DB_PREPARATION | 1 of 1 ✔
[e2/cdee47] NFC…(W1-minikraken_8GB_202003) | 1 of 1 ✔
[b8/0d075f] NFCORE_MAG:MAG:KRONA_KRONADB   | 1 of 1 ✔
[4d/25f72b] NFC…PORT2KRONA_CENTRIFUGE (W1) | 1 of 1 ✔
[91/8b0c0f] NFC…RONA_KTIMPORTTAXONOMY (W1) | 2 of 2 ✔
[7a/40325c] NFCORE_MAG:MAG:MEGAHIT (W1)    | 1 of 1, failed: 1 ✘
[-        ] NFCORE_MAG:MAG:QUAST           -
[-        ] NFCORE_MAG:MAG:PRODIGAL        -
[-        ] NFC…ION:BOWTIE2_ASSEMBLY_BUILD -
[-        ] NFC…ION:BOWTIE2_ASSEMBLY_ALIGN -
[69/26e8e8] NFC…eria_odb10.2020-03-06.tar) | 1 of 1 ✔
[-        ] NFC…MAG:BUSCO_QC:BUSCO_SUMMARY | 0 of 1
[e2/076eee] NFC… (gtdbtk_r214_data.tar.gz) | 1 of 1 ✔
Plus 27 more processes waiting for tasks…
Execution cancelled -- Finishing pending tasks before exit
-[nf-core/mag] Sent summary e-mail to rodo@izw-berlin.de (sendmail)-
-[nf-core/mag] Pipeline completed with errors-
ERROR ~ Error executing process > 'NFCORE_MAG:MAG:MEGAHIT (W1)'

Caused by:
  Process `NFCORE_MAG:MAG:MEGAHIT (W1)` terminated with an error exit status (247)

Command executed:

  ## Check if we're in the same work directory as a previous failed MEGAHIT run
  if [[ -d MEGAHIT ]]; then
      rm -r MEGAHIT/
  fi

  megahit  -t "3" -m 8589934592 -1 "W1_hq_norm_reads_R1.fq.gz" -2 "W1_hq_norm_reads_R2.fq.gz" -o MEGAHIT --out-prefix "MEGAHIT-W1"

  gzip -c "MEGAHIT/MEGAHIT-W1.contigs.fa" > "MEGAHIT/MEGAHIT-W1.contigs.fa.gz"

  cat <<-END_VERSIONS > versions.yml
  "NFCORE_MAG:MAG:MEGAHIT":
      megahit: $(echo $(megahit -v 2>&1) | sed 's/MEGAHIT v//')
  END_VERSIONS

Command exit status:
  247

Command output:
  (empty)

Command error:
  INFO:    Environment variable SINGULARITYENV_TMPDIR is set, but APPTAINERENV_TMPDIR is preferred
  INFO:    Environment variable SINGULARITYENV_NXF_TASK_WORKDIR is set, but APPTAINERENV_NXF_TASK_WORKDIR is preferred
  INFO:    Environment variable SINGULARITYENV_NXF_DEBUG is set, but APPTAINERENV_NXF_DEBUG is preferred
  2024-10-04 11:45:39 - MEGAHIT v1.2.9
  2024-10-04 11:45:39 - Using megahit_core with POPCNT and BMI2 support
  2024-10-04 11:45:39 - Convert reads to binary library
  2024-10-04 11:59:17 - b'INFO  sequence/io/sequence_lib.cpp  :   75 - Lib 0 (W1_hq_norm_reads_R1.fq.gz,W1_hq_norm_reads_R2.fq.gz): pe, 518686480 reads, 150 max length'
  2024-10-04 11:59:17 - b'INFO  utils/utils.h                 :  152 - Real: 817.8321\tuser: 423.4995\tsys: 119.3686\tmaxrss: 245760'
  2024-10-04 11:59:17 - k-max reset to: 141
  2024-10-04 11:59:17 - Start assembly. Number of CPU threads 3
  2024-10-04 11:59:17 - k list: 21,29,39,59,79,99,119,141
  2024-10-04 11:59:17 - Memory used: 8589934592
  2024-10-04 11:59:17 - Extract solid (k+1)-mers for k = 21
  2024-10-04 23:39:43 - Build graph for k = 21
  2024-10-05 02:56:53 - Error occurs, please refer to MEGAHIT/MEGAHIT-W1.log for detail
  2024-10-05 02:56:53 - Command: /usr/local/bin/megahit_core seq2sdbg --host_mem 8589934592 --mem_flag 1 --output_prefix MEGAHIT/tmp/k21/21 --num_cpu_threads 3 -k 21 --kmer_from 0 --input_prefix MEGAHIT/tmp/k21/21 --need_mercy; Exit code -9

Work dir:
    /path/to/sample/W1/work/7a/40325cb7e0537eb6e0b459c8d7cd2c        

Tip: you can replicate the issue by changing to the process work dir and entering the command `bash .command.run`

 -- Check '.nextflow.log' file for details
ERROR ~ Pipeline failed. Please refer to troubleshooting docs: https://nf-co.re/docs/usage/troubleshooting

 -- Check '.nextflow.log' file for details

Relevant files

nextflow.log

System information

nextflow version 24.04.4.5917 HPC slurm Singularity nf-core/mag 3.0.3

OS information: PRETTY_NAME="Ubuntu 22.04.4 LTS" NAME="Ubuntu" VERSION_ID="22.04" VERSION="22.04.4 LTS (Jammy Jellyfish)" VERSION_CODENAME=jammy ID=ubuntu ID_LIKE=debian

jfy133 commented 3 weeks ago

Hi @pdy1084, there is quite a lot to unpack here, and I think a little bit of conceptual confusion on your side. I hope I can be helpful though:

  1. When you say you are unable to resume with the cache, do you mean you ran once with an older version of mag and then tried to resume with a more recent version (i.e. running the same command but updating to -r 3.0.3), or do you mean you are unable to resume between different failed runs of the same -r 3.0.3 version?

  2. I think conceptually you're mixing up different caches, which may be why it's not resuming correctly:

    • NXF_SINGULARITY_CACHEDIR is not the same as the pipeline run cache.
    • NXF_SINGULARITY_CACHEDIR is for storing the software (container) images used in each step of the pipeline.
    • The `-resume` cache (i.e. the outputs of previous pipeline steps) is stored in the work/ directory. If you `-resume` from the same location where you ran the first command, it should pick this up. If you have changed the Singularity cache, the images will be stored somewhere else and the pipeline might think everything has changed, and will start from scratch.

You cannot/should not change the -w working directory (by default set to wherever you execute your command), for example, as that will cause -resume to break.
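To make the distinction between the two caches concrete, a minimal sketch (the image path here is purely illustrative):

```shell
# Container-image cache: a shared directory of Singularity/Apptainer images,
# reused across pipelines and runs. (Illustrative path, not a work directory.)
export NXF_SINGULARITY_CACHEDIR=/path/to/shared/singularity_images

# The -resume cache is entirely separate and lives in the launch directory:
#   .nextflow/   run history and cache database
#   work/        task work directories (set with -w, but keep it stable)
echo "$NXF_SINGULARITY_CACHEDIR"
```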

Also, I see you have a config file: have you set cleanup = true in it? That would also delete all the files within the working directory and cause -resume to break.
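For reference, the setting to look for in your custom config would be something like this:

```groovy
// In a custom Nextflow config: on successful completion this deletes
// everything under work/, which destroys the -resume cache.
// Remove it (or set it to false) if you want to be able to resume.
cleanup = true
```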

pdy1084 commented 3 weeks ago

Hi @jfy133 ,

Thank you for your feedback about the issue. It is true that I had some confusion about the types of cache and how to manage them to resume the pipeline correctly.

Regarding your questions:

I hope it is now easier to track and address the issue. Please let me know if you need more information.

jfy133 commented 3 weeks ago

Thanks for the clarifications!

To my knowledge, changing the entire version, e.g. by changing -r, will cause Nextflow to consider the entire workflow changed; it cannot trust that particular parts of the pipeline are the same, so it invalidates the cache and will indeed start from scratch.

You'll just have to run from scratch in this case to compare, sorry about that.

jfy133 commented 3 weeks ago

Please feel free to reopen if that's not your observation of the behaviour