Different Singularity behaviours between 22.10.7 vs 23.04.0+

HarryHung commented 2 months ago

Bug report

When users try to run this pipeline with Singularity, it works on Nextflow 22.10.7 and before, but fails on Nextflow 23.04.0 and later (including the latest release).

The failure happens when srst2 inside the container runs Bowtie2, which attempts to create FIFO files under /tmp via mkfifo. Because Singularity mounts the host /tmp by default, when multiple processes are running srst2, all of them write their FIFO files to the host /tmp directory.

In Nextflow 22.10.7 and before, the FIFO files have longer file names, e.g. 124813.inpipe1, 124813.inpipe2, 124874.inpipe1, 124874.inpipe2, 124964.inpipe1, 124964.inpipe2, and all is well.

However, in Nextflow 23.04.0 and later, the FIFO files have much shorter file names, e.g. 61.inpipe1, 61.inpipe2, 62.inpipe1, 62.inpipe2, 63.inpipe1, 63.inpipe2. The relevant processes soon crash due to what I think is a file name conflict, and the error looks like this:

(ERR): mkfifo(/tmp/62.inpipe1) failed.
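A minimal sketch of the suspected failure mode, assuming the crash really is a file name collision: mkfifo refuses to create a FIFO at a path that already exists. The scratch directory and variable names are illustrative only; the file name is copied from the error message.

```shell
# Work in a private scratch directory so the demo never touches real /tmp entries.
demo_dir=$(mktemp -d)
fifo="$demo_dir/62.inpipe1"   # illustrative name taken from the error message

mkfifo "$fifo"                # first creation succeeds

# A second mkfifo on the same path fails because the file already exists.
if mkfifo "$fifo" 2>/dev/null; then
    second_attempt="succeeded"
else
    second_attempt="failed"
fi
echo "second mkfifo $second_attempt"

rm -rf "$demo_dir"
```

If the short names seen under newer Nextflow versions repeat across tasks or runs, any FIFO left behind in the shared /tmp would trigger this kind of failure.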

The only thing that changed between my tests is the Nextflow executable version, nothing else. I compared the .command.run and .command.sh files between runs, and they look identical (except for the work dir paths).

At this point, I am wondering whether some hidden environment variables changed between these Nextflow versions in a way that affects the behaviour of Singularity.

This seems to be a related issue: https://github.com/nf-core/taxprofiler/issues/422

I am able to work around the issue by forcing each container to use a different subdirectory in /tmp by adding

singularity.runOptions = '-B $(mktemp -d):/tmp'

to nextflow.config
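A rough shell illustration of why this workaround helps, assuming the single-quoted runOptions string reaches the generated task script unexpanded so that $(mktemp -d) is evaluated once per task. Each evaluation yields a fresh, empty directory, so the same FIFO name can be created in each without clashing. The variable names are hypothetical.

```shell
# Each call to mktemp -d returns a new, empty directory.
dir_a=$(mktemp -d)
dir_b=$(mktemp -d)

# These are the bind arguments two separate tasks would end up with.
echo "task A: -B $dir_a:/tmp"
echo "task B: -B $dir_b:/tmp"

# The same FIFO name can now exist once per directory without a clash.
same_name_ok="no"
mkfifo "$dir_a/62.inpipe1"
mkfifo "$dir_b/62.inpipe1" && same_name_ok="yes"
echo "same name in both directories: $same_name_ok"

rm -rf "$dir_a" "$dir_b"
```

Note that the directories created this way are not removed automatically, so they may accumulate on the host over many tasks.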

Expected behavior and actual behavior

Expected: Singularity should always behave the same regardless of Nextflow version.
Actual: Nextflow 23.04.0 and later introduce a failure that does not occur on 22.10.7.

Steps to reproduce the problem

Run the pipeline with -profile sanger

Program output

N/A

Environment

Nextflow version: 22.10.7, 23.04.0, 24.04.4
Java version: OpenJDK 11.04.24, OpenJDK 17.0.12
Operating system: Ubuntu 22.04, Ubuntu 22.04.5
Bash version: GNU bash version 5.1.16(1)-release, zsh 5.8.1 (x86_64-ubuntu-linux-gnu)

Additional context

N/A

pditommaso commented 2 months ago

Which executor are you using?

HarryHung commented 2 months ago

Same issue on both local and LSF.

pditommaso commented 2 months ago

If you cd into the task work directory and execute bash .command.run, is the same issue reported?

HarryHung commented 2 months ago

The same issue does NOT occur when I execute bash .command.run within the task work directory (further investigation below the error and output).

This is the original error

ERROR ~ Error executing process > 'GBS_RES:srst2_for_res_typing (1)'

Caused by:
  Missing output file(s) `test*.bam` expected by process `GBS_RES:srst2_for_res_typing (1)`

Command executed:

  srst2 --samtools_args '\-A' --input_pe test_1.fastq.gz test_2.fastq.gz --output test --log --save_scores --min_coverage 99.9 --max_divergence 5 --gene_db GBS_Res_Gene-DB_Final.fasta

  touch test__fullgenes__GBS_Res_Gene-DB_Final__results.txt

Command exit status:
  0

Command output:
    bucket 7: 30%
    bucket 7: 40%
    bucket 7: 50%
    bucket 7: 60%
    bucket 7: 70%
    bucket 7: 80%
    bucket 7: 90%
    bucket 7: 100%
    Sorting block of length 114 for bucket 7
    (Using difference cover)
    Sorting block time: 00:00:00
  Returning block of 115 for bucket 7
  Exited Ebwt loop
  fchr[A]: 0
  fchr[C]: 444
  fchr[G]: 715
  fchr[T]: 1040
  fchr[$]: 1463
  Exiting Ebwt::buildToDisk()
  Returning from initFromVector
  Wrote 4195680 bytes to primary EBWT file: GBS_Res_Gene-DB_Final.fasta.rev.1.bt2
  Wrote 372 bytes to secondary EBWT file: GBS_Res_Gene-DB_Final.fasta.rev.2.bt2
  Re-opening _in1 and _in2 as input streams
  Returning from Ebwt constructor
  Headers:
      len: 1463
      bwtLen: 1464
      sz: 366
      bwtSz: 366
      lineRate: 6
      offRate: 4
      offMask: 0xfffffff0
      ftabChars: 10
      eftabLen: 20
      eftabSz: 80
      ftabLen: 1048577
      ftabSz: 4194308
      offsLen: 92
      offsSz: 368
      lineSz: 64
      sideSz: 64
      sideBwtSz: 48
      sideBwtLen: 192
      numSides: 8
      numLines: 8
      ebwtTotLen: 512
      ebwtTotSz: 512
      color: 0
      reverse: 1
  Total time for backward call to driver() for mirror index: 00:00:00

Command error:
  WARNING: DEPRECATED USAGE: Forwarding SINGULARITYENV_NXF_TASK_WORKDIR as environment variable will not be supported in the future, use APPTAINERENV_NXF_TASK_WORKDIR instead
  Building a SMALL index
  (ERR): mkfifo(/tmp/62.inpipe1) failed.
  Exiting now ...

Work dir:
  /home/ubuntu/local-repo/GBS-Typer-sanger-nf/work/c4/55eacec7ba5627ef369b11d433e025

Tip: you can replicate the issue by changing to the process work dir and entering the command `bash .command.run`

 -- Check '.nextflow.log' file for details

This is the bash .command.run output

❯ cd /home/ubuntu/local-repo/GBS-Typer-sanger-nf/work/c4/55eacec7ba5627ef369b11d433e025
❯ bash .command.run

WARNING: DEPRECATED USAGE: Forwarding SINGULARITYENV_NXF_TASK_WORKDIR as environment variable will not be supported in the future, use APPTAINERENV_NXF_TASK_WORKDIR instead
1224227 reads; of these:
  1224227 (100.00%) were paired; of these:
    1224187 (100.00%) aligned concordantly 0 times
    40 (0.00%) aligned concordantly exactly 1 time
    0 (0.00%) aligned concordantly >1 times
    ----
    1224187 pairs aligned concordantly 0 times; of these:
      174 (0.01%) aligned discordantly 1 time
    ----
    1224013 pairs aligned 0 times concordantly or discordantly; of these:
      2448026 mates make up the pairs; of these:
        2444553 (99.86%) aligned 0 times
        3473 (0.14%) aligned exactly 1 time
        0 (0.00%) aligned >1 times
0.16% overall alignment rate
[samopen] SAM header is present: 19 sequences.
[mpileup] 1 samples in 1 input files
<mpileup> Set max per-file depth to 8000

Additional observations:

I notice that if I run bash .command.run with the script generated by Nextflow 22.10.7, the FIFO files under /tmp always have 4 to 5 digit names, while the script generated by Nextflow 24.04.4 produces FIFO files with 2 digit names.
My previous assumption that the error is caused by multiple concurrent processes seems to be incorrect, as the same error still happens with executor.queueSize = 1 in nextflow.config. But somehow singularity.runOptions = '-B $(mktemp -d):/tmp' avoids this error. Maybe the issue is not a name conflict between concurrent processes, but that later processes are somehow unaware of the existing content in /tmp?
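One way to probe the "existing content in /tmp" hypothesis is to look for FIFOs left behind on the host. This is only a diagnostic sketch: the helper name list_stale_fifos is hypothetical, and it assumes a find that supports -maxdepth (GNU and BSD find both do).

```shell
# List named pipes matching the *.inpipe* pattern directly under a directory.
# Defaults to /tmp when no directory is given.
list_stale_fifos() {
    find "${1:-/tmp}" -maxdepth 1 -type p -name '*.inpipe*' 2>/dev/null
}

# Show any leftover FIFOs in the host /tmp.
list_stale_fifos /tmp
```

If this prints entries such as /tmp/62.inpipe1 while no task is running, a later task that picks the same short name would hit the mkfifo error even with executor.queueSize = 1.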

I am not sure what is happening, and please let me know if you need more information.

You can test it out by cloning https://github.com/sanger-bentley-group/GBS-Typer-sanger-nf.git (please test commit d98cb52, as the workaround might be in place for later commits/versions) and, with Singularity installed, running nextflow run main.nf --reads 'tests/regression_test_data/input_data/*_{1,2}.fastq.gz' --results_dir output -profile sanger to reproduce the error.

pditommaso commented 2 months ago

Very hard to help without a test case to replicate the issue.

HarryHung commented 2 months ago

Hi @pditommaso , the last bit of my latest message contains a test case. Thanks!

You can test it out by cloning https://github.com/sanger-bentley-group/GBS-Typer-sanger-nf.git (please test commit d98cb52, as the workaround might be in place for later commits/versions) and, with Singularity installed, running nextflow run main.nf --reads 'tests/regression_test_data/input_data/*_{1,2}.fastq.gz' --results_dir output -profile sanger to reproduce the error.

pditommaso commented 2 months ago

I cannot pull the full pipeline execution. I need a self-contained test case with a single task that reproduces this error using the local executor and Slurm.

HarryHung commented 2 months ago

Sure, I have put together a minimal test case that allows you to replicate the issue with the local executor. It should complete without error with Nextflow 22.10.7, but fail with Nextflow 24.04.4.

You will still need to grab a few essential input data files by downloading the data directory of this self-contained test case: https://drive.google.com/drive/folders/1XWpx8mU9hQHuCQ6zE-7JOdGsxxABiAqE?usp=sharing

main.nf

process srst2_for_gbs_res_typing {
    input:
    tuple val(pair_id), file(reads) // ID and paired read files
    path db // File of resistance database file

    output:
    tuple val(pair_id), file("${pair_id}*.bam"), emit: bam_files

    script:
    """
    srst2 --samtools_args '\\-A' --input_pe ${reads[0]} ${reads[1]} --output ${pair_id} --log --save_scores --min_coverage 99.9 --max_divergence 5 --gene_db ${db}
    """
}

process srst2_for_other_res_typing {
    input:
    tuple val(pair_id), file(reads) // ID and paired read files
    path db // File of resistance database file

    output:
    tuple val(pair_id), file("${pair_id}*.bam"), emit: bam_files

    script:
    """
    srst2 --samtools_args '\\-A' --input_pe ${reads[0]} ${reads[1]} --output ${pair_id} --log --save_scores --min_coverage 70 --max_divergence 30 --gene_db ${db}
    """
}

workflow {
    Channel.fromFilePairs( 'data/*_{1,2}.fastq.gz', checkIfExists: true )
        .set { read_pairs_ch }

    gbs_res_typer_db = channel.fromPath('data/GBS_Res_Gene-DB_Final.fasta', checkIfExists: true)
    other_res_db = channel.fromPath('data/ResFinder.fasta', checkIfExists: true)

    srst2_for_gbs_res_typing(read_pairs_ch, gbs_res_typer_db)
    srst2_for_other_res_typing(read_pairs_ch, other_res_db)
}

nextflow.config

process.container = 'bluemoon222/gbs-typer-sanger-nf:0.0.7'

singularity {
    enabled = true
    autoMounts = true
    cacheDir = "$PWD"
}
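To compare the two releases on this test case, something like the following helper could be used. It is a sketch only: the function name run_test_case and the NEXTFLOW_BIN override are hypothetical, and it assumes the standard nextflow launcher, which uses the NXF_VER environment variable to select the release to run.

```shell
# Run the test case in the current directory under a given Nextflow release.
# $1: Nextflow version, passed to the launcher via NXF_VER.
# NEXTFLOW_BIN (hypothetical override) lets you point at a specific launcher.
run_test_case() {
    NXF_VER="$1" "${NEXTFLOW_BIN:-nextflow}" run main.nf
}
```

Calling run_test_case 22.10.7 and then run_test_case 24.04.4 from the directory containing main.nf, nextflow.config and data should complete in the first case and fail with the mkfifo error in the second, according to the report above.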
