nextflow-io / nextflow

A DSL for data-driven computational pipelines
http://nextflow.io
Apache License 2.0

GCS doesn't support symlinks, so Google Batch executor should use hardlinks #4845

Open harper357 opened 8 months ago

harper357 commented 8 months ago

Bug report


Expected behavior and actual behavior

When running a pipeline (in this instance nf-core/quantms) using input files hosted on GCS and a workdir on GCS, Nextflow attempts to stage the files using a symlink, but GCS doesn't seem to support them.

Steps to reproduce the problem

Using the nf-core/quantms pipeline and test_dia profile, tasks 1 and 2 work fine.

Now try hosting the seed file from the test_dia profile on GCS and running the pipeline with it as the input: task 1 completes, but task 2 fails. The error report states that the input file for task 2 is missing.

Adding in stageInMode = 'link' fixes the issue.
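
For context, the override can go in nextflow.config. This is a sketch of the workaround, not an official fix (note that a later comment in this thread corrects the working value from 'link' to 'copy'):

    // nextflow.config -- workaround sketch, assuming inputs and workdir on GCS.
    // 'link' creates hard links, which require inputs and workdir on the same
    // filesystem; 'copy' duplicates each input into the task directory.
    process {
        stageInMode = 'copy'
    }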

Program output

Error executing process > 'NFCORE_QUANTMS:QUANTMS:CREATE_INPUT_CHANNEL:SDRFPARSING (PXD026600.sdrf.tsv)'

Caused by:
  Process `NFCORE_QUANTMS:QUANTMS:CREATE_INPUT_CHANNEL:SDRFPARSING (PXD026600.sdrf.tsv)` terminated with an error exit status (1)

Command executed:

  ## -t2 since the one-table format parser is broken in OpenMS2.5
  ## -l for legacy behavior to always add sample columns

  parse_sdrf convert-openms \
      -t2 -l \
      --extension_convert raw:mzML,.gz:,.tar.gz:,.tar:,.zip: \
      -s PXD026600.sdrf.tsv \
       \
      2>&1 | tee PXD026600.sdrf_parsing.log

  mv openms.tsv PXD026600.sdrf_config.tsv
  mv experimental_design.tsv PXD026600.sdrf_openms_design.tsv

  cat <<-END_VERSIONS > versions.yml
  "NFCORE_QUANTMS:QUANTMS:CREATE_INPUT_CHANNEL:SDRFPARSING":
      sdrf-pipelines: $(parse_sdrf --version 2>&1 | awk -F ' ' '{print $2}')
  END_VERSIONS

Command exit status:
  1

Command output:
      OpenMS().openms_convert(sdrf, onetable, legacy, verbose, conditionsfromcolumns, extension_convert)
    File "/usr/local/lib/python3.11/site-packages/sdrf_pipelines/openms/openms.py", line 242, in openms_convert
      sdrf = pd.read_table(sdrf_file)
             ^^^^^^^^^^^^^^^^^^^^^^^^
    File "/usr/local/lib/python3.11/site-packages/pandas/io/parsers/readers.py", line 1282, in read_table
      return _read(filepath_or_buffer, kwds)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    File "/usr/local/lib/python3.11/site-packages/pandas/io/parsers/readers.py", line 611, in _read
      parser = TextFileReader(filepath_or_buffer, **kwds)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    File "/usr/local/lib/python3.11/site-packages/pandas/io/parsers/readers.py", line 1448, in __init__
      self._engine = self._make_engine(f, self.engine)
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    File "/usr/local/lib/python3.11/site-packages/pandas/io/parsers/readers.py", line 1705, in _make_engine
      self.handles = get_handle(
                     ^^^^^^^^^^^
    File "/usr/local/lib/python3.11/site-packages/pandas/io/common.py", line 863, in get_handle
      handle = open(
               ^^^^^
  FileNotFoundError: [Errno 2] No such file or directory: 'PXD026600.sdrf.tsv'

Environment

Launched from Tower to Google Batch

Additional context


Update: It looks like I forgot to push the stageInMode = 'link' commit before I ran my final test, so what I actually verified is that stageInMode = 'copy' fixes the issue.

bentsherman commented 8 months ago

It looks like gcsfuse has supported symlinks for a while: https://github.com/GoogleCloudPlatform/gcsfuse/issues/12 . Maybe there was a regression?

The staging command for google batch is defined here:

https://github.com/nextflow-io/nextflow/blob/82de4bfe726da274999cb6a5e666320df2a6f18d/modules/nextflow/src/main/groovy/nextflow/executor/SimpleFileCopyStrategy.groovy#L216-L231

@harper357 can you give me a directory listing for a task directory when using symlink vs link? Just do an ls -al in the task script. I'm guessing that symlink'ed files are just not showing up for some reason

harper357 commented 8 months ago

Sorry, I am a little confused by your ask. I am using a GCS bucket as the workdir (as per the documentation), so I am not sure how I would catch the worker node and ssh in before it crashes.

Are you asking for the GCS directory listing?


bentsherman commented 8 months ago

I was thinking you could just add ls -al to the process script; then you should see the directory listing in the error output you showed.
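
For example, a minimal sketch of what that debugging step could look like (the process name is a placeholder; the parse_sdrf command is abbreviated from the original error report above):

    // Sketch only -- not part of nf-core/quantms. The listing is written
    // to stdout, so it appears in the task's .command.out.
    process SDRFPARSING_DEBUG {
        input:
        path sdrf_file

        script:
        """
        ls -al    # show what was actually staged before the real command runs
        parse_sdrf convert-openms -s ${sdrf_file}
        """
    }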

harper357 commented 8 months ago

Task 1 (the task that works):

total 20
drwx------ 2 root root 4096 Mar 25 17:26 .
drwxrwxrwt 1 root root 4096 Mar 25 17:26 ..
-rw-r--r-- 1 root root    0 Mar 25 17:26 .command.err
-rw-r--r-- 1 root root    0 Mar 25 17:26 .command.out
lrwxrwxrwx 1 root root   87 Mar 25 17:26 .command.run -> /mnt/disks/[private_bucket]/nextflow/work/7a/2c6aede9e1e5e94c59563adce61992/.command.run
lrwxrwxrwx 1 root root   86 Mar 25 17:26 .command.sh -> /mnt/disks/[private_bucket]/nextflow/work/7a/2c6aede9e1e5e94c59563adce61992/.command.sh
-rw-r--r-- 1 root root    0 Mar 25 17:26 .command.trace
lrwxrwxrwx 1 root root   93 Mar 25 17:26 PXD026600.sdrf.tsv -> /mnt/disks/[private_bucket]/[input_path]/PXD026600.sdrf.tsv

Task 2 (the task that crashes):

total 20
drwx------ 2 root root 4096 Mar 25 17:28 .
drwxrwxrwt 1 root root 4096 Mar 25 17:28 ..
-rw-r--r-- 1 root root    0 Mar 25 17:28 .command.err
-rw-r--r-- 1 root root    0 Mar 25 17:28 .command.out
lrwxrwxrwx 1 root root   87 Mar 25 17:28 .command.run -> /mnt/disks/[private_bucket]/nextflow/work/ca/919bfbc47b8a94c479636857ed9b60/.command.run
lrwxrwxrwx 1 root root   86 Mar 25 17:28 .command.sh -> /mnt/disks/[private_bucket]/nextflow/work/ca/919bfbc47b8a94c479636857ed9b60/.command.sh
-rw-r--r-- 1 root root    0 Mar 25 17:28 .command.trace
lrwxrwxrwx 1 root root   93 Mar 25 17:28 PXD026600.sdrf.tsv -> /mnt/disks/[private_bucket]/nextflow/work/7a/2c6aede9e1e5e94c59563adce61992/PXD026600.sdrf.tsv

bentsherman commented 8 months ago

How is the [input_path] different in the first example?

harper357 commented 8 months ago

It is just a couple of subfolders on GCS. I double-checked that it is actually the correct path. In other words, in Task 1 it points to the file on GCS; in Task 2 it points to the symlink from Task 1.

harper357 commented 8 months ago

Small correction: I never pushed stageInMode = 'link', so the setting that worked for me was stageInMode = 'copy'.

bentsherman commented 7 months ago

@hnawar @soj-hub Do either of you know anything about symlinks not working with gcsfuse? Could there be a regression?

soj-hub commented 7 months ago

@bentsherman - I'm not aware of a regression. This seems like something the GCS team should chime in on, so we'll try and loop them in.

soj-hub commented 7 months ago

We've confirmed that symlinks still work. Does this issue persist and can exact steps to reproduce the issue be shared?

harper357 commented 7 months ago

Sorry, I have been very busy this week.

Like I said in the OP, I am using nf-core/quantms. If you run it with test.config, the input file (PXD026600.sdrf.tsv) is remotely hosted (not on GCS) and tasks 1 and 2 work just fine.

If you instead use test.config and override the input file with a copy saved on GCS, task 1 completes just fine, but task 2 fails.

I believe what is happening is that the output of task 1 is an unmodified copy of PXD026600.sdrf.tsv, so in task 2 the symlink just points to the symlink from task 1.
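
That pass-through pattern can be boiled down to a hypothetical two-process sketch: the first process re-emits its staged input (a symlink), so the second process's staged input becomes a symlink to a symlink, and the chain dangles whenever the ultimate target is not visible on the second task's node or container:

    // Hypothetical minimal reproduction -- process names are made up.
    process CHECK {
        input:
        path input_file

        output:
        path "${input_file}"    // re-emits the staged file, i.e. the symlink

        script:
        "true"
    }

    process PARSE {
        input:
        path input_file         // staged as a symlink to CHECK's symlink

        script:
        "cat ${input_file}"     // fails if the chained target isn't reachable
    }

    workflow {
        PARSE(CHECK(file(params.input)))
    }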

From nf-core/quantms task 1 (INPUT_CHECK):

    input:
    path input_file
    val is_sdrf

    output:
    path "*.log", emit: log
    path "${input_file}", emit: checked_file
    path "versions.yml", emit: versions

From nf-core/quantms/quantms.nf:

    INPUT_CHECK (
        file(params.input)
    )
    ch_versions = ch_versions.mix(INPUT_CHECK.out.versions)
    // TODO: OPTIONAL, you can use nf-validation plugin to create an input channel from the samplesheet with Channel.fromSamplesheet("input")
    // See the documentation https://nextflow-io.github.io/nf-validation/samplesheets/fromSamplesheet/
    // ! There is currently no tooling to help you write a sample sheet schema

    //
    // SUBWORKFLOW: Create input channel
    //
    CREATE_INPUT_CHANNEL (
        INPUT_CHECK.out.ch_input_file,
        INPUT_CHECK.out.is_sdrf
    )

archmageirvine commented 5 months ago

I'm also seeing this problem on GCS when a process depends on a file from a previous process, which in turn made a symlink to the file on a mounted drive. The workaround of setting stageInMode to copy on the earlier process worked as a fix for me, but it would be nice not to have to do this.
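
The per-process form of that workaround would look something like this in nextflow.config (the selector name is a placeholder for whichever earlier process emits the symlinked file):

    // nextflow.config -- sketch; set stageInMode on the *earlier* process,
    // so that its output is a real file rather than a symlink and any
    // downstream symlink points at actual data.
    process {
        withName: 'INPUT_CHECK' {
            stageInMode = 'copy'
        }
    }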

cedarwarman commented 5 months ago

Also happening for me, just as @archmageirvine describes. I will try stageInMode copy.

cedarwarman commented 5 months ago

I tried stageInMode copy and it worked for me as well. Another workaround that doesn't require copying all the data is to reintroduce the GCS paths with each process that uses them instead of piping them from one process to the next. For example, to join back in reference genome data I did this instead of using the existing reference symlinks from the previous process:

    // Adding reference sequences again for Nextflow GCS symlink bug
    ch_ref_by_genus = channel.of(
        ["Arabidopsis", "5", "arabidopsis_genome_id", "gs://genome/path/arabidopsis_genome_id.fasta"],
        ["Solanum", "12", "solanum_genome_id", "gs://genome/path/solanum_genome_id.fasta"]
    )
    .map {
        tuple(
            it[0], it[3], it[3] + ".amb", it[3] + ".ann", it[3] + ".bwt", it[3] + ".pac", it[3] + ".sa", it[3] + ".fai"
        )
    }
    ch_bwa_mem_with_ref = ch_bwa_mem.combine(ch_ref_by_genus, by: 0)

    // Running modules that need the ref (and can't use the ref from the first module because of the bug)
    ch_bedtools_coverage = run_bedtools_coverage(ch_bwa_mem_with_ref)

thalassemia commented 3 days ago

In my case, the problem was in the method used to get container mounts. By using PathTrie.longest() to mount the longest common paths, symlink targets can sometimes become inaccessible from within the container.

Taking the example above:

Task 1 (the task that works):

total 20
drwx------ 2 root root 4096 Mar 25 17:26 .
drwxrwxrwt 1 root root 4096 Mar 25 17:26 ..
-rw-r--r-- 1 root root    0 Mar 25 17:26 .command.err
-rw-r--r-- 1 root root    0 Mar 25 17:26 .command.out
lrwxrwxrwx 1 root root   87 Mar 25 17:26 .command.run -> /mnt/disks/[private_bucket]/nextflow/work/7a/2c6aede9e1e5e94c59563adce61992/.command.run
lrwxrwxrwx 1 root root   86 Mar 25 17:26 .command.sh -> /mnt/disks/[private_bucket]/nextflow/work/7a/2c6aede9e1e5e94c59563adce61992/.command.sh
-rw-r--r-- 1 root root    0 Mar 25 17:26 .command.trace
lrwxrwxrwx 1 root root   93 Mar 25 17:26 PXD026600.sdrf.tsv -> /mnt/disks/[private_bucket]/[input_path]/PXD026600.sdrf.tsv

Here, the longest common path is /mnt/disks/[private_bucket] so all symlinks work inside the container.

Task 2 (the task that crashes):

total 20
drwx------ 2 root root 4096 Mar 25 17:28 .
drwxrwxrwt 1 root root 4096 Mar 25 17:28 ..
-rw-r--r-- 1 root root    0 Mar 25 17:28 .command.err
-rw-r--r-- 1 root root    0 Mar 25 17:28 .command.out
lrwxrwxrwx 1 root root   87 Mar 25 17:28 .command.run -> /mnt/disks/[private_bucket]/nextflow/work/ca/919bfbc47b8a94c479636857ed9b60/.command.run
lrwxrwxrwx 1 root root   86 Mar 25 17:28 .command.sh -> /mnt/disks/[private_bucket]/nextflow/work/ca/919bfbc47b8a94c479636857ed9b60/.command.sh
-rw-r--r-- 1 root root    0 Mar 25 17:28 .command.trace
lrwxrwxrwx 1 root root   93 Mar 25 17:28 PXD026600.sdrf.tsv -> /mnt/disks/[private_bucket]/nextflow/work/7a/2c6aede9e1e5e94c59563adce61992/PXD026600.sdrf.tsv

Here, the longest common path is /mnt/disks/[private_bucket]/nextflow/work so the /mnt/disks/[private_bucket]/[input_path]/PXD026600.sdrf.tsv file linked to by /mnt/disks/[private_bucket]/nextflow/work/7a/2c6aede9e1e5e94c59563adce61992/PXD026600.sdrf.tsv is not reachable inside the container.

I think this issue would be fixed if this bit of code worked properly on Google Cloud (maybe same issue as #4819). I tried replacing the linked code with return path.toRealPath() to no avail. The symlinks can only be resolved when the bucket is mounted at /mnt/disks/{bucket} using gcsfuse. This mounting happens automatically in VMs spawned by Nextflow but not necessarily in the machine used to run Nextflow, making it impossible for the Nextflow process to resolve gcsfuse symlinks. This sounds like a nightmare to address directly.

Instead, as a quick band-aid fix, I found that adding the containerOptions '--volume /mnt/disks/{bucket}:/mnt/disks/{bucket}' directive to the affected processes was enough to ensure the whole bucket is always accessible within the container. Is this worth adding to the documentation?
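
As a sketch, that directive could be scoped in nextflow.config like this ({bucket} is left as a placeholder as in the comment above, and AFFECTED_PROCESS is hypothetical):

    // nextflow.config -- band-aid sketch: bind-mount the whole bucket into
    // the container so every symlink target stays reachable, regardless of
    // which common path the mount logic selects.
    process {
        withName: 'AFFECTED_PROCESS' {
            containerOptions = '--volume /mnt/disks/{bucket}:/mnt/disks/{bucket}'
        }
    }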