harper357 opened 8 months ago
It looks like gcsfuse has supported symlinks for a while: https://github.com/GoogleCloudPlatform/gcsfuse/issues/12 . Maybe there was a regression?
The staging command for Google Batch is defined here:
@harper357 can you give me a directory listing for a task directory when using symlink vs link? Just do an ls -al in the task script. I'm guessing that the symlinked files are just not showing up for some reason.
Sorry, I am a little confused by your ask. I am using a GCS bucket as the workdir (as per the documentation), so I am not sure how I would catch the worker node and SSH in before it crashes.
Are you asking for the GCS directory listing?
I was thinking you could just add ls -al to the process script; then you should see the directory listing in the error message you showed.
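For anyone following along, a minimal sketch of that debugging step (the process name and trailing command are illustrative, not from quantms):

process DEBUG_STAGING {
    input:
    path input_file

    script:
    """
    # print the staged task directory before the real command runs
    ls -al
    # ... the actual command goes here, e.g.
    head ${input_file}
    """
}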
Task 1 (the task that works):
total 20
drwx------ 2 root root 4096 Mar 25 17:26 .
drwxrwxrwt 1 root root 4096 Mar 25 17:26 ..
-rw-r--r-- 1 root root 0 Mar 25 17:26 .command.err
-rw-r--r-- 1 root root 0 Mar 25 17:26 .command.out
lrwxrwxrwx 1 root root 87 Mar 25 17:26 .command.run -> /mnt/disks/[private_bucket]/nextflow/work/7a/2c6aede9e1e5e94c59563adce61992/.command.run
lrwxrwxrwx 1 root root 86 Mar 25 17:26 .command.sh -> /mnt/disks/[private_bucket]/nextflow/work/7a/2c6aede9e1e5e94c59563adce61992/.command.sh
-rw-r--r-- 1 root root 0 Mar 25 17:26 .command.trace
lrwxrwxrwx 1 root root 93 Mar 25 17:26 PXD026600.sdrf.tsv -> /mnt/disks/[private_bucket]/[input_path]/PXD026600.sdrf.tsv
Task 2 (the task that crashes):
total 20
drwx------ 2 root root 4096 Mar 25 17:28 .
drwxrwxrwt 1 root root 4096 Mar 25 17:28 ..
-rw-r--r-- 1 root root 0 Mar 25 17:28 .command.err
-rw-r--r-- 1 root root 0 Mar 25 17:28 .command.out
lrwxrwxrwx 1 root root 87 Mar 25 17:28 .command.run -> /mnt/disks/[private_bucket]/nextflow/work/ca/919bfbc47b8a94c479636857ed9b60/.command.run
lrwxrwxrwx 1 root root 86 Mar 25 17:28 .command.sh -> /mnt/disks/[private_bucket]/nextflow/work/ca/919bfbc47b8a94c479636857ed9b60/.command.sh
-rw-r--r-- 1 root root 0 Mar 25 17:28 .command.trace
lrwxrwxrwx 1 root root 93 Mar 25 17:28 PXD026600.sdrf.tsv -> /mnt/disks/[private_bucket]/nextflow/work/7a/2c6aede9e1e5e94c59563adce61992/PXD026600.sdrf.tsv
How is the [input_path] different in the first example?
It is just a couple of subfolders on GCS. I double-checked that it is actually the correct path. In other words, in Task 1 it points to the file on GCS; in Task 2 it points to the symlink from Task 1.
Small correction: I never pushed the stageInMode='link' change, so the fix that actually worked for me was stageInMode='copy'.
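For anyone applying the same workaround, a minimal sketch of how it can be scoped in nextflow.config (the process selector is a placeholder for whichever process produces the symlinked output):

process {
    // hypothetical selector; replace with the earlier process whose output is a pass-through of its input
    withName: 'INPUT_CHECK' {
        stageInMode = 'copy'
    }
}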
@hnawar @soj-hub Do either of you know anything about symlinks not working with gcsfuse? Could there be a regression?
@bentsherman - I'm not aware of a regression. This seems like something the GCS team should chime in on, so we'll try and loop them in.
We've confirmed that symlinks still work. Does this issue persist, and can you share exact steps to reproduce it?
Sorry, I have been very busy this week.
Like I said in the OP, I am using nf-core/quantms. If you run it with the test.config, the input file (PXD026600.sdrf.tsv) is hosted remotely (not on GS) and tasks 1 and 2 work just fine.
If you instead use test.config and override the input file with a copy saved on GS, task 1 completes just fine, but task 2 fails.
I believe what is happening is that the output of task 1 is an unmodified version of PXD026600.sdrf.tsv, so in task 2 the symlink just points to the symlink from task 1.
From nf-core/quantms, task 1 (INPUT_CHECK):
input:
    path input_file
    val is_sdrf

output:
    path "*.log", emit: log
    path "${input_file}", emit: checked_file
    path "versions.yml", emit: versions
From nf-core/quantms/quantms.nf:
INPUT_CHECK (
    file(params.input)
)
ch_versions = ch_versions.mix(INPUT_CHECK.out.versions)
// TODO: OPTIONAL, you can use nf-validation plugin to create an input channel from the samplesheet with Channel.fromSamplesheet("input")
// See the documentation https://nextflow-io.github.io/nf-validation/samplesheets/fromSamplesheet/
// ! There is currently no tooling to help you write a sample sheet schema
//
// SUBWORKFLOW: Create input channel
//
CREATE_INPUT_CHANNEL (
    INPUT_CHECK.out.ch_input_file,
    INPUT_CHECK.out.is_sdrf
)
I'm also seeing this problem on GCS when a process depends on a file from a previous process which in turn made a symlink to the file on a mounted drive. The workaround of setting stageInMode to copy on the earlier process worked as a fix for me, but it would be nice not to have to do this.
This is also happening for me, just as @archmageirvine describes. I will try stageInMode copy.
I tried stageInMode copy and it worked for me as well. Another workaround that doesn't require copying all the data is to reintroduce the GCS paths with each process that uses them instead of piping them from one process to the next. For example, to join back in reference genome data, I did this instead of using the existing reference symlinks from the previous process:
// Adding reference sequences again for Nextflow GCS symlink bug
ch_ref_by_genus = channel.of(
        ["Arabidopsis", "5", "arabidopsis_genome_id", "gs://genome/path/arabidopsis_genome_id.fasta"],
        ["Solanum", "12", "solanum_genome_id", "gs://genome/path/solanum_genome_id.fasta"]
    )
    .map {
        tuple(
            it[0], it[3], it[3] + ".amb", it[3] + ".ann", it[3] + ".bwt", it[3] + ".pac", it[3] + ".sa", it[3] + ".fai"
        )
    }

ch_bwa_mem_with_ref = ch_bwa_mem.combine(ch_ref_by_genus, by: 0)

// Running modules that need the ref (and can't use the ref from the first module because of the bug)
ch_bedtools_coverage = run_bedtools_coverage(ch_bwa_mem_with_ref)
In my case, the problem was in the method used to get container mounts. By using PathTrie.longest() to mount the longest common paths, symlink targets can sometimes become inaccessible from within the container.
Taking the example above:
Task 1 (the task that works):
total 20
drwx------ 2 root root 4096 Mar 25 17:26 .
drwxrwxrwt 1 root root 4096 Mar 25 17:26 ..
-rw-r--r-- 1 root root 0 Mar 25 17:26 .command.err
-rw-r--r-- 1 root root 0 Mar 25 17:26 .command.out
lrwxrwxrwx 1 root root 87 Mar 25 17:26 .command.run -> /mnt/disks/[private_bucket]/nextflow/work/7a/2c6aede9e1e5e94c59563adce61992/.command.run
lrwxrwxrwx 1 root root 86 Mar 25 17:26 .command.sh -> /mnt/disks/[private_bucket]/nextflow/work/7a/2c6aede9e1e5e94c59563adce61992/.command.sh
-rw-r--r-- 1 root root 0 Mar 25 17:26 .command.trace
lrwxrwxrwx 1 root root 93 Mar 25 17:26 PXD026600.sdrf.tsv -> /mnt/disks/[private_bucket]/[input_path]/PXD026600.sdrf.tsv
Here, the longest common path is /mnt/disks/[private_bucket], so all symlinks work inside the container.
Task 2 (the task that crashes):
total 20
drwx------ 2 root root 4096 Mar 25 17:28 .
drwxrwxrwt 1 root root 4096 Mar 25 17:28 ..
-rw-r--r-- 1 root root 0 Mar 25 17:28 .command.err
-rw-r--r-- 1 root root 0 Mar 25 17:28 .command.out
lrwxrwxrwx 1 root root 87 Mar 25 17:28 .command.run -> /mnt/disks/[private_bucket]/nextflow/work/ca/919bfbc47b8a94c479636857ed9b60/.command.run
lrwxrwxrwx 1 root root 86 Mar 25 17:28 .command.sh -> /mnt/disks/[private_bucket]/nextflow/work/ca/919bfbc47b8a94c479636857ed9b60/.command.sh
-rw-r--r-- 1 root root 0 Mar 25 17:28 .command.trace
lrwxrwxrwx 1 root root 93 Mar 25 17:28 PXD026600.sdrf.tsv -> /mnt/disks/[private_bucket]/nextflow/work/7a/2c6aede9e1e5e94c59563adce61992/PXD026600.sdrf.tsv
Here, the longest common path is /mnt/disks/[private_bucket]/nextflow/work, so the /mnt/disks/[private_bucket]/[input_path]/PXD026600.sdrf.tsv file linked to by /mnt/disks/[private_bucket]/nextflow/work/7a/2c6aede9e1e5e94c59563adce61992/PXD026600.sdrf.tsv is not reachable inside the container.
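To make the failure mode concrete, here is a small illustrative Groovy sketch (not Nextflow's actual PathTrie implementation; the paths are simplified placeholders) showing that the symlink target falls outside the mounted prefix in the second case:

// Paths staged into the failing task (simplified placeholders)
def staged = [
    '/mnt/disks/bucket/nextflow/work/ca/task2/PXD026600.sdrf.tsv',
    '/mnt/disks/bucket/nextflow/work/ca/task2/.command.sh'
]
// What the staged symlink ultimately resolves to (outside the work directory)
def target = '/mnt/disks/bucket/input_path/PXD026600.sdrf.tsv'

// Naive longest-common-prefix over path components, standing in for the mount-point calculation
def common = staged*.tokenize('/').inject { a, b ->
    a.take((0..<Math.min(a.size(), b.size())).takeWhile { a[it] == b[it] }.size())
}
def mount = '/' + common.join('/')

println mount                      // /mnt/disks/bucket/nextflow/work/ca/task2
println target.startsWith(mount)   // false -> the symlink target is not visible inside the container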
I think this issue would be fixed if this bit of code worked properly on Google Cloud (maybe the same issue as #4819). I tried replacing the linked code with return path.toRealPath(), to no avail. The symlinks can only be resolved when the bucket is mounted at /mnt/disks/{bucket} using gcsfuse. This mounting happens automatically in the VMs spawned by Nextflow, but not necessarily on the machine used to run Nextflow, making it impossible for the Nextflow process to resolve gcsfuse symlinks. This sounds like a nightmare to address directly.
Instead, as a quick band-aid fix, I found that adding the containerOptions '--volume /mnt/disks/{bucket}:/mnt/disks/{bucket}' directive to the affected processes was enough to ensure that the whole bucket is always accessible within the container. Is this worth adding to the documentation?
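For reference, a minimal sketch of how that band-aid can be applied from the config rather than editing each module (the selector and bucket name are placeholders):

process {
    // hypothetical selector; apply to whichever processes consume the symlinked files
    withName: 'AFFECTED_PROCESS' {
        containerOptions = '--volume /mnt/disks/my-bucket:/mnt/disks/my-bucket'
    }
}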
Bug report
Expected behavior and actual behavior
When running a pipeline (in this instance nf-core/quantms) using input files hosted on GCS and a workdir on GCS, Nextflow attempts to stage the files using a symlink, but GCS doesn't seem to support them.
Steps to reproduce the problem
Using the nf-core/quantms pipeline and the test_dia profile, tasks 1 and 2 work fine.
Now host the seed file from the test_dia profile on GCS and run the pipeline with it as the input: task 1 completes, but task 2 fails. The error report states that the input file for task 2 is missing.
Adding in stageInMode = 'link' fixes the issue.
Program output
Environment
Launched from Tower to Google Batch
Additional context
Update: It looks like I forgot to push the stageInMode = 'link' commit before I did my final test, so what I actually tested was that stageInMode = 'copy' fixes the issue.