snakemake / snakemake-storage-plugin-gcs

A Snakemake storage plugin for Google Cloud Storage
MIT License

Parsing of gs:// remote is mangled #1

Closed vsoch closed 10 months ago

vsoch commented 11 months ago

I'm new to the storage interface so apologies in advance! I'm setting up an example to test, and my Snakefile is a derivative of the MPI variant:

# https://github.com/snakemake/snakemake/blob/main/tests/test_slurm_mpi/Snakefile
# Note that in reality, the mpi, account, and partition resources should be specified
# via --default-resources, in order to keep such infrastructure specific details out of the
# workflow definition.

localrules:
    all,
    clean,

rule all:
    input:
        "gs://snakemake-cache-dinosaur/intel-mpi/pi.calc",

rule clean:
    shell:
        "rm -f pi.calc"

rule compile:
    input:
        "gs://snakemake-cache-dinosaur/intel-mpi/pi_MPI.c",
    output:
        temp("pi_MPI"),
    log:
        "logs/compile.log",
    resources:
        mem_mb=0,
    shell:
        "mpicc -o {output} {input} &> {log}"

rule calc_pi:
    input:
        "pi_MPI",
    output:
        "pi.calc",
    log:
        "logs/calc_pi.log",
    resources:
        mem_mb=0,
        tasks=1,
        mpi="mpiexec",
    shell:
        "{resources.mpi} -n {resources.tasks} {input} 10 > {output} 2> {log}"

When execution hits the function here (and the URI is determined not to be valid), self.query is already strangely mangled:

In [1]: self.query
Out[1]: '/gs:/snakemake-cache-dinosaur/intel-mpi/pi.calc'
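The mangled value looks like the result of treating the URL as a filesystem path. A minimal sketch (not Snakemake's actual code) that reproduces the same output: joining with the working-directory root adds the leading slash, and `os.path.normpath` collapses the `//` in the scheme:

```python
import os

# Illustrative only: treating a gs:// URL as a POSIX path reproduces
# the mangling seen in self.query.
url = "gs://snakemake-cache-dinosaur/intel-mpi/pi.calc"
mangled = os.path.normpath(os.path.join("/", url))
print(mangled)  # '/gs:/snakemake-cache-dinosaur/intel-mpi/pi.calc'
```

This matches the value above exactly, which points at path-normalization code being applied to the storage query.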

I'm looking at the base here https://github.com/snakemake/snakemake-interface-storage-plugins/blob/8b1156382459318a1bce27aa1dd16a7a40da8e06/snakemake_interface_storage_plugins/storage_object.py#L60-L72

If I had to guess, the storage prefix in the class here is not set, and that combined with the parsing here https://github.com/snakemake/snakemake/blob/fc252c80227e75a4fcf869f828d3c7d5d066f794/snakemake/storage.py#L79 is leading to the mangling.

As a follow-up: since a remote executor requires --default-storage-provider, how do I specify that the first file should be local? E.g., the first step should upload that pi_MPI.c to the workspace (at least that is what we are doing now and how I did it with GLS). I'd want to set the default remote prefix but still specify a starting file to be local (or looked for locally). Or are we using a different strategy that avoids uploading a workflow cache?

Ping @johanneskoester

Update: confirmed the query is already mangled when it is passed into the StorageObjectBase init; going to trace backwards from there.

vsoch commented 11 months ago

okay found it! The mangling is happening here: https://github.com/snakemake/snakemake/blob/fc252c80227e75a4fcf869f828d3c7d5d066f794/snakemake/path_modifier.py#L111

I think I probably want a "local" flag so the path never reaches that code, so I'll look up how to do that. If that's not it, I need to figure out whether this is a bug or a "Vanessa doesn't know how to write this Snakefile" error. :laughing: Going to have some dinner first!
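For reference, a sketch of what such a flag could look like: Snakemake 8 provides a local() wrapper that exempts a file from the default storage provider, so it is resolved on the local filesystem (syntax assumed from the Snakemake 8 storage docs; the filename is illustrative):

```python
rule compile:
    input:
        # local() marks the file as exempt from --default-storage-provider,
        # so it is looked up on the local filesystem instead of the remote.
        local("pi_MPI.c"),
    output:
        temp("pi_MPI"),
    shell:
        "mpicc -o {output} {input}"
```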

johanneskoester commented 10 months ago

You would need to put a `storage(...)` call around those lines with the gs (now gcs) URLs.
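For illustration, a minimal sketch of the suggested fix applied to the Snakefile above, wrapping the remote URL in storage() (Snakemake >= 8 syntax; bucket name taken from the issue):

```python
rule all:
    input:
        # storage() tells Snakemake to treat this as a storage query
        # rather than a local path, avoiding the path mangling.
        storage("gs://snakemake-cache-dinosaur/intel-mpi/pi.calc"),
```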