snakemake / snakemake

This is the development home of the workflow management system Snakemake. For general information, see
https://snakemake.github.io

rule option to automatically shorten pathnames passed to commands #1760

Open notestaff opened 1 year ago

notestaff commented 1 year ago

Is your feature request related to a problem? Please describe. Pathnames in Snakemake can get very long, and command lines even longer. Encoding metadata explicitly in pathnames helps with data organization, but it can break tools that weren't written to handle arbitrarily long input/output paths. Data corruption can happen when long paths are stored in too-short memory buffers, overwriting nearby data. Long pathnames can also exceed system limits on the maximum length of filenames, pathnames, or command lines. All of this goes against Snakemake's premise of making workflows runnable independently of system details, adds mental load on the user to manually keep pathnames short, and limits users' ability to fully represent metadata in comprehensible pathnames.

Describe the solution you'd like A rule option "shorten_paths: True", which symlinks or hardlinks each rule input/output file to one with a shorter pathname (but the same extension) and replaces the corresponding name in the input/output collections before evaluating the rule body. When the rule has run, Snakemake moves each short-named output file back to the corresponding original output file (similarly to what is done for shadow: "shallow" rules). The short name could be computed from a checksum of the full original pathname.
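
To make the idea concrete, here is a rough sketch (not existing Snakemake functionality; the helper name and the .snakemake_shortpaths directory are made up) of how a checksum-based short name could be derived and linked for an input file:

import hashlib
from pathlib import Path

def short_link(long_path, workdir=".snakemake_shortpaths"):
    """Symlink long_path to a short, checksum-derived name (same extension) and return it."""
    p = Path(long_path)
    digest = hashlib.sha1(str(p.resolve()).encode()).hexdigest()[:12]
    short = Path(workdir) / (digest + "".join(p.suffixes))
    short.parent.mkdir(parents=True, exist_ok=True)
    if not short.exists():
        short.symlink_to(p.resolve())
    return str(short)

Output files would need the reverse step: run the rule on the short name and move the result back to the original path afterwards.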

Describe alternatives you've considered I've tried manually changing rules to use shorter names, but this comes at the cost of more cryptic / less informative pathnames. Also, encoding less metadata in the filenames reduces the ability to represent fine-grained dependencies and avoid needless recomputation.

Additional context

dariober commented 1 year ago

Out of curiosity, would you be able to post an example of a real/realistic workflow producing such long paths?

notestaff commented 1 year ago

The workflow that's causing the issue is for comparing the results of two NGS pipelines across a range of datasets and conditions. So filenames look like "results/{PROJECT}/{POOL}/{SUBSAMP}/{PIPECONF}/{PIPENAME}/some_file.tsv" representing results for a given project (dataset), for a given sequencing pool, for a given subsampling strategy, for a given pipeline configuration, for a given pipeline name. Files representing comparison results between pairs of configurations might have two PIPECONFs in the pathname. The names of projects, pools (samples) and pipeline configurations need to be somewhat long to be comprehensible. Shell commands run by the pipeline often need to refer to several of these files.

The long filenames are causing problems in the zUMIs pipeline, which (1) constructs and runs long command lines, and (2) passes long filenames to an internal compiled library (fcountslib2, part of Rsubread) which segfaults when passed long filenames.

dariober commented 1 year ago

I see your point about having descriptive paths and filenames. However, if names become so long that the OS cannot handle them, I suspect a human would also find it difficult to make sense of them.

Going mostly by gut feeling, I would suggest making Snakemake output semi-cryptic, short filenames and, in parallel, having the pipeline record what is what in e.g. a CSV file, a YAML file, or even an SQLite database if things get really complex. Users would then refer to that sample sheet/database to orient themselves. I mean, cramming a lot of information into a name becomes unmanageable at some point regardless of the limitations of the software.
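
As a rough illustration of that idea (the file name and fields below are invented), the short IDs and their meanings could be written out once and looked up later:

import csv
import hashlib

def short_id(*fields):
    """Derive a short, stable ID from the descriptive metadata fields."""
    return hashlib.sha1("/".join(fields).encode()).hexdigest()[:10]

rows = [
    ("projA", "pool1", "subsamp50", "confX", "zUMIs"),
    ("projA", "pool1", "subsamp50", "confY", "zUMIs"),
]
with open("results/sample_sheet.csv", "w", newline="") as fh:
    writer = csv.writer(fh)
    writer.writerow(["id", "project", "pool", "subsamp", "pipeconf", "pipename"])
    for row in rows:
        writer.writerow([short_id(*row), *row])

Rules would then address files as results/{id}/... while the sheet keeps the full description.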

notestaff commented 1 year ago

Humans rarely need to "make sense" of whole long pathnames at once. They just follow one path component at a time. I've had no trouble with the pretty straightforward data organization scheme in my pipeline. I think the point where tools start choking on long pathnames (and long constructed shell commands) comes sooner than the point where humans start having trouble.

pvandyken commented 1 year ago

If it's one specific tool that's causing trouble, you could probably work around it manually within the rule by using shadow: "shallow" and adding something like this to your shell:

shell:
    "ln -s {input.foo} shortname.ext; "
    "main_command shortname.ext {output}"

Because of the shadow dir, shortname.ext will get automatically deleted when the job finishes and there won't be any overlap with other jobs.

If many rules require the above pattern, you should be able to write a function to generate it as well.
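
For instance (a rough sketch; link_then_run and the file names are invented), a helper could build the linking prefix and be used directly in the shell directive:

def link_then_run(links, command):
    """Return a shell string that symlinks each (input placeholder, short name) pair
    and then runs command; intended for rules with shadow: "shallow"."""
    prefix = "; ".join(f"ln -s {src} {short}" for src, short in links)
    return f"{prefix}; {command}"

rule run_tool:
    input:
        foo="results/{project}/{pool}/{subsamp}/{pipeconf}/{pipename}/some_file.tsv",
    output:
        "results/{project}/{pool}/{subsamp}/{pipeconf}/{pipename}/out.tsv",
    shadow:
        "shallow"
    shell:
        link_then_run([("{input.foo}", "short.tsv")], "main_command short.tsv {output}")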

MatteoLacki commented 3 months ago

I also ran into the problem of long paths. To work around it, I am currently writing custom output-path parsing functions that include path compression and decompression. The problem of long paths occurs when you have pipelines with a lot of rules, each adding a few wildcards to the game. This gets tricky when some rules must reuse outputs of previous rules or of some other branch of the DAG.

Example:

def match_precursors_and_fragments_path_parser(wildcards):
    # decompress, compress and extract_outermost_brackets are custom helpers (see below)
    combined_path = decompress(str(wildcards.combined_path))
    ms1stats_path, ms2stats_path, ms1clusters_path, ms2clusters_path = map(
        compress, extract_outermost_brackets(combined_path)
    )
    return dict(
        MS1_stats=f"partial/cluster_stats/ms_level=1/{ms1stats_path}/cluster_stats.parquet",
        MS2_stats=f"partial/cluster_stats/ms_level=2/{ms2stats_path}/cluster_stats.parquet",
        MS1_clusters=f"partial/clusters/ms_level=1/{ms1clusters_path}/clusters.startrek",
        MS2_clusters=f"partial/clusters/ms_level=2/{ms2clusters_path}/clusters.startrek",
    )

rule match_precursors_and_fragments:
    input:
        unpack(match_precursors_and_fragments_path_parser),
        script="configs/matching/matches={matches}/{matches}.py",
        config="configs/matching/matches={matches}/matches_config={matches_config}.toml",
    output:
        TEMP(directory("partial/edges/matches={matches}/matches_config={matches_config}/{combined_path}/edges.startrek"), when_global_lower_than=3),
    wildcard_constraints:
        matches=r"[^/]+",
        matches_config=r"[^/]+",
    shell:
        "python {input.script} {input.MS1_stats} {input.MS2_stats} {input.MS1_clusters} {input.MS2_clusters} {input.config} {output}"

Above, I also simply don't use any regexes (the devil invented those). While constructing paths (in a separate script, because who can remember what the output of 40 rules will look like), I simply use brackets to tell the different parts apart. This makes the paths very long but quite repetitive, hence the compression.
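
A guess at what a helper like extract_outermost_brackets could look like (assuming the parts are delimited by square brackets; the actual implementation is not shown here):

def extract_outermost_brackets(s):
    """Return the substrings enclosed by each outermost [...] pair in s, in order."""
    parts, depth, start = [], 0, 0
    for i, c in enumerate(s):
        if c == "[":
            if depth == 0:
                start = i + 1
            depth += 1
        elif c == "]":
            depth -= 1
            if depth == 0:
                parts.append(s[start:i])
    return parts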

On Linux there are natural limits: a single file or directory name cannot exceed 255 bytes, and the whole path cannot exceed 4096 bytes.

Any better ideas for long pipelines?

MatteoLacki commented 3 months ago

The decompress and compress functions use Brotli + base64. TEMP above is a custom wrapper around temp that distinguishes levels of temporariness.
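
For reference, a minimal sketch of what such helpers might look like (assuming the brotli package from PyPI; the actual implementation may differ):

import base64
import brotli  # assumption: the brotli package from PyPI

def compress(path_fragment):
    """Shrink a long path fragment into a short, file-system-safe token."""
    raw = brotli.compress(path_fragment.encode("utf-8"))
    # urlsafe base64 avoids '/', so the token stays a single path component
    return base64.urlsafe_b64encode(raw).decode("ascii")

def decompress(token):
    """Invert compress(): recover the original path fragment."""
    raw = base64.urlsafe_b64decode(token.encode("ascii"))
    return brotli.decompress(raw).decode("utf-8")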

MatteoLacki commented 3 months ago

Humans rarely need to "make sense" of whole long pathnames at once. They just follow one path component at a time. I've had no trouble with the pretty straightforward data organization scheme in my pipeline. I think the point where tools start choking on long pathnames (and long constructed shell commands) comes sooner than the point where humans start having trouble.

I also agree with that: my solution for human-readability is to add rules responsible for collecting results from the partial, human-unreadable paths into a few readable folders. This works best on file systems with copy-on-write (COW) for big files.
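
For example (a rough sketch; the rule, paths and wildcard names are invented), such a collection rule could clone results into a readable location on a COW file system like Btrfs or XFS:

# hypothetical: rebuild the compressed partial path from a readable wildcard
def collected_input(wildcards):
    return f"partial/edges/{compress(wildcards.readable_name)}/edges.startrek"

rule collect_readable:
    input:
        collected_input,
    output:
        "collected/{readable_name}/edges.startrek",
    # cp --reflink=auto makes a cheap copy-on-write clone where the file system supports it
    shell:
        "cp --reflink=auto {input} {output}"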