Follow Snakemake recommended folder structure

kbseah commented 9 months ago

Snakemake recommended folder structure: https://snakemake.readthedocs.io/en/stable/snakefiles/deployment.html#distribution-and-reproducibility

I think that the workflow (rules etc.) should be in a dedicated subfolder. The pipeline should simply be forked and modified for each new dataset (alternatively, it could be uploaded to WorkflowHub in the future).
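As a rough illustration of what "following the recommended structure" would mean in practice, here is a minimal Python sketch that reports which parts of the layout are missing from a repo. The directory and file names are a subset of the layout described in the Snakemake docs linked above; the function name `missing_layout_entries` is made up for this example and is not part of Snakemake or the pipeline.

```python
from pathlib import Path

# Subset of the layout recommended in the Snakemake documentation.
RECOMMENDED_DIRS = [
    "workflow/rules",
    "workflow/envs",
    "workflow/scripts",
    "config",
    "results",
    "resources",
]
RECOMMENDED_FILES = ["workflow/Snakefile", "config/config.yaml"]


def missing_layout_entries(repo_root):
    """Return the recommended directories and files that are absent under repo_root.

    Hypothetical helper for illustration only.
    """
    root = Path(repo_root)
    missing = [d for d in RECOMMENDED_DIRS if not (root / d).is_dir()]
    missing += [f for f in RECOMMENDED_FILES if not (root / f).is_file()]
    return missing


if __name__ == "__main__":
    # Check the current directory and print anything that is missing.
    for entry in missing_layout_entries("."):
        print("missing:", entry)
```

Running this from the repo root would list the entries that still need to be created or moved.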

Detailed explanation:

Current usage scenario of the pipeline: the user clones a single copy of the workflow and uses that same copy to analyze multiple datasets by writing an individual config file for each. The input and output paths for each dataset are specified in the config files and are independent of the workflow, i.e. the input/output folders are not necessarily subfolders of the workflow. To accommodate this usage pattern, the workdir is specified manually and is not necessarily the path from which the Snakemake command is run.
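To make the three independent paths in this scenario explicit, here is a small Python sketch that assembles the corresponding Snakemake command line. It assumes the workdir is passed with `--directory` (it could equally be set inside the workflow) and that Conda envs are enabled with `--use-conda`; the function name and all paths are placeholders, not taken from the pipeline.

```python
import shlex


def snakemake_command(snakefile, configfile, workdir, cores=1):
    """Build the argument list for one dataset.

    Hypothetical helper: the Snakefile, config file, and workdir are
    three separate paths that need not share a parent folder.
    """
    return [
        "snakemake",
        "--snakefile", str(snakefile),    # single shared copy of the workflow
        "--configfile", str(configfile),  # one config file per dataset
        "--directory", str(workdir),      # manually specified workdir
        "--cores", str(cores),
        "--use-conda",
    ]


if __name__ == "__main__":
    # Placeholder paths for illustration only.
    cmd = snakemake_command(
        "/path/to/pipeline/Snakefile",
        "/data/dataset_A/config.yaml",
        "/data/dataset_A/work",
    )
    print(shlex.join(cmd))
```

Note that the directory from which this command is launched is yet another, fourth location, which is what makes the path management discussed below hard to follow.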

The original motivation AFAIK was:

- Avoid having to repeatedly clone the pipeline and create the same conda environments again for each dataset, which may clutter file server storage space;
- Reduce complexity for users: they accept the pipeline as-is and all options are specified in the config files, which are portable.

However,

- There are not much real storage space savings, because Conda already handles this under the hood using hard links.
- Versioning becomes more complicated: if a user has already run the pipeline on some datasets and then updates the pipeline, they have to remember which datasets have been re-run with the updated version and which have not.
- Path management becomes messy because the workdir is specified separately from the path where Snakemake is actually run, which in turn may differ from the pipeline workflow path. Snakemake implicitly creates a hidden .snakemake folder in the folder where Snakemake is invoked (for logs of the Snakemake runs), but it also creates a hidden .snakemake folder in the manually specified workdir for the Conda envs and rule-specific logs.
  Therefore, in order to troubleshoot, users have to remember which path they ran Snakemake from and track down log files in two different hidden folders. If Snakemake v6 is called from the "wrong" path, pipeline v2.0.0 hits a bug in which the path to the Python script in the workflow script subfolder is resolved incorrectly, even though the workdir and Snakefile paths are specified. (The bug disappears with Snakemake v7+.)
- Users may specify absolute paths, which goes against the design assumptions of Snakemake and causes unexpected problems.
- The --archive option cannot be used with the pipeline as-is, because it assumes that the repo follows the recommended structure.
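For the troubleshooting problem with the two hidden folders, a small helper along these lines could at least show where the .snakemake folders ended up. This is only a sketch: `find_snakemake_dirs` is a made-up name, and it simply lists whatever it finds without assuming which folder holds which kind of log.

```python
from pathlib import Path


def find_snakemake_dirs(*candidate_dirs):
    """Map each existing .snakemake folder to the sorted names of its entries.

    Hypothetical helper: pass the invocation path, the workdir, and the
    workflow path as candidates; folders without .snakemake are skipped.
    """
    found = {}
    for directory in candidate_dirs:
        hidden = Path(directory) / ".snakemake"
        if hidden.is_dir():
            found[str(hidden)] = sorted(p.name for p in hidden.iterdir())
    return found


if __name__ == "__main__":
    import sys

    # Usage: python find_dirs.py <invocation_dir> <workdir> [...]
    for path, entries in find_snakemake_dirs(*sys.argv[1:]).items():
        print(path, entries)
```

With the recommended structure, the invocation path and workdir would normally coincide, so there would be only one such folder to look in.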

kbseah commented 9 months ago

Rules for inclusion in the Snakemake workflow catalog: https://snakemake.github.io/snakemake-workflow-catalog/?rules=true

kbseah commented 8 months ago

Observed with Snakemake 8:

Environment variables are dumped in a "wall of text". It is unclear what triggers this; it seems to be what is reported here: https://github.com/snakemake/snakemake/issues/2624

monagrland / MB_Pipeline

Follow Snakemake recommended folder structure #17