uniqueg opened 1 year ago
Thanks for the suggestion. Indeed, there is some overhead in calling Snakemake again within the job. The benefit is the increased flexibility: it works with any storage plugin; supports shell, script, notebook, and wrapper directives; and covers all the different software deployment backends (Conda, Apptainer, later Nix, etc.). I don't really see, though, that a change in the direction you propose needs a separate flavor of the TES executor, for the following reasons:
All in all, yes, I agree that it makes sense to go this way; I would just say it is not something we should do within a plugin but for Snakemake itself.
One final thing to note: Snakemake supports grouping jobs together (DAG partitioning). In such a case, there is really no way around a mini-workflow invocation. This feature is very important for maximizing and fine-tuning the performance of real-world workflows on cluster and cloud, in particular to limit IO and network traffic. One could simply fall back to the current approach in such cases. You should keep in mind, though, that this feature is used quite extensively in sophisticated production workflows.
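To illustrate the fallback idea, here is a minimal sketch of how an executor could pick a submission strategy per job: native submission for plain shell jobs, and a wrapping Snakemake call for grouped jobs (DAG partitions). The attribute names (`is_group`, `shell_command`, `target_specs`) and the exact CLI flags are illustrative assumptions, not the actual Snakemake executor plugin interface.

```python
# Hypothetical sketch: choose the submission strategy per job.
# `is_group`, `shell_command`, and `target_specs` are made-up attribute
# names for illustration; they are not the real plugin interface.

def build_command(job):
    if getattr(job, "is_group", False):
        # Grouped jobs (DAG partitions) still need a mini-workflow
        # invocation, so fall back to wrapping them in a Snakemake call.
        return ["snakemake", "--target-jobs", *job.target_specs]
    # A single shell job can be submitted natively, without the wrapper.
    return ["/bin/sh", "-c", job.shell_command]
```

The point is that the native path and the wrapped path can coexist behind one dispatch function, so grouping support does not block a native executor.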
Thanks @johanneskoester, very useful feedback!
To comment on your points:
If you are interested, we would be happy to discuss this in one of the upcoming TES meetings. Please let me know if you are interested and we will reach out to schedule something.
Yes, let's meet! Very interesting thoughts. Just send me an email at the public email address.
1. Sure, it is an iterative process to me. There will always be cases where Snakemake needs the entire source tree for a step. For example, users might have scripts that import a module, etc. These things are very hard to disentangle. Snakemake allows so many things beyond what classical approaches to WMS allow; it is not just shell commands. So I guess we can never get rid of this entirely, but we probably can get rid of it for the vast majority of jobs.
4. I did not know about the multiple executors. The problem is that they only run sequentially, but Snakemake can also be configured to have parallelism within job groups (i.e., DAG partitions). However, it knows whether a group is just sequential, and at least in that case it could be easily mapped to TES executors.
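The mapping mentioned here can be sketched as follows: since TES runs the entries of a task's `executors` list in order, a strictly sequential Snakemake job group could map one job per executor within a single TES task. The field names follow the GA4GH TES task schema; the job dictionary keys and example commands are assumptions for illustration.

```python
# Hypothetical sketch: map a strictly sequential job group onto the
# `executors` list of one TES task. TES runs executors in order, so this
# only works when the group has no internal parallelism.

def group_to_tes_executors(jobs):
    return [
        {
            "image": job["container_img"],  # e.g. "ubuntu:22.04"
            "command": ["/bin/sh", "-c", job["shell_command"]],
        }
        for job in jobs
    ]

# Illustrative two-step sequential group:
task = {
    "name": "sequential-group",
    "executors": group_to_tes_executors([
        {"container_img": "ubuntu:22.04", "shell_command": "make step1"},
        {"container_img": "ubuntu:22.04", "shell_command": "make step2"},
    ]),
}
```

Groups with internal parallelism would still need the fallback to a wrapping Snakemake invocation.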
All in all, I am so happy to see TES contributors taking an interest in this plugin! It would be great if you could push it to its limits, and I am happy to extend or upgrade the plugin interface so that it fits the needs of TES even better.
Perfect :) I will reach out with an invite after the holiday break.
Frohe Festtage dir 🎄
Would be good to look into this @uniqueg. Thoughts?
Problem
The current GA4GH TES executor wraps every TES task in a Snakemake command, essentially making them 1-step Snakemake workflows. While this design choice aligned with that of other executors and provides a high degree of compatibility in terms of the features supported by Snakemake, it comes at a considerable cost:

- `tesTask.executors[].command` is a `snakemake` call
- using a `configfile` to forgo changing the workflow descriptors when using different remote storage providers was not supported when I tried (admittedly, those could be errors on my side); a native TES executor could deal with cloud storage instead

Solution
Implement a "native" TES executor, i.e., implement the executor in such a way that commands to be executed are not wrapped by Snakemake. Instead:

- `tesTask.executors[].command` should take the value of the command to be executed,
- `tesTask.executors[].image` should take the value of the (Docker) image or Conda environment (for supported TES implementations) in which the command is to be executed, and
- `tesTask.inputs[]` and `tesTask.outputs[]` should contain the actual command inputs and outputs.

@vsmalladi @MattMcL4475 @svedziok @vschnei @kellrott