snakemake / snakemake-executor-plugin-tes

A Snakemake executor plugin for submitting jobs via GA4GH TES.
MIT License

Implement "native" TES executor #3

Open uniqueg opened 1 year ago

uniqueg commented 1 year ago

Problem

The current GA4GH TES executor wraps every TES task in a Snakemake command, essentially turning each one into a 1-step Snakemake workflow. While this design choice aligns with that of other executors and provides a high degree of compatibility with the features Snakemake supports, it comes at a considerable cost.

Solution

Implement a "native" TES executor, i.e., implement the executor in such a way that the commands to be executed are not wrapped by Snakemake. Instead, tesTask.executors[].command should take the value of the command to be executed, tesTask.executors[].image should take the value of the (Docker) image or Conda environment (for supported TES implementations) in which the command is to be executed, and tesTask.inputs[] and tesTask.outputs[] should contain the actual command inputs and outputs.
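As a rough illustration of the proposed mapping, a native submission for a single shell rule might build a task like the one below. Field names follow the GA4GH TES task schema; the helper function, rule command, image, and URLs are all hypothetical:

```python
# Sketch of a "native" TES task for one Snakemake shell rule: the rule's
# command goes directly into executors[].command instead of being wrapped
# in a `snakemake` invocation. All concrete values here are made up.

def native_tes_task(command, image, input_urls, output_urls):
    """Build a TES task dict that runs `command` directly."""
    return {
        "name": "snakemake-native-job",
        "executors": [
            {
                "image": image,  # container the rule runs in
                "command": ["/bin/sh", "-c", command],
            }
        ],
        "inputs": [
            {"url": url, "path": f"/inputs/{i}"}
            for i, url in enumerate(input_urls)
        ],
        "outputs": [
            {"url": url, "path": f"/outputs/{i}"}
            for i, url in enumerate(output_urls)
        ],
    }

task = native_tes_task(
    "sort /inputs/0 > /outputs/0",
    "ubuntu:22.04",
    ["s3://bucket/data.txt"],
    ["s3://bucket/sorted.txt"],
)
```

The TES implementation would then stage the listed inputs and outputs itself, with no Snakemake process inside the task.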

@vsmalladi @MattMcL4475 @svedziok @vschnei @kellrott

johanneskoester commented 9 months ago

Thanks for the suggestion. Indeed, there is some overhead in calling Snakemake again within the job. The benefit is the increased flexibility: it works with any storage plugin; with shell, script, notebook, and wrapper support; and with all the different software deployment backends (conda, apptainer, later nix, etc.). I don't really see, though, that a change in the direction you propose needs a separate flavor of the TES executor, for the following reasons:

  1. Uploading sources is a concern the way this executor is currently implemented, but it does not need to be like that. The other executors upload sources to a storage backend (like S3) or assume a shared FS (in cluster case). Both approaches do not put a limit on the size of the source files and given that it is only some KB, there is no performance penalty.
  2. If one changes the way TES is used by Snakemake, I am completely fine with doing that within this executor. Hence, no second version would be needed. For prototyping, it would also be fine to create a separate plugin and merge it in here once it works well. This is the nice thing about the plugin move.
  3. I would rather see such a change in the main Snakemake codebase or the executor interface package, because then all executors benefit from it. I have thought about this several times before and it is definitely something we want to do at some point; so far it was simply not the highest priority. One possible way is to have an extremely shallow wrapper for handling storage and software deployment that does not need the entire workflow source, and to strip that down even further if the job is just a shell command. It could be a Python package that just depends on snakemake-storage-plugin-interface and the upcoming snakemake-software-deployment-plugin-interface. In case an executor chooses to provide a certain storage on its own (like TES can do for some), the storage handling can be omitted as well.
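The "extremely shallow wrapper" from point 3 could be sketched roughly as below. None of these class or method names exist in Snakemake today; the real interface would come from snakemake-storage-plugin-interface, and this is only a hedged illustration of the idea:

```python
import subprocess

# Hypothetical shallow job wrapper: stage inputs via a storage plugin,
# run the bare shell command, stage outputs back. No full workflow
# source or Snakemake invocation is needed inside the job.

def run_shallow_job(storage, inputs, outputs, command):
    for remote, local in inputs:
        storage.download(remote, local)  # storage plugin handles transfer
    subprocess.run(command, shell=True, check=True)
    for local, remote in outputs:
        storage.upload(local, remote)

# Toy storage backend that just records transfers, for illustration.
class RecordingStorage:
    def __init__(self):
        self.calls = []
    def download(self, remote, local):
        self.calls.append(("download", remote, local))
    def upload(self, local, remote):
        self.calls.append(("upload", local, remote))

storage = RecordingStorage()
run_shallow_job(
    storage,
    inputs=[("s3://bucket/in.txt", "/tmp/in.txt")],
    outputs=[("/tmp/out.txt", "s3://bucket/out.txt")],
    command="true",  # placeholder for the rule's actual shell command
)
```

If the executor's backend already stages files itself (as TES does), the `inputs`/`outputs` loops could be skipped entirely, leaving just the command invocation.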

All in all, yes I agree that it makes sense to go this way, I would just say it is not something we should do within a plugin but for Snakemake itself.

johanneskoester commented 9 months ago

One final thing to note: Snakemake supports grouping jobs together (DAG partitioning). In such a case, there is really no way around a mini-workflow invocation. This feature is very important for maximizing and fine-tuning the performance of real-world workflows on clusters and in the cloud, in particular to limit IO and network traffic. One could simply fall back to the current approach in that case. You should keep in mind, though, that this feature can be used quite heavily in sophisticated production workflows.
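The fallback described above could amount to a simple dispatch like the following sketch (function and mode names are made up for illustration):

```python
# Hypothetical dispatch: grouped jobs (DAG partitions) keep the current
# snakemake-wrapped invocation, while single plain shell jobs could be
# submitted as native TES tasks.

def submission_mode(is_group_job: bool, is_plain_shell: bool) -> str:
    if is_group_job or not is_plain_shell:
        return "wrapped"  # re-invoke snakemake inside the TES task
    return "native"       # run the rule's command directly
```

Script, notebook, and wrapper jobs would likewise fall into the "wrapped" branch, since they need the workflow machinery inside the task.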

uniqueg commented 9 months ago

Thanks @johanneskoester, very useful feedback!

To comment on your points:

  1. Agreed that the limitations on file transfers and directory layout can probably be solved in the current implementation through some workarounds. I still don't think that a TES implementation should bother with workflow-related files (strictly/semantically speaking that would rather be a case for the WES API, but then we'd end up in an infinite loop).
  2. That makes sense, especially since, at least at this point, most TES implementers are well connected, so that we can get all the stakeholders on board to discuss the best way forward.
  3. That's interesting! Speaking from a TES perspective, stripping things down to what's really needed makes a lot of sense, because only then could we make use of TES's potential to fully abstract away the handling of storage and software deployment from the workflow engine. The tricky part here will probably be covering the situations where simply passing through file URIs isn't enough, e.g., if output introspection is required for job scheduling or job grouping. We would then need to make sure that these cases are handled through the executor as well.
  4. Very good point. DAG partitioning could probably be handled through the WES API and/or the use of multiple executors (in the TES sense). Btw, multiple executors could also be used for staging workflow or input/output files, sniffing metadata for the workflow engine, etc., so it's useful to keep in mind that TES supports the execution of multiple commands in multiple environments on a common file system.
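The multiple-executors idea from point 4 can be sketched as a single TES task whose executors run sequentially on a shared volume: one stages an auxiliary file, one runs the actual rule command, and one collects metadata for the engine. This is a hedged example; all images, commands, and URLs are hypothetical:

```python
# Sketch of one TES task with multiple sequential executors sharing a
# volume, per point 4. Everything concrete here is made up.

task = {
    "name": "multi-executor-example",
    "executors": [
        {   # stage an auxiliary file onto the shared volume
            "image": "curlimages/curl",
            "command": ["curl", "-o", "/vol/helper.py",
                        "https://example.org/helper.py"],
        },
        {   # run the actual rule command
            "image": "ubuntu:22.04",
            "command": ["/bin/sh", "-c", "sort /vol/in.txt > /vol/out.txt"],
        },
        {   # sniff metadata the scheduler may need (e.g., output size)
            "image": "ubuntu:22.04",
            "command": ["/bin/sh", "-c",
                        "stat -c %s /vol/out.txt > /vol/out.meta"],
        },
    ],
    "volumes": ["/vol"],
}
```

Because the executors run strictly in sequence, this maps naturally onto sequential job groups, but not onto groups with internal parallelism.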

If you are interested, we would be happy to discuss this in one of the upcoming TES meetings. Please let me know if you are interested and we will reach out to schedule something.

johanneskoester commented 9 months ago

Yes, let's meet! Very interesting thoughts. Just send me an email at the public email address.

  1. Sure, it is an iterative process to me. There will always be cases where Snakemake needs the entire source tree for a step. For example, users might have scripts that import a module, etc. These things are very hard to disentangle. Snakemake allows so many things beyond what classical approaches to WMS allow; it is not just shell commands. So I guess we can never get rid of this entirely, but we can probably get rid of it for the vast majority of jobs.
  4. I did not know about the multiple executors. The problem is that they only run sequentially, but Snakemake can also be configured to have parallelism within job groups (i.e., DAG partitions). However, it knows whether a group is just sequential, and at least in that case it could easily be mapped to TES executors.

All in all, I am so happy to see TES contributors getting interest in this plugin! Would be great if you could push it to the boundaries, and I am happy to extend or upgrade the plugin interface so that it fits the needs of TES even better.

uniqueg commented 9 months ago

Perfect :) I will reach out with an invite after the holiday break.

Frohe Festtage dir 🎄

vsmalladi commented 2 days ago

Would be good to look into this @uniqueg. Thoughts?