omnibenchmark / omni_workflow


Refactor workflow engine to be able to serialize snakefiles on the fly #10

Open DanInci opened 1 month ago

DanInci commented 1 month ago

Refactor the workflow engine code so that the Snakemake file is generated on the fly. The following generic interface is provided for workflow execution:

import os


class WorkflowEngine:
    """Interface for the workflow engine."""

    def run_workflow(self, benchmark: Benchmark, cores: int = 1, dryrun: bool = False, work_dir: str = os.getcwd(), **kwargs):
        """
        Serializes & runs benchmark workflow.

        Args:
            benchmark (Benchmark): benchmark to run
            cores (int): number of cores to use
            dryrun (bool): validate the workflow without actually executing it
            work_dir (str): working directory
            **kwargs: keyword arguments to pass to the workflow engine

        Returns:
        - Status code of the workflow run.
        """
        raise NotImplementedError("Method not implemented yet")

    def serialize_workflow(self, benchmark: Benchmark, output_dir: str = os.getcwd()):
        """
        Serializes a workflow file for the benchmark.

        Args:
            benchmark (Benchmark): benchmark to serialize
            output_dir (str): output directory for the workflow file

        Returns:
        - Workflow file path.
        """
        raise NotImplementedError("Method not implemented yet")

    def run_node_workflow(self, node: BenchmarkNode, cores: int = 1, dryrun: bool = False, work_dir: str = os.getcwd(), **kwargs):
        """
        Serializes & runs benchmark node workflow.

        Args:
            node (BenchmarkNode): benchmark node to run
            cores (int): number of cores to use
            dryrun (bool): validate the workflow without actually executing it
            work_dir (str): working directory
            **kwargs: keyword arguments to pass to the workflow engine

        Returns:
        - Status code of the workflow run.
        """
        raise NotImplementedError("Method not implemented yet")

    def serialize_node_workflow(self, node: BenchmarkNode, output_dir: str = os.getcwd()):
        """
        Serializes a workflow file for a benchmark node.

        Args:
            node (BenchmarkNode): benchmark node to serialize
            output_dir (str): output directory for the workflow file

        Returns:
        - Workflow file path.
        """
        raise NotImplementedError("Method not implemented yet")
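
For illustration, here is a minimal sketch of how a Snakemake-backed implementation of this interface could serialize a Snakefile on the fly and hand it to the snakemake CLI. The SnakemakeEngine name and the _render_snakefile helper are hypothetical placeholders, not the actual implementation:

import os
import subprocess


class SnakemakeEngine(WorkflowEngine):
    """Sketch of a Snakemake-backed engine (illustrative only)."""

    def serialize_workflow(self, benchmark: Benchmark, output_dir: str = os.getcwd()) -> str:
        # Render a Snakefile from the validated benchmark model and write it to disk.
        snakefile_path = os.path.join(output_dir, "Snakefile")
        with open(snakefile_path, "w") as f:
            f.write(self._render_snakefile(benchmark))  # hypothetical template rendering
        return snakefile_path

    def run_workflow(self, benchmark: Benchmark, cores: int = 1, dryrun: bool = False,
                     work_dir: str = os.getcwd(), **kwargs) -> int:
        # Serialize first, then delegate execution to the snakemake CLI.
        snakefile = self.serialize_workflow(benchmark, output_dir=work_dir)
        cmd = ["snakemake", "--snakefile", snakefile, "--cores", str(cores), "--directory", work_dir]
        if dryrun:
            cmd.append("--dry-run")
        return subprocess.run(cmd).returncode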
almutlue commented 1 month ago

Two quick comments (proper code review will follow):

  1. Type hints and docstrings would really help code readability
  2. There are some controversies around using pickle (mainly for security reasons, but also because objects will not be compatible across versions/systems). What is the advantage of using it here? Why not start with the benchmark.yaml as input and instantiate the benchmark/node class from there?
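
To make the suggestion in point 2 concrete, a rough sketch of the yaml-based initialization; the Benchmark constructor shown here is hypothetical, the real model class may build and validate itself differently:

import yaml


def init_benchmark(benchmark_yaml: str) -> Benchmark:
    # Build the benchmark model straight from benchmark.yaml instead of
    # passing a pickled Benchmark object to the workflow engine.
    with open(benchmark_yaml) as f:
        definition = yaml.safe_load(f)
    return Benchmark(**definition)  # hypothetical constructor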
DanInci commented 3 weeks ago

> Two quick comments (proper code review will follow):
>
>   1. Type hints and docstrings would really help code readability
>   2. There are some controversies around using pickle (mainly for security reasons, but also because objects will not be compatible across versions/systems). What is the advantage of using it here? Why not start with the benchmark.yaml as input and instantiate the benchmark/node class from there?
  1. I agree, will add them.
  2. I was not aware of the security issues, thanks for sharing. My plan was to design the interface so that it accepts the already constructed and validated benchmark model and uses it in the workflow engine. The pickling was meant to avoid deconstructing & reconstructing the model again; there is no other harm in it beyond doing that work twice.
DanInci commented 2 weeks ago

> Thanks a lot for this very extensive PR.
>
> I added some questions/suggestions. They are mainly about reducing duplication to keep the code simple and clean. As more general comments:
>
>   1. Why not start from the benchmark.yaml file and initialize the WorkflowEngine from there? This way we could also skip reassigning the benchmark.yaml file as a Benchmark/Converter attribute.
>   2. Raise errors with a clear error message instead of using assert.
>   3. Type hints are also used to describe return values, e.g. def func(arg: argtype) -> returntype:. I think this would make the code much easier to read.
>   4. There are quite a few places where the code looks brittle and will probably fail upon small changes, e.g. in the data schema or the path description. I tried to point them out, but probably missed some. In general it is good to, e.g., use named arguments instead of positional calls. I would also try to avoid string manipulation as much as possible.
  1. My intuition tells me that the workflow engine should be called only after the benchmark file has been validated and the benchmark model constructed. This is reflected by the module's interface, which accepts only benchmark: Benchmark or node: BenchmarkNode. The alternative would allow direct use of the module without the proper validations, and in that case I'd argue it is better to reflect the desired responsibility through the interface. The need to access the benchmark YAML is just an implementation detail of Snakemake; with other workflow languages we might not need access to it at all.
  2. Good point, the asserts were leftover from my prototyping phase.
  3. I added some more type hints and refactored some of the brittle code. It should be better now; the sketch after this list shows the pattern.
  4. Unfortunately, I think most of the string manipulations are needed. They are an implementation detail of how the Snakemake magic works, and I tried as much as I could to contain them in a few private methods. I did improve some of the parsing logic, though, so hopefully it's better now.
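
For reference, a small sketch of the pattern behind points 2 and 3 (explicit errors instead of assert, plus a return type hint). The body is illustrative only and shown as a free function for brevity, not the actual serialization code:

import os


def serialize_workflow(benchmark: Benchmark, output_dir: str = os.getcwd()) -> str:
    # The return type hint documents that a workflow file path comes back.
    if not os.path.isdir(output_dir):
        # Clear error message instead of `assert os.path.isdir(output_dir)`.
        raise ValueError(f"Output directory does not exist: {output_dir}")
    snakefile_path = os.path.join(output_dir, "Snakefile")
    # ... actual serialization omitted ...
    return snakefile_path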