Revisit compiler VS runner functionality

GeorgesLorre commented 11 months ago

Running a Fondant pipeline on one of our supported runners is currently a two-step affair:

Compile

located at /src/fondant/pipeline/compiler.py

We provide an abstract Compiler class that is reimplemented for each framework. Its main input is a fondant Pipeline object, plus some framework-specific arguments. The compiler will/should:

Validate the pipeline it receives (this is still quite basic but can be improved)
Set up the cache keys and metadata
Apply framework-specific configuration
Convert the pipeline to a representation that the framework understands (docker-compose.yml, IR YAML, Sagemaker spec)
Save the representation to a local file
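
As a rough illustration of the five steps above, here is a minimal, self-contained sketch. It does not use the real Fondant classes: `SketchCompiler`, `ToyComposeCompiler`, the dict-based pipeline, and the idea that a cache key is a hash of the component definition are all assumptions made for the example.

```python
import hashlib
import json
from abc import ABC, abstractmethod
from typing import Any, Dict


class SketchCompiler(ABC):
    """Hypothetical base class; not the actual Fondant Compiler."""

    def compile(self, pipeline: Dict[str, Any], output_path: str, **framework_args: Any) -> None:
        self._validate(pipeline)                    # 1. validate
        metadata = self._build_metadata(pipeline)   # 2. cache keys and metadata
        # 3 + 4. apply framework arguments and convert to the framework format
        spec = self._to_framework_spec(pipeline, metadata, **framework_args)
        with open(output_path, "w") as f:           # 5. save to a local file
            json.dump(spec, f, indent=2)

    @staticmethod
    def _validate(pipeline: Dict[str, Any]) -> None:
        names = [c["name"] for c in pipeline.get("components", [])]
        if not names:
            raise ValueError("pipeline has no components")
        if len(names) != len(set(names)):
            raise ValueError("component names must be unique")

    @staticmethod
    def _build_metadata(pipeline: Dict[str, Any]) -> Dict[str, Dict[str, str]]:
        # Assumption: the cache key is a hash of the component definition.
        return {
            c["name"]: {
                "cache_key": hashlib.sha256(
                    json.dumps(c, sort_keys=True).encode()
                ).hexdigest()
            }
            for c in pipeline["components"]
        }

    @abstractmethod
    def _to_framework_spec(
        self, pipeline: Dict[str, Any], metadata: Dict[str, Any], **framework_args: Any
    ) -> Dict[str, Any]:
        ...


class ToyComposeCompiler(SketchCompiler):
    """Toy subclass producing a docker-compose-like dict (illustrative only)."""

    def _to_framework_spec(self, pipeline, metadata, extra_volumes=()):
        return {
            "services": {
                c["name"]: {
                    "image": c["image"],
                    "volumes": list(extra_volumes),
                    "labels": metadata[c["name"]],
                }
                for c in pipeline["components"]
            }
        }
```

Only the conversion step differs per framework in this sketch; validation, metadata, and saving live in the base class.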

Run

located at /src/fondant/pipeline/runner.py

Again, there is an abstract Runner class to be implemented for each framework. It takes a file as input: the output of the compiler for the same framework. The runner is responsible for the following:

Initialising the framework's SDK
Setting up a client (this can be with custom credentials, etc.)
Submitting the spec
Handling the logic to create/update pipelines
Following up on progress or generating a URL to the framework UI
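
A minimal sketch of these responsibilities, using an in-memory stand-in instead of a real framework SDK. `FakeClient`, `SketchRunner`, the `name` field in the spec, and the example URL are all hypothetical and only serve to show the create/update decision.

```python
import json
from typing import Any, Dict


class FakeClient:
    """In-memory stand-in for a framework SDK client (hypothetical)."""

    def __init__(self, credentials: str = "default") -> None:
        self.credentials = credentials
        self.pipelines: Dict[str, Dict[str, Any]] = {}

    def upsert(self, name: str, spec: Dict[str, Any]) -> str:
        # Create the pipeline if it is new, otherwise update it.
        action = "updated" if name in self.pipelines else "created"
        self.pipelines[name] = spec
        return action


class SketchRunner:
    """Hypothetical runner: reads a compiled spec file and submits it."""

    def __init__(self, client: FakeClient) -> None:
        self.client = client

    def run(self, input_spec: str) -> Dict[str, str]:
        with open(input_spec) as f:
            spec = json.load(f)
        action = self.client.upsert(spec["name"], spec)
        # Placeholder URL; a real runner would ask the framework for it.
        url = f"https://ui.example.invalid/pipelines/{spec['name']}"
        return {"action": action, "url": url}
```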

Invoking the compiler/runner can be done in one of two ways:

CLI

located at /src/fondant/cli.py

The CLI is there to make it really easy to run a fondant pipeline on a framework. There are separate commands for compile and run, which just call the above. Note that the run CLI command can execute a compile before running, depending on whether it received a fondant Pipeline or an already compiled spec.
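
The "compile first if needed" behaviour of the run command could look roughly like the sketch below. It is not the code in cli.py: `run_command`, `compile_fn`, and `run_fn` are hypothetical names, and treating any string or path as an already compiled spec is an assumption.

```python
from pathlib import Path
from typing import Any, Callable, Union


def run_command(
    ref: Union[str, Path, Any],
    compile_fn: Callable[[Any, str], None],
    run_fn: Callable[[str], Any],
    output_path: str = "spec.json",
) -> Any:
    """Run a compiled spec directly, or compile a pipeline object first."""
    if isinstance(ref, (str, Path)):
        # Assumption: a path means the spec was compiled earlier.
        spec_path = str(ref)
    else:
        # Anything else is treated as a pipeline object that still needs compiling.
        compile_fn(ref, output_path)
        spec_path = output_path
    return run_fn(spec_path)
```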

Directly

## your pipeline code here
if __name__ == "__main__":
    from fondant.pipeline.compiler import SagemakerCompiler

    compiler = SagemakerCompiler()
    compiler.compile(pipeline, output_path="spec.json")

    from fondant.pipeline.runner import SagemakerRunner

    runner = SagemakerRunner()
    runner.run(input_spec="spec.json")

Remarks with the current implementation

While the split between compiling and running might be nice in some (advanced) cases, in most cases it is an extra step that could be avoided.
Both the compiler and the runner require framework-specific imports. We only import these when actually initialising the classes, but we have to do it in 2 places.
The current Sagemaker implementation doesn't cleanly fit the paradigm of compile first and run later. We already need credentials and the correct access even for compiling. It also already creates artifacts during compilation.
There is no standard way of converting a fondant pipeline to a specification (every compiler does it differently).
Framework-specific arguments are part of either the compiler or the runner.
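
To illustrate the deferred-import remark, a shared helper along these lines would let both classes resolve optional dependencies in one place. `resolve_import` and the `extra` name are hypothetical; only `importlib.import_module` is a real standard-library call.

```python
import importlib
from types import ModuleType


def resolve_import(module_name: str, extra: str) -> ModuleType:
    """Import an optional framework dependency only when it is needed."""
    try:
        return importlib.import_module(module_name)
    except ImportError as exc:
        # Point the user at the optional extra instead of a bare import failure.
        raise ImportError(
            f"'{module_name}' is required for this framework. "
            f"Install it with the '{extra}' extra."
        ) from exc
```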

GeorgesLorre commented 11 months ago

Proposition

We move most of the compiler logic into the runner and invest in a generic compiler:

This compiler is responsible for:

validation of the pipeline
adding metadata (cache key)
saving the fondant pipeline yaml to a file/spec

class Compiler:

    def __init__(self, pipeline: Pipeline):
        self.pipeline = pipeline

    def add_metadata(self, component):
        # add metadata for the given component
        # including the cache key
        ...

    def validate_pipeline(self):
        # run validation checks on the pipeline
        # - check for duplicate names
        # - check if input/output match
        # - validate dependencies
        # - ...
        self.pipeline.validate()

    def compile(self, output_path: str):
        # compile the pipeline to a file
        self.validate_pipeline()

        # the saved spec follows a fixed format describing a fondant version
        for component in self.pipeline:
            self.add_metadata(component)

        self.pipeline.save(output_path)

The runner contains all framework-specific logic and takes a fondant pipeline spec as input.

This runner is not generic and needs to be implemented per framework. It is responsible for:

importing the extra packages
running a pipeline (and possibly compiling first)
handling pipeline versions (update vs create)
logging the pipeline status

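class Runner:
    def __init__(self):
        self._resolve_imports()

    def _resolve_imports(self):
        # import lazily so the framework is only required when this runner is used
        import framework  # placeholder for the framework's SDK
        self.framework = framework

    def compile(self, pipeline_spec: str) -> str:
        # take the pipeline spec and create
        # a spec related to the framework
        # this can be saved to a file
        # returns a reference to the framework spec
        ...

    def run(self, ref: str):
        # run the pipeline using the framework
        # can compile and run based on the type of input:
        # ref is either a pipeline spec or a framework spec

        if self._is_pipeline_spec(ref):  # hypothetical helper, not defined here
            framework_spec = self.compile(ref)
        else:
            framework_spec = ref

        self.framework.run(framework_spec)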

GeorgesLorre commented 11 months ago

The fondant pipeline spec

Having a representation (YAML, JSON) of a fondant pipeline would be a nice upgrade, and it would help with creating reusable pipelines. I'm not quite sure yet what the format should be (maybe IR YAML, since it aims to be agnostic).

PhilippeMoussalli commented 11 months ago

I think this makes sense.

1) Some runners have very specific settings that are set during the framework's compile step, mainly Docker for now: extra_volumes, build_args. Would you then move those arguments to the runner instead? I think this would not be a big issue, since that's what we currently do when we run fondant run local --extra-volumes <volume>.

2) Would we then produce two specs every time we run a pipeline?

RobbeSneyders commented 11 months ago

For me it's not clear yet what we want exactly from the compiling:

Compiling to a Fondant-specific pipeline format could be nice since it allows you to track and promote pipelines across environments etc. But it will also require us to provide all the tooling to do this. Choosing an existing standard (as far as one is available, eg. IR YAML) could alleviate this at least partially.
Compiling to a framework-specific pipeline format allows using that framework's tools instead. Eg. Vertex pipelines can be stored in the GCP artifact registry. However for Sagemaker this leads to issues as you describe above. Note that Vertex pipelines are IR YAML, so we could still benefit from the artifact registry if we use that as a fondant-specific format.

The only work I would do now is to hide the compiler from the user. This is already the case for the CLI, but I would align the SDK so that a user only needs to create and use a Runner. I would also hide the compiled component spec, either by deleting it again or by storing it in a /tmp directory (or both).
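
A minimal sketch of what hiding the compiled spec could look like: compile into a temporary directory, run from there, and let the directory be deleted afterwards. `run_with_hidden_spec`, `compile_fn`, and `run_fn` are hypothetical names, not existing Fondant functions.

```python
import os
import tempfile
from typing import Any, Callable


def run_with_hidden_spec(
    pipeline: Any,
    compile_fn: Callable[[Any, str], None],
    run_fn: Callable[[str], Any],
) -> Any:
    """Compile to a temporary file, run it, and clean up afterwards."""
    with tempfile.TemporaryDirectory() as tmp_dir:
        spec_path = os.path.join(tmp_dir, "spec.json")
        compile_fn(pipeline, spec_path)
        # The spec only exists while the runner needs it.
        return run_fn(spec_path)
```

Note that this only works if the runner has finished reading the spec before `run_fn` returns; a runner that submits the file asynchronously would need a different lifetime.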

GeorgesLorre commented 11 months ago

In the interest of keeping this change manageable and avoiding premature optimisation, I would indeed not go for a Fondant-specific pipeline spec (yet). Once we have a better idea of what we need from a spec like this, we can still introduce it.

The compiler will now be quite empty: just some validation and maybe some logic to handle metadata and cache keys. But as Robbe suggested, if we can already make the framework runners nicer to use directly (without the CLI), that is already a win. We can then also implement the Sagemaker runner more cleanly (since a Sagemaker compiler does not make a lot of sense).

So for now the runner will take a Fondant Pipeline object as input and use some methods provided by the compiler.

RobbeSneyders commented 11 months ago

I think there are more things that compilation should do. The main thing is that it should freeze the pipeline definition, which means the composition, the arguments, the versions of the components, ...
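
One way to read "freezing" is sketched below: snapshot the definition, refuse unpinned component versions, and attach a digest so the frozen pipeline can be identified later. The dict layout, the `version` field, and `freeze_pipeline` itself are assumptions for illustration, not an existing Fondant format.

```python
import hashlib
import json
from typing import Any, Dict


def freeze_pipeline(pipeline: Dict[str, Any]) -> Dict[str, Any]:
    """Return an immutable-by-copy snapshot of a pipeline plus its digest."""
    for component in pipeline["components"]:
        # Assumption: a missing or "latest" version is not reproducible.
        if component.get("version") in (None, "latest"):
            raise ValueError(f"component '{component['name']}' has no pinned version")
    # Canonical JSON gives a stable digest and a deep copy in one step.
    canonical = json.dumps(pipeline, sort_keys=True)
    return {
        "definition": json.loads(canonical),
        "digest": hashlib.sha256(canonical.encode()).hexdigest(),
    }
```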

I believe we should either implement a minimal compiler which does all these things (so it won't be very minimal), or we should postpone any work on this and just remove the current Compiler from the public interface so we can change it freely in the future.

ml6team / fondant

Revisit compiler VS runner functionality #627