ml6team / fondant

Production-ready data processing made easy and shareable
https://fondant.ai/en/stable/
Apache License 2.0

Revisit compiler VS runner functionality #627

Closed GeorgesLorre closed 10 months ago

GeorgesLorre commented 11 months ago

Running a Fondant pipeline on one of our supported runners is currently a two-step process:

Compile

located at /src/fondant/pipeline/compiler.py

We provide an abstract Compiler class that is reimplemented for each framework. Its main inputs are a Fondant Pipeline object and some framework-specific arguments. The compiler will/should:

  1. Validate the pipeline it receives. This is still quite basic but can be improved.
  2. Set up the cache keys and metadata.
  3. Apply framework-specific configuration.
  4. Convert the pipeline to a representation that the framework understands (docker-compose.yml, IR YAML, SageMaker spec).
  5. Save the representation to a local file.
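
The responsibilities listed above can be sketched as a template method. This is only an illustration under stated assumptions, not Fondant's actual implementation: `SketchCompiler`, `ComposeLikeCompiler`, and the plain-dict pipeline are hypothetical stand-ins, and the cache key here is simply a hash of the component definition.

```python
import hashlib
import json
from abc import ABC, abstractmethod


class SketchCompiler(ABC):
    """Hypothetical template for a per-framework compiler (not Fondant's API)."""

    def compile(self, pipeline: dict, output_path: str) -> None:
        self.validate(pipeline)                   # validate the pipeline
        pipeline = self.add_metadata(pipeline)    # set up cache keys and metadata
        spec = self.to_framework_spec(pipeline)   # framework-specific conversion
        with open(output_path, "w") as f:         # save to a local file
            json.dump(spec, f, indent=2)

    def validate(self, pipeline: dict) -> None:
        # A deliberately basic check: component names must be unique.
        names = [c["name"] for c in pipeline["components"]]
        if len(names) != len(set(names)):
            raise ValueError("duplicate component names")

    def add_metadata(self, pipeline: dict) -> dict:
        # Assumption: the cache key is a hash of the component definition.
        components = []
        for c in pipeline["components"]:
            payload = json.dumps(c, sort_keys=True).encode()
            components.append({**c, "cache_key": hashlib.sha256(payload).hexdigest()})
        return {**pipeline, "components": components}

    @abstractmethod
    def to_framework_spec(self, pipeline: dict) -> dict:
        """Apply framework-specific configuration and convert the pipeline."""


class ComposeLikeCompiler(SketchCompiler):
    """Hypothetical subclass emitting a docker-compose-like structure."""

    def to_framework_spec(self, pipeline: dict) -> dict:
        return {
            "services": {
                c["name"]: {"image": c["image"], "labels": {"cache_key": c["cache_key"]}}
                for c in pipeline["components"]
            }
        }
```

Only `to_framework_spec` differs per framework in this sketch; validation, metadata, and saving are shared.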

Run

located at /src/fondant/pipeline/runner.py

Again, there is an abstract Runner class to be implemented for each framework. It takes a file as input: the output of the compiler for the same framework. The runner is responsible for the following:

  1. Initialising the framework's SDK
  2. Setting up a client (this can be with custom credentials, etc.)
  3. Submitting the spec
  4. Handling the logic to create/update pipelines
  5. Following up on progress or generating a URL to the framework UI
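
As a rough illustration of the runner responsibilities listed above, here is a minimal sketch. Everything in it is hypothetical: `FakeClient` stands in for a framework SDK client, `SketchRunner` and `FakeRunner` are not Fondant classes, and the returned URL is a placeholder.

```python
import json
from abc import ABC, abstractmethod


class FakeClient:
    """Hypothetical in-memory stand-in for a framework SDK client."""

    def __init__(self):
        self.pipelines = {}

    def upsert(self, name: str, spec: dict) -> str:
        # Create the pipeline if it is new, otherwise update it.
        action = "updated" if name in self.pipelines else "created"
        self.pipelines[name] = spec
        return action


class SketchRunner(ABC):
    """Hypothetical template for a per-framework runner."""

    def run(self, input_spec: str) -> str:
        with open(input_spec) as f:
            spec = json.load(f)
        client = self.create_client()      # initialise the SDK and set up a client
        return self.submit(client, spec)   # submit, create/update, report back

    @abstractmethod
    def create_client(self):
        """Initialise the framework SDK and return a client."""

    @abstractmethod
    def submit(self, client, spec: dict) -> str:
        """Submit the spec and return something to follow progress with."""


class FakeRunner(SketchRunner):
    def __init__(self, client=None):
        self._client = client if client is not None else FakeClient()

    def create_client(self):
        return self._client

    def submit(self, client, spec: dict) -> str:
        action = client.upsert(spec["name"], spec)
        # Placeholder URL; a real runner would point to the framework UI.
        return f"https://example.invalid/pipelines/{spec['name']}#{action}"
```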

The compiler/runner can be invoked in one of two ways:

CLI

located at /src/fondant/cli.py

The CLI makes it really easy to run a Fondant pipeline on a framework. There are separate commands for compile and run, which just call the above. Note that the run CLI command can execute a compile before running, depending on whether it received a Fondant Pipeline or an already compiled spec.
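
The "compile first if needed" behaviour of the run command could look roughly like the sketch below. It is not taken from cli.py: `resolve_spec` and `compile_fn` are hypothetical names, and the sketch assumes that an already compiled spec is recognised simply by being a path to an existing file.

```python
from pathlib import Path


def resolve_spec(ref, compile_fn, output_path="spec.json"):
    """Return the path of a compiled spec, compiling first when needed.

    Assumption: `ref` is either a path to an already compiled spec
    or a pipeline object that `compile_fn(pipeline, output_path)` can compile.
    """
    if isinstance(ref, (str, Path)) and Path(ref).is_file():
        # Already compiled: run it as is.
        return str(ref)
    # Anything else is treated as a pipeline and compiled first.
    compile_fn(ref, output_path)
    return output_path
```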

Directly

## your pipeline code here
if __name__ == "__main__":
    from fondant.pipeline.compiler import SagemakerCompiler

    compiler = SagemakerCompiler()
    compiler.compile(pipeline, output_path="spec.json")

    from fondant.pipeline.runner import SagemakerRunner

    runner = SagemakerRunner()
    runner.run(input_spec="spec.json")

Remarks with the current implementation

GeorgesLorre commented 11 months ago

Proposition

  1. We move most of the compiler logic into the runner and invest in a generic compiler:

This compiler is responsible for:

from fondant.pipeline import Pipeline


class Compiler:

    def __init__(self, pipeline: Pipeline):
        self.pipeline = pipeline

    def add_metadata(self, component):
        # add metadata for the given component
        # including the cache key
        ...

    def validate_pipeline(self):
        # run validation checks on the pipeline
        # - check for duplicate names
        # - check if input/output match
        # - validate dependencies
        # - ...
        self.pipeline.validate()

    def compile(self, output_path: str):
        # compile the pipeline to a file
        self.validate_pipeline()

        spec = {}  # fixed spec describing a fondant version
        for component in self.pipeline:
            self.add_metadata(component)

        self.pipeline.save(output_path)

  2. The runner contains all framework-specific logic and takes a Fondant pipeline spec as input:

This runner is not generic and needs to be implemented per framework. It is responsible for:

class Runner:
    def __init__(self):
        self._resolve_imports()

    def _resolve_imports(self):
        # import the framework SDK lazily, so it is only
        # required when this runner is actually used
        import framework  # placeholder for the framework's SDK

        self.framework = framework

    def compile(self, pipeline_spec: str) -> str:
        # take the pipeline spec and create
        # a spec related to the framework
        # this can be saved to a file
        raise NotImplementedError

    def run(self, pipeline_spec: str = None, framework_spec: str = None):
        # run the pipeline using the framework
        # can compile and run based on the type of input

        if pipeline_spec:
            framework_spec = self.compile(pipeline_spec)

        self.framework.run(framework_spec)

GeorgesLorre commented 11 months ago

The fondant pipeline spec

Having a representation (YAML, JSON) of a Fondant pipeline would be a nice upgrade, and it would help with creating reusable pipelines. I'm not quite sure yet what the format should be (maybe IR YAML, since it aims to be agnostic).

PhilippeMoussalli commented 11 months ago

I think this makes sense.

1) Some runners have very specific settings that are set during the framework's compile step, mainly Docker for now: extra_volumes, build_args. Would you then move those arguments to the runner instead? I think this would not be a big issue, since that's what we currently do when we run fondant run local --extra-volumes <volume>.

2) Would we then produce two specs every time we run a pipeline?

RobbeSneyders commented 11 months ago

For me it's not clear yet what we want exactly from the compiling:

The only work I would do now is to hide the compiler from the user. This is already the case for the CLI, but I would align the SDK, so a user only needs to create and use a Runner. I would also hide the compiled component spec, either by deleting it again or by storing it in a /tmp directory (or both).
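
To make that concrete, here is a minimal sketch of a runner that compiles internally into a temporary directory and removes the spec afterwards. `HiddenCompileRunner`, `compile_fn`, and `submit_fn` are hypothetical names used only for illustration; they are not part of Fondant.

```python
import os
import tempfile


class HiddenCompileRunner:
    """Hypothetical runner that hides compilation from the user."""

    def __init__(self, compile_fn, submit_fn):
        # compile_fn(pipeline, output_path) writes the framework spec to disk.
        # submit_fn(spec_path) hands the spec to the framework.
        self._compile = compile_fn
        self._submit = submit_fn

    def run(self, pipeline):
        # The spec only lives inside a temporary directory, which is
        # deleted when the block exits, even if submission fails.
        with tempfile.TemporaryDirectory() as tmp:
            spec_path = os.path.join(tmp, "spec.json")
            self._compile(pipeline, spec_path)
            return self._submit(spec_path)
```

With this shape the user only creates a runner and calls `run(pipeline)`; the compiled spec never appears in the working directory.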

GeorgesLorre commented 11 months ago

In the interest of keeping this change manageable and avoiding premature optimisation, I would indeed not go for a Fondant-specific pipeline spec (yet). Once we have a better idea of what we need from a spec like this, we can still introduce it.

The compiler will now be quite empty: just some validation and maybe some logic to handle metadata and cache keys. But as Robbe suggested, if we can already make the framework runners nicer to use directly (without the CLI), that is already a win. We can then also implement the SageMaker runner more cleanly (since a SageMaker compiler does not make a lot of sense).

So for now the runner will take a Fondant Pipeline object as input and use some methods provided by the compiler.

RobbeSneyders commented 11 months ago

I think there are more things that compilation should do. The main thing is that it should freeze the pipeline definition, which means the composition, the arguments, the versions of the components, ...
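
One way to picture "freezing" is a canonical snapshot plus a digest, as in the sketch below. This is an assumption about what freezing could mean, not an agreed design: `freeze` and the dict layout (`name`, `image`, `arguments`, `depends_on`) are hypothetical, with the image tag standing in for the component version.

```python
import hashlib
import json


def freeze(pipeline: dict) -> dict:
    """Return an immutable-by-copy snapshot of a pipeline and its digest."""
    definition = {
        "components": [
            {
                "name": c["name"],
                "image": c["image"],  # pinned version, e.g. "embed:1.2.0"
                "arguments": c.get("arguments", {}),
                "depends_on": sorted(c.get("depends_on", [])),  # composition
            }
            for c in pipeline["components"]
        ]
    }
    # Canonical JSON: sorted keys and fixed separators make the digest
    # independent of dict ordering.
    canonical = json.dumps(definition, sort_keys=True, separators=(",", ":"))
    return {
        # Round-tripping through JSON yields a deep copy, so later changes
        # to the original pipeline do not leak into the snapshot.
        "definition": json.loads(canonical),
        "digest": hashlib.sha256(canonical.encode()).hexdigest(),
    }
```

Any change to the composition, arguments, or component versions then produces a different digest.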

I believe we should either implement a minimal compiler which does all these things (so it won't be very minimal), or we should postpone any work on this and just remove the current Compiler from the public interface so we can change it freely in the future.