ml6team / fondant

Production-ready data processing made easy and shareable
https://fondant.ai/en/stable/
Apache License 2.0
339 stars 26 forks

Create a local pipeline runner leveraging Docker #181

Closed PhilippeMoussalli closed 1 year ago

PhilippeMoussalli commented 1 year ago

Problem Statement

For ease of development and getting started with Fondant, having a way to easily run a toy pipeline locally would be beneficial. Currently we have to start a pipeline on Kubeflow to debug it, or run it locally with an awkward script.

The current implementation of the Pipeline, Client and ComponentOp is quite Kubeflow-focused, so we will need to abstract some things to make this possible.

ComponentOp

The ComponentOp (component operation) represents the component_spec plus the runtime configuration (hardware specs + arguments).
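
Conceptually it could look something like the sketch below (attribute and method names are taken from the pseudocode further down, not from the actual class definition):

from dataclasses import dataclass, field
from typing import Any, Dict, Optional

@dataclass
class ComponentOp:
    """A component spec plus its runtime configuration (hardware specs + arguments)."""
    component_spec_path: str
    arguments: Dict[str, Any] = field(default_factory=dict)
    number_of_gpus: Optional[int] = None    # hardware spec
    node_pool_name: Optional[str] = None    # placement hint for the runner

    @classmethod
    def from_registry(cls, name: str, arguments: Optional[Dict[str, Any]] = None) -> "ComponentOp":
        # resolve a reusable component shipped with fondant by name (sketch only)
        ...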

Pipeline

You can register components on the pipeline with the needed dependencies to create a graph. The pipeline has all the logic to resolve the graph and validate the components (do the inputs and outputs match?).
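
For illustration, a minimal sketch of what that graph resolution and validation could look like (the spec.consumes / spec.produces attributes and the dict-based bookkeeping are assumptions, not the actual Fondant internals):

def resolve_and_validate(ops, dependencies):
    # ops: name -> ComponentOp, dependencies: name -> list of upstream names (assumed shapes)
    order, produced = [], set()
    remaining = dict(ops)
    while remaining:
        # pick every component whose dependencies have already been resolved
        ready = [name for name in remaining
                 if all(dep in order for dep in dependencies.get(name, ()))]
        if not ready:
            raise ValueError("pipeline graph has a cycle or a missing dependency")
        for name in ready:
            op = remaining.pop(name)
            # validate that everything this component consumes is produced upstream
            missing = set(op.spec.consumes) - produced
            if missing and dependencies.get(name):
                raise ValueError(f"{name} consumes fields no upstream component produces: {missing}")
            produced.update(op.spec.produces)
            order.append(name)
    return order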

Client

In order to run the pipeline you need a client that knows how to compile and submit the pipeline (currently this is Kubeflow only).

Proposed Approach

We remove all references to k8s and Kubeflow from ComponentOp and the Pipeline, and we make the client responsible for interpreting and running the graph defined in the pipeline. That way, by switching clients, we can reuse pipelines.


#PSEUDOCODE

import argparse

from fondant.pipeline import ComponentOp, KubeflowClient, LocalClient, Pipeline

def build_pipeline():
    pipeline = Pipeline(pipeline_name="example pipeline", base_path="fs://bucket")

    load_from_hub_op = ComponentOp.from_registry(
        name="load_from_hf_hub",
        arguments={"dataset_name": "lambdalabs/pokemon-blip-captions"},
    )
    pipeline.add_op(load_from_hub_op)

    caption_images_op = ComponentOp(
        component_spec_path="components/captioning_component/fondant_component.yaml",
        arguments={
            "batch_size": 2,
            "max_new_tokens": 50,
        },
        number_of_gpus=1,
        node_pool_name="model-inference-pool",
    )
    pipeline.add_op(caption_images_op, dependencies=load_from_hub_op)
    return pipeline

if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="dataflow_boilerplate")
    parser.add_argument(
        "-m",
        "--mode",
        help="Mode to run pipeline in.",
        choices=["local", "kubeflow"],
        default="local",
    )
    args = parser.parse_args()
    if args.mode == "local":
        client = LocalClient()
    else:
        client = KubeflowClient(host="https://kfp-host.com/")
    pipeline = build_pipeline()
    client.compile_and_run(pipeline=pipeline)
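
The client interface implied by this pseudocode could look roughly like the sketch below (names are illustrative, not a committed API); every runner implements the same compile_and_run contract, so the pipeline definition itself stays backend-agnostic:

from abc import ABC, abstractmethod

class Client(ABC):
    """Knows how to compile a Pipeline graph and run it on one specific backend."""

    @abstractmethod
    def compile_and_run(self, pipeline) -> None:
        ...

class LocalClient(Client):
    def compile_and_run(self, pipeline):
        # run every component locally, e.g. via Docker or plain Python
        ...

class KubeflowClient(Client):
    def __init__(self, host: str):
        self.host = host

    def compile_and_run(self, pipeline):
        # compile the graph to a Kubeflow pipeline spec and submit it to `host`
        ...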

I see two ways of achieving this. They are not mutually exclusive; they just operate on a different level:

Docker

All components are Docker images that contain all the needed libraries and code to run the component, so we could start every component of a pipeline as a Docker container. We can control the IO by leveraging volume mounts, and we could even apply hardware specifications (https://docs.docker.com/config/containers/resource_constraints/). There is a Python library built by Docker that we could leverage to manage the running containers (https://github.com/docker/docker-py).


#PSEUDOCODE

import docker

class DockerClient(Client):
    def __init__(self, storage_dir: str):
        self.client = docker.from_env()
        self.storage_dir = storage_dir

    def compile_and_run(self, pipeline):
        # shared volume that the components use to pass data between each other
        run_volume = self.client.volumes.create(name="foobar")

        for component in pipeline.components():
            # TODO: add some logging
            container = self.client.containers.run(
                component.spec.image,
                "python main.py",
                volumes={run_volume.name: {"bind": self.storage_dir, "mode": "rw"}},
                detach=True,  # needed to get a Container object back and stream its logs
            )
            for line in container.logs(stream=True):
                # TODO: do some error catching ?
                print(line.strip())
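
The hardware specification on the ComponentOp (e.g. number_of_gpus) could then be translated into Docker resource constraints when starting the container. A rough sketch using docker-py (the attribute names and the concrete limits are illustrative):

import docker

client = docker.from_env()

device_requests = []
if component.number_of_gpus:
    # GPU access requires the NVIDIA container runtime on the host
    device_requests.append(
        docker.types.DeviceRequest(count=component.number_of_gpus, capabilities=[["gpu"]])
    )

container = client.containers.run(
    component.spec.image,
    "python main.py",
    mem_limit="8g",            # hard memory cap for this component
    nano_cpus=2_000_000_000,   # 2 CPUs, expressed in units of 1e-9 CPUs
    device_requests=device_requests,
    detach=True,
)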

Pros

Cons

Plain Python

Similar to the script we had before, where we run the main.py of every component in sequence and pass the paths along.


#PSEUDOCODE

import importlib

class DirectClient(Client):
    def __init__(self):
        pass
    def compile_and_run(self, pipeline):
        # we'll assume the image name is the folder name and the calling file is next to the components
        # .
        # ├── README.md
        # ├── build_images.sh
        # ├── components
        # │   ├── caption_images
        # │   │   └── src
        # │   │       └── main.py
        # │   ├── download_images
        # │   │   └── src
        # │   │       └── main.py
        # │   ├── generate_prompts
        # │   │   └── src
        # │   │       └── main.py
        # └── pipeline.py

        # probably need some arg setup for the manifest paths and component arguments
        for step in pipeline.components():
            component = importlib.import_module(f"components.{step.name}.src.main")
            component.run()

Pros

---> We chose to start with the Docker implementation since it seems to offer more benefits while not being that much more complex.

Implementation Steps/Tasks

### Tasks
- [ ] #184
- [ ] https://github.com/ml6team/fondant/issues/185
- [ ] https://github.com/ml6team/fondant/issues/186
- [ ] https://github.com/ml6team/fondant/issues/187
- [ ] https://github.com/ml6team/fondant/issues/188
- [ ] https://github.com/ml6team/fondant/issues/189
- [ ] https://github.com/ml6team/fondant/issues/190

Potential Impact

Testing

Testing the pipeline compiler should be straightforward:

Testing the runner could be more tricky:

Documentation

The local runner should be included in all documentation and should be promoted as a way to get started easily.

PhilippeMoussalli commented 1 year ago

Converted to issue so I can comment :)

Thanks @GeorgesLorre for the detailed description!

In order to run the pipeline you need a client that knows how to compile and submit the pipeline (currently this is Kubeflow only). We remove all references to k8s and Kubeflow from ComponentOp and the Pipeline, and we make the client responsible for interpreting and running the graph defined in the pipeline.

Docker approach

I think I prefer this approach since it replicates more closely how components will actually be run.

Python approach

I like this one better, but indeed you might end up with multiple environments and requirements. It might not be the most error-proof solution, but it's the simplest.

You mentioned that one of the pros is that it's easy to run partial pipelines or single components, but I think we can achieve the same thing with the Docker approach.

RobbeSneyders commented 1 year ago

I think Docker is a reasonable dependency and that the Docker approach will be the best choice in the long term, as it is the most robust. It's also the most complex approach though, and the Python approach has the big benefit that we can move forward with it quickly.

I would therefore propose to start with the Python approach as a quick fix and work towards the Docker-based approach in the longer term.

Some feedback on the cons:

  • we expect a fixed folder structure

This is already the case since the command run by Kubeflow is hardcoded in Fondant.

  • we'll need an __init__.py in every folder

We should execute the scripts via subprocess, which is how they will be run in the pipeline as well. Then we don't need an __init__.py file.

  • how to handle symlinked components

We shouldn't need symlinks anymore. We do need to figure out how to run reusable components. Since they are packaged with fondant, we can still reference the main.py locally.

  • Need all requirements from all components or multiple environments

This is indeed the biggest remaining downside, especially for the reusable components. Maybe fondant can automatically install the requirements in separate environments.
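
To make that concrete, here is a rough sketch (not a committed design) that combines the subprocess idea above with per-component environments: create a virtual environment for each component, install its requirements.txt, and run its main.py in a separate process. The folder layout and the requirements.txt convention are assumptions:

import subprocess
import venv
from pathlib import Path

def run_component(component_dir: str) -> None:
    """Run one component's main.py in its own virtual environment via a subprocess."""
    component_dir = Path(component_dir)
    env_dir = component_dir / ".venv"

    # isolated environment per component so requirements don't clash
    venv.EnvBuilder(with_pip=True).create(env_dir)
    python = env_dir / "bin" / "python"  # on Windows: env_dir / "Scripts" / "python.exe"

    requirements = component_dir / "requirements.txt"
    if requirements.exists():
        subprocess.run([str(python), "-m", "pip", "install", "-r", str(requirements)], check=True)

    # no __init__.py files needed since we execute the script in a fresh process
    subprocess.run([str(python), str(component_dir / "src" / "main.py")], check=True)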