ml6team / fondant

Production-ready data processing made easy and shareable
https://fondant.ai/en/stable/
Apache License 2.0
339 stars 26 forks

Create a local pipeline runner leveraging Docker #181

Closed PhilippeMoussalli closed 1 year ago

PhilippeMoussalli commented 1 year ago

Problem Statement

For ease of development and getting started with Fondant, having a way to easily run a toy pipeline locally would be beneficial. Currently we have to start a pipeline on Kubeflow to debug it, or run it locally with an awkward script.

The current implementation of the Pipeline, Client and ComponentOp is quite Kubeflow-focused, so we will need to abstract some things to make this possible.

ComponentOp

The ComponentOp (component operation) represents the component_spec plus the runtime configuration (hardware specs + arguments).
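
Conceptually it could look something like the sketch below (attribute and method names are taken from the pseudocode further down, not from the actual class definition):

from dataclasses import dataclass, field
from typing import Any, Dict, Optional

@dataclass
class ComponentOp:
    """A component spec plus its runtime configuration (hardware specs + arguments)."""
    component_spec_path: str
    arguments: Dict[str, Any] = field(default_factory=dict)
    number_of_gpus: Optional[int] = None    # hardware spec
    node_pool_name: Optional[str] = None    # placement hint for the runner

    @classmethod
    def from_registry(cls, name: str, arguments: Optional[Dict[str, Any]] = None) -> "ComponentOp":
        # resolve a reusable component shipped with fondant by name (sketch only)
        ...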

Pipeline

You can register components on the pipeline with the needed dependencies to create a graph. The pipeline has all the logic to resolve the graph and validate the components (do the inputs and outputs match?).
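
For illustration, a minimal sketch of what that graph resolution and validation could look like (the spec.consumes / spec.produces attributes and the dict-based bookkeeping are assumptions, not the actual Fondant internals):

def resolve_and_validate(ops, dependencies):
    # ops: name -> ComponentOp, dependencies: name -> list of upstream names (assumed shapes)
    order, produced = [], set()
    remaining = dict(ops)
    while remaining:
        # pick every component whose dependencies have already been resolved
        ready = [name for name in remaining
                 if all(dep in order for dep in dependencies.get(name, ()))]
        if not ready:
            raise ValueError("pipeline graph has a cycle or a missing dependency")
        for name in ready:
            op = remaining.pop(name)
            # validate that everything this component consumes is produced upstream
            missing = set(op.spec.consumes) - produced
            if missing and dependencies.get(name):
                raise ValueError(f"{name} consumes fields no upstream component produces: {missing}")
            produced.update(op.spec.produces)
            order.append(name)
    return order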

Client

In order to run the pipeline you need a client that knows how to compile and submit the pipeline (currently this is Kubeflow only).

Proposed Approach

We remove all references to k8s and Kubeflow from ComponentOp and the Pipeline, and we make the client responsible for interpreting and running the graph defined in the pipeline. That way, by switching clients, we can reuse pipelines.


#PSEUDOCODE

import argparse

from fondant.pipeline import ComponentOp, KubeflowClient, LocalClient, Pipeline

def build_pipeline():
    pipeline = Pipeline(pipeline_name="example pipeline", base_path="fs://bucket")

    load_from_hub_op = ComponentOp.from_registry(
        name="load_from_hf_hub",
        arguments={"dataset_name": "lambdalabs/pokemon-blip-captions"},
    )
    pipeline.add_op(load_from_hub_op)

    caption_images_op = ComponentOp(
        component_spec_path="components/captioning_component/fondant_component.yaml",
        arguments={
            "batch_size": 2,
            "max_new_tokens": 50,
        },
        number_of_gpus=1,
        node_pool_name="model-inference-pool",
    )
    pipeline.add_op(caption_images_op, dependencies=load_from_hub_op)
    return pipeline

if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="dataflow_boilerplate")
    parser.add_argument(
        "-m",
        "--mode",
        help="Mode to run pipeline in.",
        choices=["local", "kubeflow"],
        default="local",
    )
    args = parser.parse_args()
    if args.mode == "local":
        client = LocalClient()
    else:
        client = KubeflowClient(host="https://kfp-host.com/")
    pipeline = build_pipeline()
    client.compile_and_run(pipeline=pipeline)
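
The client interface implied by this pseudocode could look roughly like the sketch below (names are illustrative, not a committed API); every runner implements the same compile_and_run contract, so the pipeline definition itself stays backend-agnostic:

from abc import ABC, abstractmethod

class Client(ABC):
    """Knows how to compile a Pipeline graph and run it on one specific backend."""

    @abstractmethod
    def compile_and_run(self, pipeline) -> None:
        ...

class LocalClient(Client):
    def compile_and_run(self, pipeline):
        # run every component locally, e.g. via Docker or plain Python
        ...

class KubeflowClient(Client):
    def __init__(self, host: str):
        self.host = host

    def compile_and_run(self, pipeline):
        # compile the graph to a Kubeflow pipeline spec and submit it to `host`
        ...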

I see two ways of achieving this. They are not mutually exclusive; they just operate on a different level:

Docker

All components are Docker images that contain all the needed libraries and code to run the component, so we could start every component of a pipeline as a Docker container. We can control the IO by leveraging volume mounts, and we could even apply hardware specifications (https://docs.docker.com/config/containers/resource_constraints/). There is a Python library built by Docker that we could leverage to manage the running containers (https://github.com/docker/docker-py).


#PSEUDOCODE

import docker

class DockerClient(Client):
    def __init__(self, storage_dir: str):
        self.client = docker.from_env()
        self.storage_dir = storage_dir

    def compile_and_run(self, pipeline):
        # shared volume that the components use to pass data between each other
        run_volume = self.client.volumes.create(name="foobar")

        for component in pipeline.components():
            # TODO: add some logging
            container = self.client.containers.run(
                component.spec.image,
                "python main.py",
                volumes={run_volume.name: {"bind": self.storage_dir, "mode": "rw"}},
                detach=True,  # needed to get a Container object back and stream its logs
            )
            for line in container.logs(stream=True):
                # TODO: do some error catching ?
                print(line.strip())
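
The hardware specification on the ComponentOp (e.g. number_of_gpus) could then be translated into Docker resource constraints when starting the container. A rough sketch using docker-py (the attribute names and the concrete limits are illustrative):

import docker

client = docker.from_env()

device_requests = []
if component.number_of_gpus:
    # GPU access requires the NVIDIA container runtime on the host
    device_requests.append(
        docker.types.DeviceRequest(count=component.number_of_gpus, capabilities=[["gpu"]])
    )

container = client.containers.run(
    component.spec.image,
    "python main.py",
    mem_limit="8g",            # hard memory cap for this component
    nano_cpus=2_000_000_000,   # 2 CPUs, expressed in units of 1e-9 CPUs
    device_requests=device_requests,
    detach=True,
)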

Pros

Cons

Plain Python

Similar to the script we had before, where we run the main.py of every component in sequence and pass the paths along.


#PSEUDOCODE

import importlib

class DirectClient(Client):
    def __init__(self):
        pass
    def compile_and_run(self, pipeline):
        # we'll assume the image name is the folder name and the calling file is next to the components
        # .
        # ├── README.md
        # ├── build_images.sh
        # ├── components
        # │   ├── caption_images
        # │   │   └── src
        # │   │       └── main.py
        # │   ├── download_images
        # │   │   └── src
        # │   │       └── main.py
        # │   ├── generate_prompts
        # │   │   └── src
        # │   │       └── main.py
        # └── pipeline.py

        # probably need some arg setup for the manifest paths and component arguments
        for step in pipeline.components():
            component = importlib.import_module(f"components.{step.name}.src.main")
            component.run()

Pros

---> We chose to start with the Docker implementation since it seems to offer more benefits while not being that much more complex.

Implementation Steps/Tasks

### Tasks
- [ ] #184
- [ ] https://github.com/ml6team/fondant/issues/185
- [ ] https://github.com/ml6team/fondant/issues/186
- [ ] https://github.com/ml6team/fondant/issues/187
- [ ] https://github.com/ml6team/fondant/issues/188
- [ ] https://github.com/ml6team/fondant/issues/189
- [ ] https://github.com/ml6team/fondant/issues/190

Potential Impact

Testing

Testing the pipeline compiler should be straightforward:

Testing the runner could be more tricky:

Documentation

The local runner should be included in all documentation and should be promoted as a way to get started easily.

PhilippeMoussalli commented 1 year ago

Converted to issue so I can comment :)

Thanks @GeorgesLorre for the detailed description!

In order to run the pipeline you need a client that knows how to compile and submit the pipeline (currently this is Kubeflow only). We remove all references to k8s and Kubeflow from ComponentOp and the Pipeline, and we make the client responsible for interpreting and running the graph defined in the pipeline.

Docker approach

I think I prefer this approach since it replicates more closely how components will actually be run.

Python approach

I like this one better, but indeed you might end up with multiple environments and requirements. It might not be the most error-proof solution, but it's the simplest.

You mentioned that one of the pros is that it's easy to run partial pipelines or single components, but I think we can achieve the same thing with the Docker approach.

RobbeSneyders commented 1 year ago

I think Docker is a reasonable dependency and that the Docker approach will be the best choice in the long term, as it is the most robust. It's also the most complex approach though, and the Python approach has the big benefit that we can move forward with it quickly.

I would therefore propose to start with the Python approach as a quick fix and work towards the Docker-based approach in the longer term.

Some feedback on the cons:

  • we expect a fixed folder structure

This is already the case since the command run by Kubeflow is hardcoded in Fondant.

  • we'll need an __init__.py in every folder

We should execute the scripts via subprocess, which is how they will be run in the pipeline as well. Then we don't need an __init__.py file.

  • how to handle symlinked components

We shouldn't need symlinks anymore. We do need to figure out how to run reusable components. Since they are packaged with fondant, we can still reference the main.py locally.

  • Need all requirements from all components or multiple environments

This is indeed the biggest remaining downside, especially for the reusable components. Maybe fondant can automatically install the requirements in separate environments.
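
To make that concrete, here is a rough sketch (not a committed design) that combines the subprocess idea above with per-component environments: create a virtual environment for each component, install its requirements.txt, and run its main.py in a separate process. The folder layout and the requirements.txt convention are assumptions:

import subprocess
import venv
from pathlib import Path

def run_component(component_dir: str) -> None:
    """Run one component's main.py in its own virtual environment via a subprocess."""
    component_dir = Path(component_dir)
    env_dir = component_dir / ".venv"

    # isolated environment per component so requirements don't clash
    venv.EnvBuilder(with_pip=True).create(env_dir)
    python = env_dir / "bin" / "python"  # on Windows: env_dir / "Scripts" / "python.exe"

    requirements = component_dir / "requirements.txt"
    if requirements.exists():
        subprocess.run([str(python), "-m", "pip", "install", "-r", str(requirements)], check=True)

    # no __init__.py files needed since we execute the script in a fresh process
    subprocess.run([str(python), str(component_dir / "src" / "main.py")], check=True)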