Closed: PhilippeMoussalli closed this issue 1 year ago
Converted to issue so I can comment :)
Thanks @GeorgesLorre for the detailed description!
> in order to run the pipeline you need a client that knows how to compile and submit the pipeline (this is only kubeflow now). We remove all references to k8s and kubeflow from ComponentOp and The Pipeline and we make the client responsible of interpreting and running the graph defined in the pipeline
Docker approach
I think I prefer this approach since it replicates more closely how components will actually be run.
How do we coordinate container runs between components using the Docker client approach? Do we just run them sequentially? We need a way to detect when a component is finished and when the next one should start.
We might be able to avoid rebuilding images on every iteration if we volume mount the current version of the code into the image (a bit similar to DevContainers or `pip install -e`).
For hardware management, I think this will most likely be restricted to running a component with a GPU. In that case it's quite easy (we could even have it as the default). Not sure if we need to restrict the memory.
Python approach
I like this one better, but indeed you might end up with multiple environments and requirements. It might not be the most error-proof solution, but it's the simplest.
You mentioned that one of the pros is that it's easy to run partial pipelines or single components, but I think we can achieve the same thing with the Docker approach.
I think docker is a reasonable dependency, and the Docker approach will be the best choice in the long term as it is the most robust. It's also the most complex approach though, and the Python approach has the big benefit that we can move forward with it quickly.
I would therefore propose to start with the Python approach as a quick fix and work towards the Docker-based approach in the longer term.
Some feedback on the cons:
- we expect a fixed folder structure
This is already the case, since the command run by Kubeflow is hardcoded in Fondant.
- we'll need an `__init__.py` in every folder
We should execute the scripts via `subprocess`, which is how they will be run in the pipeline as well. Then we don't need an `__init__.py` file.
- how to handle symlinked components
We shouldn't need symlinks anymore. We do need to figure out how to run reusable components. Since they are packaged with Fondant, we can still reference their `main.py` locally.
- Need all requirements from all components or multiple environments
This is indeed the biggest remaining downside, especially for the reusable components. Maybe Fondant can automatically install the requirements in separate environments.
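A minimal sketch of that last idea, assuming a hypothetical layout where each component directory contains a `requirements.txt` (the function and directory names are illustrative, not the actual Fondant API):

```python
# Sketch: give each component its own virtual environment so conflicting
# requirements don't clash. Directory layout and names are hypothetical.
import subprocess
from pathlib import Path
import venv


def ensure_component_env(component_dir: Path, envs_root: Path) -> Path:
    """Create (or reuse) a venv for one component, install its requirements,
    and return the path to that environment's Python interpreter."""
    env_dir = envs_root / component_dir.name
    if not env_dir.exists():
        venv.create(env_dir, with_pip=True)
        requirements = component_dir / "requirements.txt"
        if requirements.exists():
            pip = env_dir / "bin" / "pip"
            subprocess.run(
                [str(pip), "install", "-r", str(requirements)], check=True
            )
    # the returned interpreter would then be used to run the component's main.py
    return env_dir / "bin" / "python"
```

The environments are created lazily and cached by component name, so repeated local runs only pay the installation cost once.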
Problem Statement
For ease of development and getting started with Fondant, having a way to easily run a toy pipeline locally would be beneficial. Currently we have to start a pipeline on Kubeflow to debug it, or run it locally with an awkward script.
The current implementations of the Pipeline, Client, and ComponentOp are quite Kubeflow-focused, so we will need to abstract some things to make this possible.
ComponentOp
The ComponentOp (component operation) just represents the component_spec and the runtime configuration (hardware specs + arguments).
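A runtime-agnostic ComponentOp could be as small as a dataclass; the field names below are illustrative assumptions, not the actual Fondant definition:

```python
# Sketch: a ComponentOp with no k8s/Kubeflow references. The client decides
# how to map the declarative hardware fields onto a concrete runtime.
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class ComponentOp:
    component_spec: dict                          # parsed component_spec
    arguments: dict = field(default_factory=dict)  # runtime arguments
    number_of_gpus: int = 0                        # hardware requirements stay
    memory: Optional[str] = None                   # declarative, e.g. "4g"
```

Because the class holds only data, the same op can be handed to a Kubeflow client or a local Docker client unchanged.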
Pipeline
You can register components on the pipeline with their dependencies to create a graph. The pipeline has all the logic to resolve the graph and validate the components (do the inputs and outputs match?).
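The resolve-and-validate step could look roughly like the sketch below, using a simplified stand-in for a component_spec (just `consumes`/`produces` field sets; the real spec format is richer):

```python
# Sketch: resolve a component graph into an execution order and check that
# every consumed field is produced by some upstream component.
from graphlib import TopologicalSorter


def resolve_and_validate(components, dependencies):
    """components: name -> {"consumes": set, "produces": set}
    dependencies: name -> set of upstream component names."""
    order = list(TopologicalSorter(dependencies).static_order())
    available = set()
    for name in order:
        spec = components[name]
        missing = spec["consumes"] - available
        if missing:
            raise ValueError(
                f"{name} consumes {missing}, which no upstream component produces"
            )
        available |= spec["produces"]
    return order
```

`TopologicalSorter` also raises on cycles for free, so the pipeline gets cycle detection along with the input/output validation.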
Client
In order to run the pipeline, you need a client that knows how to compile and submit the pipeline (currently this is only Kubeflow).
Proposed Approach
We remove all references to k8s and Kubeflow from ComponentOp and the Pipeline, and we make the client responsible for interpreting and running the graph defined in the pipeline. That way, by switching clients, we can reuse pipelines.
I see two ways of achieving this. They are not mutually exclusive; they just operate on different levels:
Docker
All components are Docker images that contain all the libraries and code needed to run the component, so we could start every component of a pipeline as a Docker container. We can control the IO by leveraging volume mounts, and we could even apply hardware constraints (https://docs.docker.com/config/containers/resource_constraints/). There is a Python library built by Docker that we could leverage to manage the running containers (https://github.com/docker/docker-py).
Pros
Cons
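A sketch of what this could look like with docker-py, running components sequentially and waiting for each container to exit before starting the next. The client is passed in so the orchestration logic stays testable; the component fields are illustrative assumptions, not an existing Fondant interface:

```python
# Sketch using docker-py (https://github.com/docker/docker-py).
# Each component image is started with a shared volume mount for IO,
# and the runner blocks on container.wait() before starting the next one.
def run_components(client, components, data_dir):
    """client: a docker.DockerClient (e.g. docker.from_env()).
    components: ordered list of {"image": ..., "command": ...} dicts."""
    for component in components:
        container = client.containers.run(
            component["image"],
            component.get("command"),
            volumes={data_dir: {"bind": "/data", "mode": "rw"}},
            detach=True,
        )
        result = container.wait()  # blocks until the container exits
        if result["StatusCode"] != 0:
            raise RuntimeError(f"{component['image']} failed: {result}")
```

This answers the coordination question from the comments above: sequential runs with `wait()` are enough for a linear pipeline, and hardware constraints could be added as extra keyword arguments on `containers.run`.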
Plain Python
Similar to the script we had before, where we run the main.py of every component in sequence, passing the paths along.
Pros
easy to run partial pipelines or single components
Cons
`__init__.py` in every folder

---> We chose to start with the Docker implementation since it seems to offer more benefits while not being that much more complex.
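For reference, the plain Python approach above can be sketched as follows, invoking each `main.py` via `subprocess` (so no `__init__.py` files are needed) and chaining the output path of one component into the input of the next. The argument names are hypothetical:

```python
# Sketch: run each component's main.py in a fresh interpreter, in sequence,
# passing the previous output path along as the next input path.
import subprocess
import sys


def run_pipeline(scripts, work_dir, initial_input):
    """scripts: ordered list of main.py paths. Each script is assumed to
    read --input_path and write --output_path (hypothetical arguments)."""
    input_path = initial_input
    for i, script in enumerate(scripts):
        output_path = f"{work_dir}/step_{i}"
        subprocess.run(
            [sys.executable, script,
             "--input_path", input_path,
             "--output_path", output_path],
            check=True,  # a non-zero exit code stops the pipeline
        )
        input_path = output_path  # chain the paths along
    return input_path
```

Running the scripts as subprocesses mirrors how they execute inside their containers, which keeps local behaviour close to the real pipeline.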
Implementation Steps/Tasks
Potential Impact
fondant/pipeline.py will contain the most changes.

Testing
Testing the pipeline compiler should be straightforward:
Testing the runner could be trickier:
Documentation
The local runner should be included in all documentation and should be promoted as a way to get started easily.