Closed GeorgesLorre closed 10 months ago
Proposition
This compiler is responsible for:
class Compiler:
    def __init__(self, pipeline: "Pipeline"):
        self.pipeline = pipeline

    def add_metadata(self, component):
        # add metadata for each component
        # including the cache key
        ...

    def validate_pipeline(self):
        # run validation checks on the pipeline
        # - check for duplicate names
        # - check if input/output match
        # - validate dependencies
        # - ...
        self.pipeline.validate()

    def compile(self, output_path: str):
        # compile the pipeline to a file
        self.validate_pipeline()
        spec = {}  # fixed spec describing a fondant version
        for component in self.pipeline:
            self.add_metadata(component)
        self.pipeline.save(output_path)
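The validation checks named in the comments above could look roughly like the sketch below. It assumes a hypothetical representation in which the pipeline is a list of component names (in the order they were added) plus a dict mapping each name to the names it depends on. `validate_graph` and `InvalidPipelineError` are illustrative names, not part of Fondant, and the input/output matching check is left out.

```python
from collections import Counter


class InvalidPipelineError(ValueError):
    """Hypothetical error type for failed validation checks."""


def validate_graph(names, dependencies):
    """Validate a pipeline given as component names plus their dependencies.

    names: list of component names in the order they were added (assumption).
    dependencies: dict mapping a component name to the names it depends on.
    """
    # check for duplicate names
    duplicates = sorted(n for n, count in Counter(names).items() if count > 1)
    if duplicates:
        raise InvalidPipelineError(f"duplicate component names: {duplicates}")

    known = set(names)
    position = {name: index for index, name in enumerate(names)}
    for name in names:
        deps = dependencies.get(name, [])

        # every dependency must be a known component
        missing = sorted(set(deps) - known)
        if missing:
            raise InvalidPipelineError(f"{name} depends on unknown components: {missing}")

        # a component may only depend on earlier components, which also rules out cycles
        for dep in deps:
            if position[dep] >= position[name]:
                raise InvalidPipelineError(f"{name} depends on later component {dep}")
```

Raising on the first failed check keeps the sketch short; collecting all errors before raising would be an equally valid choice.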
This runner is not generic and needs to be implemented per framework. It is responsible for:
class Runner:
    def __init__(self):
        self._resolve_imports()

    def _resolve_imports(self):
        # "framework" is a placeholder for the framework-specific package
        import framework
        self.framework = framework

    def compile(self, pipeline_spec: str):
        # take the pipeline spec and create
        # a spec related to the framework
        # this can be saved to a file
        ...

    def run(self, pipeline_spec: str = None, framework_spec: str = None):
        # run the pipeline using the framework
        # can compile and run based on the type of input
        if pipeline_spec:
            framework_spec = self.compile(pipeline_spec)
        self.framework.run(framework_spec)
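The `_resolve_imports` idea above, importing the framework only when the class is actually created, can be sketched with the standard library's `importlib`. The mixin name `LazyFrameworkMixin` and the `framework_module` attribute are hypothetical, and `json` is used purely as a stand-in for a framework package so the example runs anywhere.

```python
import importlib


class LazyFrameworkMixin:
    """Hypothetical helper, not part of Fondant."""

    framework_module = None  # dotted module name, set by subclasses

    def _resolve_imports(self):
        # import the framework only when an instance is created
        try:
            self.framework = importlib.import_module(self.framework_module)
        except ImportError as exc:
            raise ImportError(
                f"'{self.framework_module}' must be installed to use {type(self).__name__}"
            ) from exc


class JsonRunner(LazyFrameworkMixin):
    framework_module = "json"  # stand-in for a real framework package

    def __init__(self):
        self._resolve_imports()
```

Putting the import logic in one shared helper means a compiler and a runner for the same framework would not each need their own copy of it.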
The fondant pipeline spec
Having a representation (YAML, JSON) of a fondant pipeline would be a nice upgrade, and it would help with creating reusable pipelines. I'm not quite sure yet what the format of this should be (maybe IR YAML, since it aims to be agnostic).
I think this makes sense.
1) Some runners have very specific settings that are set during the framework's compile step, mainly Docker for now: extra_volumes, build_args. Would you then move those arguments to the runner instead? I think this would not be a big issue, since that's what we currently do when we run fondant run local --extra-volumes <volume>.
2) Would we then produce two specs every time we run a pipeline?
For me it's not clear yet what we want exactly from the compiling:
The only work I would do now is to hide the compiler from the user. This is already the case for the CLI, but I would align the SDK, so a user only needs to create and use a Runner. I would also hide the compiled component spec, either by deleting it again or by storing it in a /tmp directory (or both).
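Hiding the compiled spec by writing it to a temporary directory and deleting it afterwards could be sketched as follows. `run_hidden`, `compile_fn` and `run_fn` are hypothetical stand-ins for a framework-specific compiler and runner; only `tempfile.TemporaryDirectory` is real standard-library API.

```python
import os
import tempfile


def run_hidden(pipeline, compile_fn, run_fn):
    """Compile `pipeline` into a temporary directory, run it, then clean up.

    compile_fn(pipeline, output_path) and run_fn(spec_path) are hypothetical
    stand-ins for a framework-specific compiler and runner.
    """
    with tempfile.TemporaryDirectory(prefix="fondant-") as tmp_dir:
        spec_path = os.path.join(tmp_dir, "pipeline-spec.yaml")
        compile_fn(pipeline, spec_path)
        # the spec file only exists for the duration of the run
        return run_fn(spec_path)
```

With this shape the user never sees the intermediate file, which covers both options mentioned above: the spec lives in a temporary location and is removed once the run returns.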
In the interest of keeping this change manageable and avoiding premature optimisation, I would indeed not go for a Fondant-specific pipeline spec (yet). Once we have a better idea of what we need from a spec like this, we can still introduce it.
The compiler will now be quite empty: just some validation and maybe some logic to handle metadata and cache keys. But like Robbe suggested, if we can already make the framework runners nicer to use directly (without the CLI), that is already a win. We can then also implement the SageMaker runner more cleanly (since a SageMaker compiler does not make a lot of sense).
So for now the runner will take a Fondant Pipeline object as input and use some methods provided by the compiler.
I think there are more things that compilation should do. The main one is that it should freeze the pipeline definition, which means the composition, the arguments, the versions of the components, ...
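One way to picture "freezing" is to serialise the composition, arguments and component versions into a deterministic document and fingerprint it. The sketch below assumes a hypothetical dict layout for each component (`name`, `version`, `arguments`, `dependencies`); `freeze_pipeline` is an illustrative name and not an existing Fondant function.

```python
import hashlib
import json


def freeze_pipeline(components):
    """Return (serialised_spec, digest) for a list of component descriptions.

    Each component is a hypothetical dict with the keys
    "name", "version", "arguments" and "dependencies".
    """
    frozen = {
        "components": [
            {
                "name": c["name"],
                "version": c["version"],
                "arguments": dict(c.get("arguments", {})),
                "dependencies": list(c.get("dependencies", [])),
            }
            for c in components
        ]
    }
    # sort_keys makes the output independent of dict insertion order
    serialised = json.dumps(frozen, sort_keys=True, separators=(",", ":"))
    digest = hashlib.sha256(serialised.encode("utf-8")).hexdigest()
    return serialised, digest
```

Any change to an argument or a component version changes the digest, so two runs with the same digest were given the same frozen definition under these assumptions.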
I believe we should either implement a minimal compiler which does all these things (so it won't be very minimal), or we should postpone any work on this and just remove the current Compiler from the public interface so we can change it freely in the future.
Running a Fondant pipeline on one of our supported runners is currently a two-step affair:
Compile
Located at /src/fondant/pipeline/compiler.py
We provide an abstract Compiler class that is reimplemented for each framework. Its main input is a Fondant Pipeline object and some framework-specific arguments. The compiler will/should:
Run
Located at /src/fondant/pipeline/runner.py
Again, there is an abstract Runner class to be implemented for each framework. It takes a file as input; this file is the output of the compiler for the same framework. The runner is responsible for the following:
client (this can be with custom credentials, etc.)
3. Submit the spec
Invoking the compiler/runner can be done in one of two ways:
CLI
Located at /src/fondant/cli.py
The CLI is there to make it really easy to run a Fondant pipeline on a framework. There are separate commands for compile and run, which just call the above. Note that the run CLI command can execute a compile before running, depending on whether it received a Fondant Pipeline or an already compiled spec.
Directly
Remarks with the current implementation
The compiler and runner require framework-specific imports, which we only import when actually initialising these classes, but we have to do it in two places.
compiler or the runner