Jason94 opened 8 months ago
TBH I'm not sure I understand how Parsons could implement orchestration. As I think of it, orchestration really requires cloud infrastructure to be provisioned and configured: code storage in the cloud (dockerizing and pushing to a Docker registry, or copying code to S3), cloud compute, cloud secret storage for access in production, a healthy layer of IAM roles for development access and appropriately scoped execution privileges, billing information / a credit card on file, and so on.
For the Prefect example, wrapping a Python script in @prefect.flow doesn't actually implement "orchestration"; it just means that if that script is run, the run will be logged in Prefect Cloud (assuming a Prefect Cloud account exists and the appropriate API keys are set in the environment). Orchestration would also involve bundling the script as a Prefect deployment with a schedule and setting up a cloud execution layer (unlike some other orchestration platforms such as Airflow, Prefect doesn't run an execution layer itself; it leaves that up to the user to set up).
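To make that concrete, here is a minimal sketch of what the decorator alone buys you (assuming the standard prefect package; my_pipeline is an illustrative name, not anything from Parsons):

```python
from prefect import flow

@flow
def my_pipeline():
    # Illustrative pipeline body; stands in for whatever the script does.
    print("moving some data around")

if __name__ == "__main__":
    # Running this script executes the flow once, in this local process,
    # just like calling any Python function. The run shows up in Prefect
    # Cloud only if credentials (PREFECT_API_KEY / PREFECT_API_URL) are
    # configured in the environment. Nothing here creates a schedule,
    # builds a deployment, or provisions compute.
    my_pipeline()
```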
Most of this feels outside the scope of what a Python package (Parsons) can really implement.
Austin, those are some good points. Here is what the current Prefect implementation provides and what it doesn't.
Here is what it doesn't provide:
Here is what it does:
I'm not convinced that it couldn't help with scheduling and some of that other stuff, though that would depend a lot on whatever plugins we built. One option I looked into was Apache Airflow, which I think could be integrated in a similar way to how Prefect is currently handled.
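For example, an Airflow plugin might wrap a pipeline in a DAG roughly like the sketch below (the DAG shape and run_pipeline are purely illustrative, not an agreed-on design):

```python
from datetime import datetime

from airflow.decorators import dag, task  # Airflow 2.x TaskFlow API

@dag(schedule="@daily", start_date=datetime(2024, 1, 1), catchup=False)
def parsons_pipeline_dag():
    @task
    def run_pipeline():
        # In a real plugin this would invoke a pipeline built with the
        # pipelines API; this print is a stand-in.
        print("running the pipeline")

    run_pipeline()

# Instantiate at module import so the Airflow scheduler discovers the DAG.
parsons_pipeline_dag()
```

Unlike Prefect, Airflow's own scheduler and executor would then handle scheduling and execution, which speaks to some of the infrastructure concerns raised above.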
I think you raise three really good questions for this part of the design:
Overview
The Pipelines project targets two user groups. One of them is advanced users who are already fluent in Python. For these users, one of the main value-add features of pipelines is easy data orchestration integration. Data orchestration provides many benefits, such as error logging and data visibility. It is a key goal of the pipelines system that you get drop-in data orchestration of your pipelines "for free."
Currently (2/27/2024) the pipelines branch has a hard-coded Prefect integration. This is a good proof of concept, as the Prefect integration happens entirely behind the scenes. However, because that integration ties pipelines to Prefect Cloud, a proprietary hosted service, it's not acceptable to lock pipelines into that tool.
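For reference, a behind-the-scenes integration like the current one might look roughly like this (a sketch with made-up names; this is not the actual branch code):

```python
from prefect import flow

class Pipeline:
    # Hypothetical pipelines object; the real branch's classes may differ.
    def __init__(self, name, steps):
        self.name = name
        self.steps = steps  # ordered list of zero-argument callables

    def run(self):
        # The pipeline silently wraps its steps in a Prefect flow, so users
        # get Prefect's logging and run history without writing any Prefect
        # code themselves. The cost is that the coupling to Prefect is
        # hard-coded here.
        @flow(name=self.name)
        def _run():
            for step in self.steps:
                step()

        _run()
```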
Discussion
The initial goal of this discussion is to gather input from the community about the data orchestration tools people use and how they use them.
Once we have collected data about a wide variety of tools, we will design an abstraction that allows the pipelines system to work with as many data orchestration tools as possible. Then, data orchestration "plugins" that target the abstraction can be added either inside or outside of Parsons, allowing pipelines to be used with any data orchestration platform.
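As a strawman for what that abstraction could look like, here is a minimal sketch (every name in it is hypothetical):

```python
from abc import ABC, abstractmethod
from typing import Callable

class OrchestrationPlugin(ABC):
    """Adapter between the pipelines system and one orchestration tool."""

    @abstractmethod
    def wrap(self, pipeline_fn: Callable, name: str) -> Callable:
        """Return a callable that runs pipeline_fn under this orchestrator."""

class PrefectPlugin(OrchestrationPlugin):
    def wrap(self, pipeline_fn: Callable, name: str) -> Callable:
        from prefect import flow
        return flow(name=name)(pipeline_fn)

class NoOpPlugin(OrchestrationPlugin):
    def wrap(self, pipeline_fn: Callable, name: str) -> Callable:
        # Fallback: run the pipeline with no orchestration at all.
        return pipeline_fn
```

The pipelines system would call plugin.wrap(...) when building a pipeline, so swapping orchestrators (or running with none) becomes a configuration choice rather than a code change.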
Without a thorough discussion of different data orchestration use cases, we risk designing an abstraction that cannot accommodate many of the tools that pipelines users will want to target in their code.