[Pipelines] Design - Data orchestration input

Jason94 commented 8 months ago

Overview

The Pipelines project targets two user groups. One of them is advanced users who are already fluent in Python. For these users, one of the main value-add features of pipelines is easy data orchestration integration. Data orchestration brings many benefits, such as error logging and data visibility. It is a key goal of the pipelines system that you get drop-in data orchestration of your pipelines "for free."

Currently (2/27/2024) the pipelines branch has hard-coded Prefect integration. This is a good proof of concept, as the Prefect integration is entirely behind the scenes. However, because Prefect is closed-source and cloud-based, it's not acceptable to lock pipelines into that tool.

Discussion

The initial goal of this discussion is to gather input from the community about:

What data orchestration tools you use
How you use those data orchestration tools
How your code interacts with those data orchestration tools

Once we have collected data about a wide variety of tools, we will design an abstraction that allows the pipelines system to work with as many data orchestration tools as possible. Then, data orchestration "plugins" that target the abstraction can be added either inside or outside of Parsons, allowing pipelines to be used with any data orchestration platform.

Without a thorough discussion of different data orchestration use cases, we risk designing an abstraction that cannot accommodate many of the tools that pipelines users will want to target in their code.
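To make the idea of "plugins that target the abstraction" a bit more concrete, here is a rough sketch of one possible shape. Everything in it is hypothetical: `OrchestratorPlugin`, `wrap_step`, `RecordingOrchestrator`, and `run_pipeline` are names made up for illustration and are not part of Parsons or the pipelines branch. The toy plugin only records events in memory; a real plugin would presumably forward them to whatever tool it targets.

```python
from typing import Any, Callable, List, Optional, Protocol, Tuple

Step = Callable[[Any], Any]


class OrchestratorPlugin(Protocol):
    """Hypothetical interface that an orchestration plugin would implement."""

    def wrap_step(self, name: str, step: Step) -> Step:
        ...


class NullOrchestrator:
    """Default plugin: runs every step unchanged."""

    def wrap_step(self, name: str, step: Step) -> Step:
        return step


class RecordingOrchestrator:
    """Toy plugin that records when each step starts, finishes, or fails."""

    def __init__(self) -> None:
        self.events: List[Tuple[str, str]] = []

    def wrap_step(self, name: str, step: Step) -> Step:
        def wrapped(data: Any) -> Any:
            self.events.append((name, "started"))
            try:
                result = step(data)
            except Exception:
                self.events.append((name, "failed"))
                raise
            self.events.append((name, "finished"))
            return result

        return wrapped


def run_pipeline(
    steps: List[Tuple[str, Step]],
    data: Any,
    orchestrator: Optional[OrchestratorPlugin] = None,
) -> Any:
    """Run the named steps in order, letting the plugin wrap each one."""
    if orchestrator is None:
        orchestrator = NullOrchestrator()
    for name, step in steps:
        data = orchestrator.wrap_step(name, step)(data)
    return data


# The pipeline definition does not change when the plugin does.
steps = [("double", lambda rows: [r * 2 for r in rows]), ("total", sum)]
recorder = RecordingOrchestrator()
result = run_pipeline(steps, [1, 2, 3], recorder)
```

The point of the sketch is that the pipeline code never names a specific tool, so swapping `RecordingOrchestrator` for another plugin would not require touching the steps themselves. Whether a single `wrap_step` hook is enough to cover real tools is exactly the kind of question this discussion is meant to answer.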

austinweisgrau commented 8 months ago

TBH I'm not sure I understand the concept of how Parsons could implement orchestration. As I think of it, orchestration really requires cloud infrastructure to be provisioned and configured, including code storage in the cloud (dockerizing and pushing to a docker store or copying code to s3), cloud compute, cloud secret storage for access in production, a healthy layer of IAM roles for development access and appropriately scoped execution privileges, billing information / a credit card on file, etc. etc.

For the Prefect example, wrapping a python script in @prefect.flow doesn't actually implement "orchestration," it just means that if that script is run, it will be logged in Prefect Cloud (if a Prefect Cloud account exists and appropriate API keys are set in the environment). Orchestration would also involve bundling the script as a Prefect deployment with a schedule and setting up a cloud execution layer (Prefect doesn't run an execution layer the way some other orchestration platforms, such as Airflow, do; it leaves that up to the user to set up).
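For reference, the kind of wrapping described above looks roughly like this. `count_rows` is a made-up stand-in for a real script, and the `try`/`except` fallback is only there so the snippet still runs where Prefect isn't installed; it is not something Prefect or Parsons provides.

```python
try:
    from prefect import flow  # Prefect's flow decorator
except ImportError:
    # Assumption for this sketch only: without Prefect installed,
    # fall back to a decorator that leaves the function unchanged.
    def flow(fn):
        return fn


@flow
def count_rows(rows):
    """Hypothetical script body standing in for any Parsons job."""
    return len(rows)


# Calling the flow runs it right here, in the current process.
# Nothing in this file schedules it or provisions anywhere to run it.
row_count = count_rows([{"id": 1}, {"id": 2}])
```

Nothing here creates a deployment, a schedule, or an execution environment, and whether the run is reported anywhere depends on the account and API key setup mentioned above.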

Most of this feels outside the scope of what a python package (Parsons) can really implement.

Jason94 commented 8 months ago

Austin, those are some good points. Here is what the current Prefect implementation provides and what it doesn't:

Here is what it doesn't provide:

Provide a cloud platform to execute your code (Civis, Airflow like you said)
Automatically schedule your jobs to run, either locally or via the cloud
Containerize anything or manage the environment (Personally I would consider this "dev ops" not "data orchestration", but it doesn't really matter)

Here is what it does:

Provide visibility, via a cloud interface, into what scripts are running and what those scripts are doing
Provide better visibility into where, when, and why errors occurred

I'm not convinced that it couldn't help with scheduling and some of that other stuff, although it'd be pretty dependent on whatever plugins we built. An option I looked into was Apache Airflow, which would be possible to integrate in a similar way to how Prefect is currently handled, I think.

I think you raise three really good questions for this part of the design:

1. Maybe there is a better name for what we're talking about here than "data orchestration".
2. What set of "data orchestration" features do our users want? Austin, are some of the things you mentioned, like scheduling or interacting with cloud infrastructure, features you would ideally like to see?
3. Based on the answer to 2, what are some tools we could build off of that could provide more of those features than the current Prefect integration does? As I mentioned, Apache Airflow is a high-power library that could be of use. There's also probably more to Prefect that I haven't explored. What's currently in there is more of a proof-of-concept than anything. If someone's used Prefect more than I have, it'd be great if they could weigh in on which of these kinds of tasks Prefect is and isn't capable of handling.

move-coop / parsons

[Pipelines] Design - Data orchestration input #1005
