unionai-oss / pandera

A light-weight, flexible, and expressive statistical data testing library
https://www.union.ai/pandera
MIT License
Pandera Airflow Operator #1042

Open cosmicBboy opened 1 year ago

cosmicBboy commented 1 year ago

Is your feature request related to a problem? Please describe.

Pandera currently has integrations with two orchestrators: Flyte and Dagster. Airflow is currently unsupported, and its users would benefit from a simple, extensible data validation toolkit provided by pandera.

Describe the solution you'd like

An Airflow operator packaged as a plugin, where users can specify the pandera schema they want to apply to a pandera-supported data structure passing through an Airflow DAG.

Describe alternatives you've considered

Letting users roll their own custom pandera operator? 🤷‍♂️

Additional context

This was brought up in a reddit post: https://www.reddit.com/r/dataengineering/comments/z90wtm/comment/iyh5e3w/?utm_source=share&utm_medium=web2x&context=3

erichamers commented 1 year ago

Hey! I brought this up on reddit, would be happy to help :)

cosmicBboy commented 1 year ago

Hi @erichamers ! Awesome that would be amazing ✨

Are you familiar with the process of creating plugins for Airflow? I think good first steps would be:

  1. reach out to the Airflow community about a feature request to see if the pandera plugin can be hosted in the official provider repository.
  2. design/create a working prototype for the airflow-pandera-provider.

I can take on (1), do you mind taking on (2)?

erichamers commented 1 year ago

No, I'm not familiar with the process for creating airflow plugins. Yes, I can work on (2) 😃

cosmicBboy commented 1 year ago

Cool! I see: https://github.com/erichamers/airflow-provider-pandera

@erichamers let's create a design doc for this, I created a stub doc here: https://www.notion.so/Design-Doc-Airflow-Pandera-Provider-a352cc3c49844a0dbacff16ba40ff079

I just added you with edit permissions to that doc.

Admittedly I'm not a huge airflow user so I'd appreciate if you can help design the user-facing API. I have a few (naive) questions about how airflow works, we can take the discussion over to the notion doc.

erichamers commented 1 year ago

I have a decent amount of airflow experience due to day-to-day work, hopefully I'll be able to help. Will follow up on notion. @cosmicBboy

cosmicBboy commented 1 year ago

opened up a discussion in the airflow dev mailing list: https://lists.apache.org/thread/qk2co6trd7gm57744shprw2fhgmjr637

cosmicBboy commented 1 year ago

@erichamers so it's looking likely that the pandera provider will be hosted in a non-airflow-maintained repo... would you be open to moving the airflow-provider-pandera repo over to the unionai-oss org at some point, perhaps once we have an MVP?

We'll of course credit you as one of the core authors of this provider :)

erichamers commented 1 year ago

Of course! I'll try to work on it some more this week, I've been kinda busy at work. Do you have any kind of release date in mind for this? @cosmicBboy

cosmicBboy commented 1 year ago

awesome, thanks @erichamers! We can do an alpha release 0.0.1 once we have something that just validates a pandas dataframe. This'll help us get all the CI/CD infra in place and put us in a good spot to iterate. I'm hoping we can launch this in early January of next year.

However, as the airflow devs point out, I think the value that would justify a pandera provider would be an operator that loads data from some remote source (DB or blob store), parses + validates it, and optionally uploads the data to a target destination. I think that would be a 0.1.0 release that we can aim for end of Q1 2023, which will come with:

erichamers commented 1 year ago

That makes sense and the timeline seems attainable.