Open cosmicBboy opened 1 year ago
Hey! I brought this up on reddit, would be happy to help :)
Hi @erichamers! Awesome, that would be amazing ✨
Are you familiar with the process of creating plugins for Airflow? I think good first steps would be:
I can take on (1), do you mind taking on (2)?
No, I'm not familiar with the process for creating airflow plugins. Yes, I can work on (2) 😃
Cool! I see: https://github.com/erichamers/airflow-provider-pandera
@erichamers let's create a design doc for this, I created a stub doc here: https://www.notion.so/Design-Doc-Airflow-Pandera-Provider-a352cc3c49844a0dbacff16ba40ff079
I just added you with edit permissions to that doc.
Admittedly I'm not a huge airflow user, so I'd appreciate it if you could help design the user-facing API. I have a few (naive) questions about how airflow works; we can take the discussion over to the notion doc.
I have a decent amount of airflow experience due to day-to-day work, hopefully I'll be able to help. Will follow up on notion. @cosmicBboy
I opened up a discussion on the airflow dev mailing list: https://lists.apache.org/thread/qk2co6trd7gm57744shprw2fhgmjr637
@erichamers so it's looking likely that a pandera provider will be hosted in a non-airflow-maintained repo... would you be open to moving the airflow-provider-pandera repo over to the unionai-oss org at some point, perhaps once we have an MVP?
We'll of course credit you as one of the core authors of this provider :)
Of course! I'll try to work on it some more this week, I've been kinda busy at work. Do you have any kind of release date in mind for this? @cosmicBboy
Awesome, thanks @erichamers! We can do an alpha release 0.0.1 once we have something that just validates a pandas dataframe. This'll help us get all the CI/CD infra in place and put us in a good spot to iterate. I'm hoping we can launch this in early January of next year.
However, as the airflow devs point out, I think the value that would justify a pandera provider would be to load data from some remote source (DB or blob store), parse + validate it, and optionally upload it to a target destination. I think that would be a 0.1.0 release that we can aim for by the end of Q1 2023, which will come with:
That makes sense and the timeline seems attainable.
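For reference, the load, parse + validate, and optional upload flow discussed for 0.1.0 could be sketched roughly like this. It is only an illustration under assumptions: `run_validated_transfer`, `require_ids`, and the in-memory source and destination are made-up names, not part of pandera or Airflow. In a real provider, `validate` would presumably be a pandera schema's `validate` method and `load`/`upload` would talk to a DB or blob store.

```python
from typing import Any, Callable, Optional


def run_validated_transfer(
    load: Callable[[], Any],
    validate: Callable[[Any], Any],
    upload: Optional[Callable[[Any], None]] = None,
) -> Any:
    """Load data, validate it, and optionally upload the validated result.

    Hypothetical helper: `validate` is expected to raise on invalid data,
    so nothing is uploaded unless validation succeeds.
    """
    raw = load()
    validated = validate(raw)
    if upload is not None:
        upload(validated)
    return validated


def require_ids(rows):
    """Stand-in validator (assumption): every row must carry an 'id' key."""
    if any("id" not in row for row in rows):
        raise ValueError("row is missing 'id'")
    return rows


# In-memory stand-ins for a remote source and a target destination.
destination = []
rows = run_validated_transfer(
    load=lambda: [{"id": 1}, {"id": 2}],
    validate=require_ids,
    upload=destination.extend,
)
```

Keeping the three steps as plain callables means the same flow works whether the data comes from a DB hook, a blob store, or an upstream task.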
Is your feature request related to a problem? Please describe.
Pandera currently has integrations with two orchestrators: Flyte and Dagster. Airflow is currently unsupported, and its users would benefit from a simple, extensible data validation toolkit provided by pandera.
Describe the solution you'd like
An Airflow operator packaged as a plugin where users can specify the pandera schema they want to use for some pandera-supported data structure passing through an Airflow DAG.
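To make the request a bit more concrete, here is a rough, dependency-free sketch of the shape such an operator might take. Everything named here is hypothetical (`PanderaValidateOperator`, `SupportsValidate`, `load_fn`, `NonNegativeSchema`); a real provider would presumably subclass Airflow's `BaseOperator`, and `schema` would be a pandera schema, which exposes a `validate` method. The sketch only relies on that `validate` method, so it runs without Airflow or pandera installed.

```python
from typing import Any, Callable, Protocol


class SupportsValidate(Protocol):
    """Anything with a pandera-style ``validate`` method."""

    def validate(self, data: Any) -> Any: ...


class PanderaValidateOperator:
    """Hypothetical operator: load data, validate it, return the result.

    Assumption: a real implementation would subclass Airflow's BaseOperator.
    This stand-in keeps only the ``execute(context)`` entry point.
    """

    def __init__(
        self,
        task_id: str,
        schema: SupportsValidate,
        load_fn: Callable[[], Any],
    ) -> None:
        self.task_id = task_id
        self.schema = schema
        self.load_fn = load_fn

    def execute(self, context: dict) -> Any:
        data = self.load_fn()
        # Validation errors propagate, which would fail the task.
        return self.schema.validate(data)


class NonNegativeSchema:
    """Stand-in for a pandera schema, for illustration only."""

    def validate(self, data):
        if any(value < 0 for value in data):
            raise ValueError("negative value found")
        return data


op = PanderaValidateOperator(
    task_id="validate_scores",
    schema=NonNegativeSchema(),
    load_fn=lambda: [1, 2, 3],
)
result = op.execute(context={})
```

The user-facing surface is then just a schema plus a way to get the data, which is roughly what the design doc would need to pin down.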
Describe alternatives you've considered
Letting users roll their own custom pandera operator? 🤷‍♂️
Additional context
This was brought up in a reddit post: https://www.reddit.com/r/dataengineering/comments/z90wtm/comment/iyh5e3w/?utm_source=share&utm_medium=web2x&context=3