unionai-oss / pandera

A light-weight, flexible, and expressive statistical data testing library
https://www.union.ai/pandera
MIT License

Support for `polars` #1064

Closed fzyzcjy closed 4 months ago

fzyzcjy commented 1 year ago

Hi, thanks for the lib! I wonder if it can support type checking for polars?

cosmicBboy commented 1 year ago

Hi @fzyzcjy would love to support polars! Doing so is currently blocked by https://github.com/unionai-oss/pandera/issues/381, which I'm trying to get done ASAP, as it'll unblock support for a lot of different data frameworks, including polars.

fzyzcjy commented 1 year ago

Thanks, looking forward to it!

igmriegel commented 1 year ago

> Thanks, looking forward to it!

Me too!!

AndriiG13 commented 1 year ago

> Thanks, looking forward to it!

Same here, would also be happy to contribute to this one!

francesco086 commented 1 year ago

Same, happy to help :)

igmriegel commented 1 year ago

Hello, I don't really know if it helps, but I wanted to share this project

https://github.com/kolonialno/patito

They paired Pydantic and Polars, and they offer some of the functionality Pandera offers. Maybe we could fork something or use it as inspiration?

I'm not really experienced, but I'm willing to help too. 😃

cosmicBboy commented 1 year ago

hi all! so since merging the pandera internals re-write: https://github.com/unionai-oss/pandera/pull/913

Support for polars is technically unblocked! I'm still working on the docs for extending pandera with custom schema specs and backends, but basically here's a rough roadmap for supporting polars:

Support for pandera[polars]

Support for polars can come in two phases, both of which are actually independent of each other.

  1. Add support for an ibis backend. Their polars support is currently experimental, but this is a high-leverage integration that also supports a bunch of other execution backends. The idea here is that if pandera supports ibis as a backend, users get access to a bunch of other backends such as duckdb, mysql, postgres, etc. in-database validation here we come! 🚀
  2. Support a native pandera polars backend, independent of ibis. This will be useful if folks don't want to depend on ibis and want to write custom checks with the polars API.

The limitation of (1) would be that if you want to write custom checks, you'd have to do it with the ibis API. With (2), you'd be able to write custom checks (e.g. here) with the polars API.
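
To make the distinction concrete, here's a minimal sketch of the kind of custom check a user could write against the polars expression API under option (2). The function name, column name, and signature are purely hypothetical, not a committed pandera interface:

```python
import polars as pl

# Hypothetical custom check written with the polars API: it takes a
# LazyFrame and returns a lazy boolean column, one value per row.
def col_a_is_positive(lf: pl.LazyFrame) -> pl.LazyFrame:
    return lf.select(pl.col("col_a").gt(0).alias("col_a_is_positive"))

# Nothing is computed until .collect() is called.
check_result = col_a_is_positive(pl.LazyFrame({"col_a": [1, 2, 3]})).collect()
assert check_result["col_a_is_positive"].all()
```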

What if I want to use pandera to validate polars dataframes but don't want the pandas dependency?

Currently pandera has a hard dependency on pandas, which is pretty much ubiquitous in data eng/data science/ML stacks, but in case folks want to use pandera-polars in a limited context (e.g. AWS Lambda) and want to minimize dependencies, there is a longer term plan for this. Basically, we can either:

  1. Do a breaking pandera==1.0.0 release, where users have to explicitly install pandera[pandas] for pandas DataFrame validation, then organize the library to support other backends, e.g. pandera[polars], pandera[ibis], etc.
  2. Rip out the pandera.core and pandera.backend modules into an upstream library pandera-core, so that we can create contrib or plugin packages, e.g. a pandera-polars that doesn't have to depend on pandas and can be installed independently via pip install pandera-polars

Do any of you have any thoughts on this? @fzyzcjy @igmriegel @AndriiG13 @francesco086

fzyzcjy commented 1 year ago

I use both polars and pandas, so do not have any thoughts - all are totally acceptable. Good job and cannot wait to use it!

gab23r commented 1 year ago

As a user it would be nice to have only one package, and this package would have no strict dependencies. So I am clearly in favor of option one here! Otherwise we would end up with pandera-polars, pandera-ibis, pandera-vaex, ... and the list will grow, making it more complicated to manage from the user's perspective.

francesco086 commented 1 year ago

> Do any of you have any thoughts on this? @fzyzcjy @igmriegel @AndriiG13 @francesco086

Thinking out loud (and please correct me if I am wrong):

  1. Option 1 means all the code will be in one single repository. Option 2 means there will be many repos, with inter-dependencies (e.g. pandera-pandas will have a dependency on pandera-core as specified in pyproject.toml). This means the various pandera-*df_engine* packages can be developed at different speeds: if you make a major release of pandera-core you don't need to immediately update all engines, as they can keep relying on the old version. -> Option 2 is more modular
  2. Option 1 means that when I install pandera (without extras) and try to do something that requires an extra, I will get an error informing me that I need to install the extra and can do so straight away. With option 2 I could have compatibility issues, for example if pandera-pandas and pandera-polars rely on incompatible versions of pandera-core (or others). -> Option 1 makes it easier to work with many df engines in the same venv
  3. Option 2 means smaller codebases that are probably easier to get into and manage (to attract contributors)
  4. Option 2 forces pandera-core to have a neat public interface that can be re-used without "hacks"

All in all I am in favor of Option 2. My point 2 above is not very important in my opinion; if you really need to work with two different df engines, you can always do it in two separate venvs.

AndriiG13 commented 1 year ago

As a user I think both phases make sense. As I understand it, the ibis support would be especially nice for folks who are using different df engines in their project, since they could reuse checks defined with the ibis API across engines.

At the same time I think it's good to have a Polars native solution.

So I like both, but frankly I'm ignorant of the possible package-management implications mentioned by others above.

igmriegel commented 1 year ago

> 2. Support a native pandera polars backend, independent of ibis. This will be useful if folks don't want to depend on ibis and want to write custom checks with the polars API.

I think we should not bring in a third dependency just to serve Polars, and I agree with francesco086's considerations about pandera-core.

cosmicBboy commented 1 year ago

Cool, thanks for the discussion all!

So re: the polars-support roadmap, I'll plan on working on the ibis backend integration as an n=2 sample of how well the pandera core/backend abstractions fit into supporting another non-pandas-API framework.

Help Needed!

Will definitely need some help designing/implementing the polars-native backend: will need to ramp up on the python polars API myself, but would anyone on this thread be willing to help out?

Design

  1. assessing whether the attributes of DataFrameSchema and Columns fit with the polars API (see the sketch after this list). I'm aware that polars doesn't have a notion of Index (which sounds awesome actually 😎), but if there's anything besides columns that needs to be validated in a polars dataframe, that would be good to know
  2. assessing whether the backend model fits into how polars works
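
For reference when making assessment (1), here's roughly what the current pandas-oriented schema object exposes (a minimal sketch using today's public pandera API; `index` is the one attribute with no polars analogue):

```python
import pandera as pa

# Today's pandas-oriented schema: most attributes (columns, checks,
# coerce, strict) map naturally onto polars; `index` does not, since
# polars dataframes have no index.
schema = pa.DataFrameSchema(
    columns={"a": pa.Column(int, checks=pa.Check.ge(0))},
    index=pa.Index(str),  # no polars analogue
    coerce=True,   # coerce column dtypes on validation
    strict=True,   # disallow columns not declared in the schema
)
```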

Implementation

Eventually we'll also need help with the implementation itself.

Please give a 👍 to this comment if you'll be able to help with one or more of the above

StefanBRas commented 1 year ago

The polars project itself ships data synthesis functions for use with hypothesis: API reference link.

ritchie46 commented 1 year ago

One thing that would also be cool is validating polars LazyFrames. A LazyFrame is a promise of a computation/query plan. But polars knows the schema for every step in the plan, so we can validate before running the queries. I think this can be very valuable in ETL. Your pipeline is validated before it runs, and not 20mins in.
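
A minimal sketch of that schema-before-execution idea (assuming a recent polars; on older versions the `LazyFrame.schema` property plays the role of `collect_schema()`):

```python
import polars as pl

# Build a lazy query plan; nothing is executed yet.
lf = (
    pl.LazyFrame({"a": [1, 2, 3], "b": ["x", "y", "z"]})
    .with_columns((pl.col("a") * 2).alias("a_doubled"))
)

# polars can resolve the output schema without running the query,
# so column names and dtypes can be validated up front.
schema = lf.collect_schema()  # `lf.schema` on older polars versions
assert schema["a_doubled"] == pl.Int64

df = lf.collect()  # only now does the computation actually run
```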

francesco086 commented 1 year ago

Just want to mention that I really would like to help, but I am not familiar with polars (yet). So I think in this first phase I am probably not useful. I am very much willing to learn what is needed and implement following your directions :) (please use me!)

ritchie46 commented 1 year ago

> Just want to mention that I really would like to help, but I am not familiar with polars (yet). So I think in this first phase I am probably not useful. I am very much willing to learn what is needed and implement following your directions :) (please use me!)

If you are familiar with pandera, please join our discord. We can open a pandera thread and help one snippet at a time.

francesco086 commented 1 year ago

> One thing that would also be cool is validating polars LazyFrames. A LazyFrame is a promise of a computation/query plan. But polars knows the schema for every step in the plan, so we can validate before running the queries. I think this can be very valuable in ETL. Your pipeline is validated before it runs, and not 20mins in.

@ritchie46 One important aspect to keep in mind is that pandera has schema models that capture much more than column names and types. For example, a pandera schema could describe and check the constraint col_a + col_b = col_c. So I am not sure about LazyFrames; I don't think it is possible to validate such constraints before actually doing a computation.
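
For concreteness, here is what such a cross-column constraint looks like with today's pandas backend (a minimal sketch; the column names are made up):

```python
import pandas as pd
import pandera as pa

# A dataframe-level check expressing col_a + col_b == col_c;
# it needs actual values, not just column names and dtypes.
schema = pa.DataFrameSchema(
    columns={
        "col_a": pa.Column(int),
        "col_b": pa.Column(int),
        "col_c": pa.Column(int),
    },
    checks=pa.Check(lambda df: df["col_a"] + df["col_b"] == df["col_c"]),
)

schema.validate(pd.DataFrame({"col_a": [1], "col_b": [2], "col_c": [3]}))
```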

cosmicBboy commented 1 year ago

> One thing that would also be cool is validating polars LazyFrames. A LazyFrame is a promise of a computation/query plan. But polars knows the schema for every step in the plan, so we can validate before running the queries.

I think this will be very valuable for type checking and perhaps other dataframe metadata, though a limitation would be that it wouldn't be able to apply checks on actual values (e.g. pandera.Check.ge(0)) before running the code (unless I'm missing something conceptually). This is fine, I think, as long as the UX for applying pandera schemas to LazyFrames makes clear that validation only covers dataframe metadata (data types, column names, etc.)

Regardless, LazyFrame validation would definitely be a huge plus!

kuatroka commented 1 year ago
>   1. Support a native pandera polars backend, independent of ibis. This will be useful if folks don't want to depend on ibis and want to write custom checks with the polars API.

If you look to future-proof pandera, expand the number of users, and be strategic about it, the way to go would be to go full throttle on independent polars support and prioritise it over ibis. The ibis project is roughly 9 years old and has 2.5K stars; polars is roughly 3 years old and has 14.3K stars. Many people say stars are a vanity metric, but I disagree. They are still a metric, and they do show what is being used more.

To be clear, ibis is great and polars is great, but if it's about sequencing new feature development and allocating effort per unit of immediate usefulness, while reaching the widest possible number of users, I'd suggest going for full independent polars support first.

P.D. pandera is just awesome.

cosmicBboy commented 1 year ago

@kuatroka good feedback!

My short-term priority is still to add an ibis backend, since the motivation there is to enable in-database validation for a number of supported DBs (postgres, mysql, etc). Support for this has been in demand for a while now, and the nice side effect is that it adds (experimental) support for polars.

That said, I'm all for polars support. Would love community contributions on this, but I owe all of you a comprehensive set of docs first on how to extend pandera with your own schema specification and backends (I'm working on this now!)

the-matt-morris commented 1 year ago

Jumping in a little late here, but as a user of both pandera and polars (and love both libraries), I'd be willing to contribute to make this happen so I don't have to also add pandas as a dependency in my pipelines and perform the validation portion on pandas dataframes!

blais commented 1 year ago

Adding a +1 for Polars schemas!

kykyi commented 1 year ago

@cosmicBboy keen to help 🚀

Sounds like you are prioritising the ibis integration, which I'd be keen to look at. Do you have any work in progress on this yet?

Or if you want to focus on ibis, I could start spiking out what an independent polars module could look like 👌

kykyi commented 1 year ago

If I understand correctly, it will be a matter of filling out the yellow PolarsSchemaBackend and IbisBackend branches? [image: diagram of the pandera schema backend class hierarchy]

cosmicBboy commented 1 year ago

@kykyi correct! There's a WIP PR for adding pyspark.sql native support: https://github.com/unionai-oss/pandera/pull/1213

We'll basically need to do the same for polars and ibis.

kykyi commented 1 year ago

Nice, I may wait until #1213 is merged, as it looks close, and follow the conventions it introduces rather than implementing polars from scratch. Is that reasonable @cosmicBboy?

cosmicBboy commented 1 year ago

Polars mini-roadmap 🛣

In order to support a pandera polars backend, we'll need to implement the following:

Each of the modules described above has analogues in the pandas implementation. This is just an initial draft; I'll fill in more details in the next few days.

kykyi commented 1 year ago

pandera.engines.polars_engine seems straightforward and there is a lot of convention to follow, so it could be a good start. I'll get cracking!

Benjamin-T commented 1 year ago

Great that this is picked up!

cosmicBboy commented 1 year ago

thanks @kykyi ! once 0.16.0 is out I'll be able to turn my attention to stubbing out some of the pieces we need on the schema api and backend side of things.

cosmicBboy commented 1 year ago

paging folks who 👍'd this comment

@StefanBRas @AndriiG13 @kuatroka @vrd83 @FilipAisot calling for help on this effort! See https://github.com/unionai-oss/pandera/issues/1064#issuecomment-1584655803 for the mini-roadmap for adding polars support.

I'll create issues for these over the next week, just wanted to ping y'all to see how many hands we'll have in this effort.

FilipAisot commented 1 year ago

@cosmicBboy I am willing to help so count me in.

BartHelder commented 1 year ago

@cosmicBboy This looks great! Polars support in pandera is something we could really use at work, so there's a good chance we could take up a couple of issues as an internal project.

cosmicBboy commented 1 year ago

> so there's a good chance we could take up a couple of issues as an internal project.

Amazing @BartHelder ! I'll post a link here at EOW with a proper roadmap with issues linked for folks to pick up.

AndriiG13 commented 1 year ago

> I'll create issues for these over the next week, just wanted to ping y'all to see how many hands we'll have in this effort.

@cosmicBboy very happy to contribute!

StefanBRas commented 1 year ago

@cosmicBboy Looks good! What is the plan regarding LazyFrame versus DataFrame? From the mini-roadmap it's not immediately clear to me.

cosmicBboy commented 1 year ago

@StefanBRas I have a few questions:

  1. How do people use LazyFrame vs DataFrame, in terms of polars user code? It seems LazyFrame has many advantages over DataFrame, so what are the trade-offs?
  2. Is the API the same? What are the differences, if not?

Depending on the answers to these two questions, it may or may not matter a lot for the pandera implementation.

FilipAisot commented 1 year ago

> @StefanBRas I have a few questions:
>
>   1. How do people use LazyFrame vs DataFrame, in terms of polars user code? It seems LazyFrame has many advantages over DataFrame, so what are the trade-offs?
>   2. Is the API the same? What are the differences, if not?
>
> Depending on the answers to these two questions, it may or may not matter a lot for the pandera implementation.

If I may try to answer from my perspective.

  1. A LazyFrame is a representation of a Lazy computation graph/query against a DataFrame. LazyFrames are utilized when there's a need to avoid loading entire datasets into memory, either due to their large size or for efficiency reasons. With polars, LazyFrames offer the advantage of validating query correctness against DataFrame schema before data is loaded into memory, preventing potential errors. However, the tradeoff is that LazyFrames only provide information about data types, limiting direct inspection of the content of the DataFrame. Nevertheless, the benefits include increased efficiency, faster processing, and the ability to handle data larger than memory in a streaming fashion.
  2. The API for LazyFrames shares many similarities with the DataFrame API. However, certain features available on DataFrames are not present on LazyFrames, since LazyFrames by nature don't provide direct access to the exact contents of the dataframe. This limitation stems from the design choice of prioritizing memory efficiency over full data visibility. Consequently, some operations that require direct knowledge of the data may be unavailable or behave differently when working with LazyFrames.

This is my take on the questions you asked, purely from a user's perspective @cosmicBboy. I am not familiar enough with pandera internals to give a verdict on how necessary including LazyFrames is; maybe it's only needed to enforce structure on a schema. Validating streaming data might be a bit of a headache.

Disclaimer: I might miss some points so please point out anything that I missed.
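
To illustrate the trade-off described above, a minimal lazy-vs-eager sketch (the file path and column names are hypothetical):

```python
import polars as pl

# Eager: reads the whole file into memory immediately.
df = pl.read_csv("transactions.csv")

# Lazy: builds an optimized query plan; only the needed columns and
# rows are materialized, and nothing runs until .collect().
lf = (
    pl.scan_csv("transactions.csv")
    .filter(pl.col("amount") > 0)
    .select(["user_id", "amount"])
)
result = lf.collect()
```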

cosmicBboy commented 1 year ago

Thanks for the detailed explanation!

I think the order of operations would be:

From a pandera perspective, all it's doing is:

As long as the LazyFrame API supports these two things, the pandera implementation should be straightforward.

Pandera already supports pyspark.sql, which I believe at a high level also builds a query plan before executing it, so I'm fairly optimistic this can be done with polars.

In the end, the pandera validation routine would be:

```python
import polars as pl

actual_dataframe = (
    pl.scan_csv(...)  # lazily scan the source data
    ...               # a bunch of checks
    .collect()
)
```

If validation fails, it'll either raise a SchemaError(s) exception or add a pandera.errors attribute to the dataframe object as described here.
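
For reference, the pandas backend's existing error-reporting behavior, which the polars backend would presumably mirror (a minimal sketch using today's public pandera API):

```python
import pandas as pd
import pandera as pa

schema = pa.DataFrameSchema({"a": pa.Column(int, pa.Check.ge(0))})

try:
    # lazy=True collects all failures into a single SchemaErrors exception
    schema.validate(pd.DataFrame({"a": [-1, 2]}), lazy=True)
except pa.errors.SchemaErrors as exc:
    print(exc.failure_cases)  # dataframe of individual check failures
```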

StefanBRas commented 1 year ago

@cosmicBboy

  1. My impression as just a standard user is that LazyFrames are the most used. Using a LazyFrame will often give you major performance improvements, and it's free to turn a DataFrame into a LazyFrame: it's just a df.lazy() call. A lot of the methods on the DataFrame class itself cast to lazy and then collect afterwards.
  2. The API is largely similar; some methods don't exist on LazyFrame and require collecting the LazyFrame into a DataFrame.

I'm not sure if Pandera already does this, but I think there needs to be a way to explicitly distinguish between checks that require a LazyFrame to be collected into a DataFrame and checks that do not. Any check that needs to look at the actual data of a column will need to collect first.

You could consider having the interface be identical for both LazyFrame and DataFrame. Then everything that applies to both can be implemented for LazyFrame, and DataFrames can just be cast to LazyFrames.
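
A minimal sketch of that unification idea, also showing the metadata-vs-values distinction (assuming a recent polars; `collect_schema()` is the `schema` property on older versions):

```python
import polars as pl

df = pl.DataFrame({"a": [1, 2, 3]})

# DataFrame -> LazyFrame is free: no data is copied or computed,
# so a backend implemented against LazyFrame can handle both.
lf = df.lazy()

# Metadata-only validation needs no data...
assert lf.collect_schema()["a"] == pl.Int64

# ...while value-level checks (e.g. "a" >= 0) require collecting.
assert lf.select(pl.col("a").ge(0).all()).collect().item()
```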

gwerbin commented 1 year ago

@StefanBRas for whatever my experience is worth, when doing data analysis and early-stage development I tend to use the non-lazy version for easy debugging and building out test cases. Then I refactor to the lazy version once I know how it all fits together. So it would be nice if at least a nontrivial common subset of Pandera worked seamlessly between both lazy and non-lazy versions.

FilipAisot commented 1 year ago

@cosmicBboy

>   1. My impression as just a standard user is that LazyFrames are the most used. Using a LazyFrame will often give you major performance improvements, and it's free to turn a DataFrame into a LazyFrame: it's just a df.lazy() call. A lot of the methods on the DataFrame class itself cast to lazy and then collect afterwards.
>   2. The API is largely similar; some methods don't exist on LazyFrame and require collecting the LazyFrame into a DataFrame.
>
> I'm not sure if Pandera already does this, but I think there needs to be a way to explicitly distinguish between checks that require a LazyFrame to be collected into a DataFrame and checks that do not. Any check that needs to look at the actual data of a column will need to collect first.
>
> You could consider having the interface be identical for both LazyFrame and DataFrame. Then everything that applies to both can be implemented for LazyFrame, and DataFrames can just be cast to LazyFrames.

I agree: checks that require data need to explicitly state that the data will be collected. That way we avoid issues with larger-than-memory datasets.

cosmicBboy commented 1 year ago

So in terms of order of implementation, would it make sense to support LazyFrame first, and then simply call .lazy() on DataFrames under the hood?

StefanBRas commented 1 year ago

In my opinion yes, but with the caveat that I haven't worked with Pandera myself.

lior5654 commented 1 year ago

Important Note:

I think the DataFrameModel definition should be as agnostic as possible to the dataframe library used.

This would allow writing a schema once, and then one could seamlessly switch between pandas, polars, pyspark, dask, etc.

Note: Of course, except for "edge cases" (indices, struct types, etc.).
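
For reference, here's how a schema definition reads today with the class-based API (pandera >= 0.16; the column names are made up). The engine-agnostic goal would be for a definition like this to validate any supported dataframe type unchanged:

```python
import pandera as pa
from pandera.typing import Series

# Defined once; the aspiration is to reuse this across pandas,
# polars, pyspark, dask, etc. without modification.
class Transactions(pa.DataFrameModel):
    user_id: Series[int]
    amount: Series[float] = pa.Field(ge=0)
```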

cosmicBboy commented 1 year ago

> This would allow writing a schema once, and then one could seamlessly switch between pandas, polars, pyspark, dask, etc.

I think this is a worthy goal, barring a few technical challenges on making this all work nice with multiple dataframe generic types, see this issue.

For now, though, each library can get its own DataFrameModel type, and these can eventually all merge together into the one DataFrameModel to rule them all.

rmorshea commented 1 year ago

Haven't read through this whole conversation, but I wanted to drop a link to this DataFrame API standard in case it hadn't been mentioned, since it might help in creating "one DataFrameModel to rule them all".

cosmicBboy commented 11 months ago

@rmorshea I've been keeping tabs on that project! How mature would you say it is, i.e. is it ready for prime time?

rmorshea commented 11 months ago

According to the README it's not out of the draft stage. This issue from 3 weeks ago seems to suggest that things haven't quite crystallized, but it'd probably be best to ask the folks driving the project forward what the status is. If people from Pandera feel they have a vested interest in a standard like that, I'm sure it would benefit from more contributors.