unionai-oss / pandera

A light-weight, flexible, and expressive statistical data testing library
https://www.union.ai/pandera
MIT License
3.26k stars 304 forks

Support for `polars` #1064

fzyzcjy closed 4 months ago

fzyzcjy commented 1 year ago

Hi, thanks for the lib! I wonder if it can support type checking for polars?

FilipAisot commented 12 months ago

Are we starting with this thing? I am ready to do some work! Let's get the ball rolling.

cosmicBboy commented 11 months ago

@FilipAisot yes! I was pulled in a different direction for the past few weeks, but will have some bandwidth now to help push this along.

I just made a new polars-dev branch to keep track of all the work for polars support. I'll be pushing up a few changes by the end of this week with stub modules for all the basic pieces needed; then we can divvy up the work across the schema, components, checks, model, and type engine, as described here.

cosmicBboy commented 11 months ago

Okay, to all the folks interested in contributing to this effort: let's kick off development work to support polars LazyFrames!

Head over here if you just want to start digging into the code 👉 https://github.com/unionai-oss/pandera/pull/1373

The PR contains basic functionality and unit tests for supporting pl.LazyFrame validation.

Efforts

The major pieces of work are:

  1. Implement an api and backends module for each polars data structure we want to support. Basic pl.LazyFrame support is here: DataFrameSchema, DataFrameSchemaBackend.
  2. Built-in checks: This would cover the currently available built-in checks. See the ge check here.
  3. Pandera type system integration: pandera has a type system for machine- and logical- datatypes (see here for details). This will essentially be a mapping between polars datatypes and the pandera standard data types. Since polars uses Arrow, a widely used data type system, it would be a good time to implement this.
  4. Implement DataFrameModel support for LazyFrames. This would allow for the dataclass-like schema definitions for dataframes.
  5. Consolidate DataFrameSchema API: This is sort of a meta task after 1-4 are more complete, but this would involve attempting to create a common, shared DataFrameSchema definition such that a single schema can validate pandas, pyspark, and polars DataFrames (this is something I can own).
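To make item 3 a bit more concrete, here is a library-free sketch of what such a mapping could look like. Everything in it (`StandardType`, `POLARS_TO_STANDARD`, `to_standard`) is a hypothetical name made up for illustration, not pandera's actual type engine, and polars dtypes are represented by their names as plain strings so the snippet runs without polars installed.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class StandardType:
    """Hypothetical backend-agnostic data type (illustration only)."""

    kind: str
    bits: Optional[int] = None


# Hypothetical registry: polars dtype name -> standard type.
# A real type engine would key on the dtype objects, not on strings.
POLARS_TO_STANDARD = {
    "Int32": StandardType("int", 32),
    "Int64": StandardType("int", 64),
    "Float64": StandardType("float", 64),
    "Utf8": StandardType("str"),
    "Boolean": StandardType("bool"),
}


def to_standard(polars_dtype_name: str) -> StandardType:
    """Look up the standard type for a polars dtype name, failing loudly if unmapped."""
    try:
        return POLARS_TO_STANDARD[polars_dtype_name]
    except KeyError:
        raise TypeError(
            f"no standard type registered for polars dtype {polars_dtype_name!r}"
        ) from None


print(to_standard("Int64"))
```

The point of a single lookup table is that the schema layer only ever talks about standard types, so the same schema definition could in principle be resolved against pandas, pyspark, or polars dtypes by swapping the registry.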

For the rest of 1-4, if anyone's down to contribute to one or more of these efforts, please say so in the comments below. I can help point you in the right direction and discuss (perhaps in discord if you want to sync up there).

Initial Prototype

The PR referenced above currently contains a basic proof of concept.

For now, you can pipe a schema through a query, which will implicitly call ldf.collect() for all of the metadata and data value checks:

import polars as pl
import pandera.polars as pa
from pandera import Check as C

ldf = pl.DataFrame({"string_col": ["a", "b", "c"], "int_col": [0, 1, 2]}).lazy()

schema = pa.DataFrameSchema(
    {
        "string_col": pa.Column(pl.Utf8),
        "int_col": pa.Column(pl.Int64, C.ge(0)),
    }
)

q = ldf.pipe(schema.validate)
df = q.collect()

Invalid data raises an error:

invalid_ldf = pl.DataFrame({"string_col": ["a", "b", "c"], "int_col": [-1, 1, 2]}).lazy()
q = invalid_ldf.pipe(schema.validate, lazy=True)
q.collect()

SchemaErrors: Schema None: A total of 1 errors were found.

shape: (2, 5)
          ┌──────────────┬────────────────┬─────────┬─────────────────────────────┬──────────────┐
          │ failure_case ┆ schema_context ┆ column  ┆ check                       ┆ check_number │
          │ ---          ┆ ---            ┆ ---     ┆ ---                         ┆ ---          │
          │ i64          ┆ str            ┆ str     ┆ str                         ┆ i32          │
          ╞══════════════╪════════════════╪═════════╪═════════════════════════════╪══════════════╡
          │ -1           ┆ Column         ┆ int_col ┆ greater_than_or_equal_to(0) ┆ 0            │
          └──────────────┴────────────────┴─────────┴─────────────────────────────┴──────────────┘

In exploring polars' programming model, I found some cool things we can do with the pandera internals, like decoupling validation at query definition time (just checking the column data types) from validation at query collection time (the data value checks that pandera does). I think this is a great follow-up effort once the basic functionality is implemented.
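A rough, library-free sketch of that two-phase idea, assuming a schema is just a dict of column specs. The names `ColumnSpec`, `check_schema`, and `check_values` are hypothetical and are not pandera internals; plain Python types and lists stand in for polars dtypes and collected data.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple


@dataclass
class ColumnSpec:
    """Hypothetical column spec: an expected dtype plus named value checks."""

    dtype: type
    checks: List[Tuple[str, Callable[[object], bool]]] = field(default_factory=list)


def check_schema(declared: Dict[str, type], spec: Dict[str, ColumnSpec]) -> List[str]:
    """Phase 1 (query definition time): compare declared dtypes only, no data needed."""
    errors = []
    for name, column in spec.items():
        if name not in declared:
            errors.append(f"missing column {name!r}")
        elif declared[name] is not column.dtype:
            errors.append(
                f"column {name!r}: expected {column.dtype.__name__}, "
                f"got {declared[name].__name__}"
            )
    return errors


def check_values(
    data: Dict[str, list], spec: Dict[str, ColumnSpec]
) -> List[Tuple[object, str, str]]:
    """Phase 2 (query collection time): run value checks on materialized data."""
    failures = []
    for name, column in spec.items():
        for check_name, predicate in column.checks:
            failures.extend(
                (value, name, check_name)
                for value in data[name]
                if not predicate(value)
            )
    return failures


spec = {
    "string_col": ColumnSpec(str),
    "int_col": ColumnSpec(int, [("greater_than_or_equal_to(0)", lambda v: v >= 0)]),
}

# Phase 1 only needs the declared schema, so it can run before any data exists.
schema_errors = check_schema({"string_col": str, "int_col": int}, spec)

# Phase 2 needs the actual values, i.e. the equivalent of a collected frame.
value_failures = check_values(
    {"string_col": ["a", "b", "c"], "int_col": [-1, 1, 2]}, spec
)

print(schema_errors)
print(value_failures)
```

The first phase is cheap because it never touches the data, so it could fail fast while the query is still being built; the second phase is the expensive one and is the only part that has to wait for collection.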

FilipAisot commented 11 months ago

Happy to be of help @cosmicBboy. Point me in any direction you see fit. We can also discuss it on Discord.

AndriiG13 commented 11 months ago

I would definitely need some time to go over the code to get an understanding, but I'm keen to look into 'Built-in checks'!

ilyanoskov commented 8 months ago

This is very much needed 🙏

cosmicBboy commented 8 months ago

@ilyanoskov heard! I took a few weeks' break from pandera, but am back now and will continue work on this.

ilyanoskov commented 8 months ago

@cosmicBboy thank you very much for all your amazing work with Pandera!

leycec commented 7 months ago

@beartype lead @leycec here. @beartype has officially supported Pandera for a few release cycles now. We're Team Pandera.

I'm increasingly fielding feature requests like beartype/beartype#329, where users are begging for generic typing of Pandas and Polars DataFrame objects. Polars is rapidly eating Pandas' lunch, thanks to being intrinsically multithreaded and stupidly fast. This is sorta like how JAX rapidly ate NumPy and SciPy's lunch... and for the exact same reason.

tl;dr: When Pandera does this, Pandera wins GitHub. Please win GitHub.

cosmicBboy commented 4 months ago

alright folks! With the docs update PR https://github.com/unionai-oss/pandera/pull/1613 and many bugfixes that were unearthed during the beta, official polars support is ready for prime time 🚀

Gonna cut a 0.19.0 release tonight. I suspect there will be more bugs after this, so please give it a try and report them here!

blais commented 4 months ago

That's really great!

yehoshuadimarsky commented 4 months ago

amazing!

cosmicBboy commented 4 months ago

Here it is: https://github.com/unionai-oss/pandera/releases/tag/v0.19.0 🚀. Again, I wanted to thank everyone who contributed PRs, filed bug reports, and provided overall good vibes in supporting this feature 🙂 It was super fun for me to learn polars.

Please open bug reports, feature requests, and PRs (especially for things that you may want from pandera's existing feature set that aren't currently supported).

kszlim commented 3 months ago

Curious if anyone knows whether https://pandera.readthedocs.io/en/stable/pydantic_integration.html#pydantic-integration is going to be supported for polars and whether there's a tracking issue for that?

cosmicBboy commented 3 months ago

@kszlim this wasn't in scope for the initial integration, but feel free to make an issue!

philiporlando commented 2 months ago

A lot of breaking changes have been introduced in the polars 1.0 release. Are there plans for pandera to support this major release?

cosmicBboy commented 2 months ago

> A lot of breaking changes have been introduced in the polars 1.0 release. Are there plans for pandera to support this major release?

We should absolutely support polars 1. Can you make an issue outlining what the breaking changes are with respect to the parts of the api used in pandera?