Closed: fzyzcjy closed this issue 4 months ago.
Hi @fzyzcjy would love to support polars! Doing so is currently blocked by https://github.com/unionai-oss/pandera/issues/381, which I'm trying to get done ASAP, as it'll unblock support for a lot of different data frameworks, including polars.
Thanks, looking forward to it!
> Thanks, looking forward to it!
Me too!!
> Thanks, looking forward to it!
Same here, would also be happy to contribute to this one!
Same, happy to help :)
Hello, I don't really know if it helps, but I wanted to share this project:

https://github.com/kolonialno/patito

They paired Pydantic and Polars, and they offer some of the functionality Pandera offers. Maybe we could fork something or use it as inspiration?

I'm not really experienced, but I'm willing to help too. 😃
hi all! so since merging the pandera internals re-write: https://github.com/unionai-oss/pandera/pull/913

Support for polars is technically unblocked! I'm still working on the docs for extending pandera with custom schema specs and backends, but basically here's a rough roadmap for supporting polars. Support for polars (`pandera[polars]`) can come in two phases, both of which are actually independent from each other:

1. Support an `ibis` backend: this would cover `polars` along with `duckdb`, `mysql`, `postgres`, etc. In-database validation here we come! 🚀
2. Support a native pandera `polars` backend, independent of `ibis`. This will be useful if folks don't want to depend on `ibis` and want to write custom checks with the polars API.

The limitation of (1) would be that if you want to write custom checks, you'd have to do it with the ibis API. With (2), you'd be able to write custom checks (e.g. here) with the polars API.
Currently pandera has a hard dependency on pandas, which is pretty much ubiquitous in data eng/data science/ML stacks, but in case folks want to use pandera-polars in a limited context (e.g. AWS Lambda) and want to minimize dependencies, there is a longer-term plan for this. Basically, we can either:

1. Wait for the `pandera==1.0.0` release, where users have to explicitly install `pandera[pandas]` for pandas DataFrame validation, then organize the library to support other backends, e.g. `pandera[polars]`, `pandera[ibis]`, etc.
2. Split the `pandera.core` and `pandera.backend` modules out into an upstream library `pandera-core`, so that we can create a contrib or plugin package, e.g. `pandera-polars`, which doesn't have to depend on pandas and can be installed independently as `pip install pandera-polars`.
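To make option 1 concrete, the extras could be declared roughly like this in pandera's packaging metadata. This is a hypothetical sketch: the actual extra names, version pins, and dependency lists are not settled by anything in this thread.

```toml
# Hypothetical pyproject.toml fragment sketching option 1:
# one `pandera` package with per-backend optional extras,
# and no hard dependency on any single dataframe library.
[project]
name = "pandera"
version = "1.0.0"
dependencies = []  # core validation logic only

[project.optional-dependencies]
pandas = ["pandas"]
polars = ["polars"]
ibis = ["ibis-framework"]
```

With this layout, `pip install pandera[polars]` would pull in polars only, matching the "minimize dependencies" goal described above.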
Do any of you have any thoughts on this? @fzyzcjy @igmriegel @AndriiG13 @francesco086
I use both polars and pandas, so do not have any thoughts - all are totally acceptable. Good job and cannot wait to use it!
As a user it would be nice to have only one package, and for this package to have no strict dependencies. So I am clearly in favor of option one here! Otherwise we would end up with `pandera-polars`, `pandera-ibis`, `pandera-vaex`, ... The list will grow, and it will be more complicated to manage from the user's perspective.
> Do any of you have any thoughts on this? @fzyzcjy @igmriegel @AndriiG13 @francesco086
Thinking out loud (and please correct me if I am wrong):

1. With option 2, `pandera-pandas` will have a dependency on `pandera-core` (as specified in `pyproject.toml`). This means that the various `pandera-*df_engine*` packages will have the possibility to be developed at different speeds: if you make a major release of `pandera-core` you don't need to immediately update all engines, as they can keep relying on the old version. -> Option 2 is more modular.
2. With option 1, if I install `pandera` (without extras) and try to do something that requires an extra, I will get an error informing me that I need to install the extras, and I can do so straight away. With option 2 I could have compatibility issues, if for example `pandera-pandas` and `pandera-polars` rely on incompatible versions of `pandera-core` (or others). -> Option 1 makes it easier to work with many df engines in the same venv.
3. Option 2 would require `pandera-core` to have a neat public interface that can be re-used without "hacks".

All in all I am in favor of Option 2. My point 2 above is not very important in my opinion; if you really need to work with two different df engines, you can always do it in two separate venvs.
As a user I think both phases make sense. As I understand it, the ibis support would be especially nice for folks who are using different df engines in their project, since they can reuse checks defined in the ibis API across engines.
At the same time I think it's good to have a Polars native solution.
So I like both, but frankly I'm ignorant to the possible package management implications mentioned by others above.
> 2. Support a native pandera `polars` backend, independent of `ibis`. This will be useful if folks don't want to depend on `ibis` and want to write custom checks with the polars API.
I think we should not bring in a third dependency to be able to serve Polars, and I agree with francesco086's considerations about pandera-core.
Cool, thanks for the discussion all!
So re: the polars-support roadmap, I'll plan on working on the ibis backend integration as an n=2 sample for how well the pandera core/backend abstractions fit into supporting another non-pandas-API framework.
Will definitely need some help designing/implementing the polars-native backend: I'll need to ramp up on the Python polars API myself, but would anyone on this thread be willing to help out?
Eventually will also need help implementing:

- creating data synthesis strategies for polars: https://github.com/unionai-oss/pandera/blob/main/pandera/strategies/pandas_strategies.py

Please give a 👍 to this comment if you'll be able to help with one or more of the above.
The polars project itself ships data synthesis functions for use with `hypothesis`: API reference link.
One thing that would also be cool is validating polars `LazyFrame`s. A `LazyFrame` is a promise of a computation/query plan, but polars knows the schema for every step in the plan, so we can validate before running the queries. I think this can be very valuable in ETL: your pipeline is validated before it runs, not 20 minutes in.
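Since polars can report a plan's schema without executing it, validate-before-run boils down to checking that known schema against expectations. Here's a minimal, library-free sketch of the idea: `plan_schema` is a plain dict standing in for the plan's known schema, and `validate_plan_schema` is a hypothetical helper, not a pandera or polars API:

```python
# Sketch of validate-before-run: check a query plan's known schema
# (column name -> dtype string) against expectations, without ever
# materializing the data. `validate_plan_schema` is hypothetical.

def validate_plan_schema(plan_schema, expected):
    """Raise ValueError listing all mismatches between plan and expectation."""
    errors = []
    for col, dtype in expected.items():
        if col not in plan_schema:
            errors.append(f"missing column: {col}")
        elif plan_schema[col] != dtype:
            errors.append(f"{col}: expected {dtype}, got {plan_schema[col]}")
    if errors:
        raise ValueError("; ".join(errors))

# The plan's schema is known upfront, so this runs before the query,
# not 20 minutes into an ETL job.
plan_schema = {"user_id": "Int64", "amount": "Float64"}
validate_plan_schema(plan_schema, {"user_id": "Int64", "amount": "Float64"})
```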
Just want to mention that I really would like to help, but I am not familiar with polars (yet). So I think in this first phase I am probably not useful. I am very much willing to learn what is needed and implement following your directions :) (please use me!)
> Just want to mention that I really would like to help, but I am not familiar with polars (yet). So I think in this first phase I am probably not useful. I am very much willing to learn what is needed and implement following your directions :) (please use me!)

If you are familiar with pandera, please join our Discord. We can open a pandera thread and help one snippet at a time.
> One thing that would also be cool is validating polars `LazyFrame`s. A `LazyFrame` is a promise of a computation/query plan. But polars knows the schema for every step in the plan, so we can validate before running the queries. I think this can be very valuable in ETL. Your pipeline is validated before it runs, and not 20 minutes in.
@ritchie46 One important aspect to keep in mind is that pandera has schema models, which are much more than column names and types. For example, a pandera schema could describe and check the constraint `col_a + col_b = col_c`.

So I am not sure about `LazyFrame`s; I don't think it is possible to validate them before actually doing a computation.
> One thing that would also be cool is validating polars LazyFrames. A LazyFrame is a promise of a computation/query plan. But polars knows the schema for every step in the plan, so we can validate before running the queries.
I think this will be very valuable for type checking and perhaps other dataframe metadata, though a limitation would be that it wouldn't be able to apply checks on actual values (e.g. `pandera.Check.ge(0)`) before running the code (unless I'm missing something conceptually). This is fine, I think, as long as the UX for applying pandera schemas to `LazyFrame`s is clarified to cover only dataframe metadata (data types, column names, etc.)

Regardless, `LazyFrame` validation would definitely be a huge plus!
> - Support a native pandera `polars` backend, independent of `ibis`. This will be useful if folks don't want to depend on `ibis` and want to write custom checks with the polars API.
If you look to future-proof `pandera`, expand the number of users, and be strategic about it, the way to go would be to go full throttle on independent `polars` support and prioritise it over `ibis`. The `ibis` project is roughly 9 years old and has 2.5K stars; `polars` is roughly 3 years old and has 14.3K stars. Many people say stars are a vanity metric, but I disagree. They are still metrics, and they do show what is being used more.

To be clear, `ibis` is great and `polars` is great, but if it's about sequencing new feature development, allocating effort per unit of immediate usefulness, and reaching the widest possible user base, I'd suggest going for independent, full `polars` support first.

P.S. `pandera` is just awesome.
@kuatroka good feedback!
My short-term priority is still to add an `ibis` backend, since the motivation there is to enable in-database validation for a number of supported DBs (postgres, mysql, etc.). Support for this has been in demand for a while now, and the nice side effect is that it adds (experimental) support for polars.

That said, I'm all for `polars` support. Would love community contributions on this, but I owe all of you a comprehensive set of docs first on how to extend pandera with your own schema specifications and backends (I'm working on this now!)
Jumping in a little late here, but as a user of both `pandera` and `polars` (and I love both libraries), I'd be willing to contribute to make this happen, so I don't have to add `pandas` as a dependency in my pipelines and perform the validation portion on `pandas` dataframes!
Adding a +1 for Polars schemas!
@cosmicBboy keen to help 🚀
Sounds like you are prioritising the `ibis` integration, which I'd be keen to look at; do you have any workings on this yet?

Or if you want to focus on `ibis`, I could start spiking out what an independent `polars` module could look like 👌

If I understand correctly, it will be a matter of filling out the yellow `PolarsSchemaBackend` and `IbisBackend` branches?
@kykyi correct! There's a WIP PR for adding pyspark.sql native support: https://github.com/unionai-oss/pandera/pull/1213
We'll basically need to do the same for polars and ibis.
Nice, I may wait until #1213 is merged, as it looks close, and then follow the conventions it introduces rather than implementing polars from scratch. Is that reasonable @cosmicBboy?
In order to support a pandera polars backend, we'll need to implement the following:

1. `pandera.api.polars` modules: these contain the schema specification of polars data structures.
   - `container.py`: `DataFrameSchema` spec for `polars.DataFrame`
   - `array.py`: `ArraySchema` and `SeriesSchema` schema specs for `polars.Series`. Note that `ArraySchema` handles the generic case of a single vector of data, whether it's a Series or a column in a DataFrame.
   - `components.py`: `Column` schema spec for a column in a `polars.DataFrame`
   - `model.py`: `DataFrameModel` class for dataclass/pydantic-style syntax for defining a `DataFrameSchema`
   - `model_components.py`: implements `Field` and related classes/functions that capture declarative metadata that's translated into the object-based API.
2. `pandera.backends.polars` modules: these will contain backend implementations of all the classes defined in the object-based API.
   - `container.py`: `DataFrameSchemaBackend` for the polars `DataFrameSchema`.
   - `array.py`: `ArraySchemaBackend` and `SeriesSchemaBackend` for polars.
   - `components.py`: `ColumnBackend` for polars.
   - `checks.py`: implements a `PolarsCheckBackend`.
   - `builtin_checks.py`: implements the `builtin_checks` functions for polars.
3. `pandera.engines.polars_engine` module: implements pandera-compatible `DataType` classes for polars datatypes.
4. `pandera.typing.polars` module: implements generic type containers for the class-based API, allowing for `DataFrame[MyModel]` syntax.
5. `pandera.accessors.polars` module: implements the pandas-style accessors, which pandera uses under the hood to know whether a dataframe has already been validated, and which may optionally load the dataframe object with additional metadata at validation time.

Each of the modules described above has analogues in the pandas implementation. This is just an initial draft; I will fill in more details in the next few days.
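To illustrate the api/backend split in the roadmap above, here is a toy sketch. These are hypothetical classes, not pandera's actual implementation: the point is only that the schema spec is declarative while a backend knows how to validate one concrete dataframe type.

```python
# Toy sketch (hypothetical classes, not pandera's real code) of the
# split above: the `api` side holds declarative schema specs, while a
# backend validates them against a concrete dataframe representation.

class Column:
    """Declarative spec: a named column with an expected element type."""
    def __init__(self, name, dtype):
        self.name = name
        self.dtype = dtype

class DataFrameSchema:
    """Declarative spec: a collection of Column specs."""
    def __init__(self, columns):
        self.columns = columns

class DictBackend:
    """Stand-in backend that validates a dict-of-lists 'dataframe'.
    A polars backend would implement the same interface against
    polars.DataFrame instead."""
    def validate(self, schema, df):
        for col in schema.columns:
            if col.name not in df:
                raise ValueError(f"column not found: {col.name}")
            if not all(isinstance(v, col.dtype) for v in df[col.name]):
                raise ValueError(f"wrong dtype in column: {col.name}")
        return df

schema = DataFrameSchema([Column("a", int), Column("b", str)])
validated = DictBackend().validate(schema, {"a": [1, 2], "b": ["x", "y"]})
```

Because the spec never touches the data directly, the same `DataFrameSchema` could be handed to any backend that implements `validate`, which is the property the module layout above is designed around.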
`pandera.engines.polars_engine` seems straightforward, and there is a lot of convention to follow, so it could be a good start. I'll get cracking!
Great that this is picked up!
thanks @kykyi! once `0.16.0` is out I'll be able to turn my attention to stubbing out some of the pieces we need on the schema API and backend side of things.
paging folks who 👍'd this comment
@StefanBRas @AndriiG13 @kuatroka @vrd83 @FilipAisot calling for help on this effort! See https://github.com/unionai-oss/pandera/issues/1064#issuecomment-1584655803 for the mini-roadmap for adding polars support.
I'll create issues for these over the next week, just wanted to ping y'all to see how many hands we'll have in this effort.
@cosmicBboy I am willing to help so count me in.
@cosmicBboy This looks great! Polars support in pandera is something we could really use at work, so there's a good chance we could take up a couple of issues as an internal project.
> so there's a good chance we could take up a couple of issues as an internal project.
Amazing @BartHelder ! I'll post a link here at EOW with a proper roadmap with issues linked for folks to pick up.
> I'll create issues for these over the next week, just wanted to ping y'all to see how many hands we'll have in this effort.
@cosmicBboy very happy to contribute!
@cosmicBboy Looks good! What is the plan regarding LazyFrame versus DataFrame? From the mini-roadmap it's not immediately clear to me.
@StefanBRas I have a few questions:

- How do people use `LazyFrame` vs `DataFrame`? In terms of polars user code, that is. It seems LazyFrame has many advantages over DataFrame, so what are the trade-offs?
- Is the API the same? What are the differences, if not?

Depending on the answers to these two questions it may or may not matter a lot in terms of the pandera implementation.
> @StefanBRas I have a few questions:
>
> - How do people use `LazyFrame` vs `DataFrame`? In terms of polars user code, that is. It seems LazyFrame has many advantages over DataFrame, so what are the trade-offs?
> - Is the API the same? What are the differences, if not?
>
> Depending on the answers to these two questions it may or may not matter a lot in terms of the pandera implementation.
If I may try to answer from my perspective.
This is my take on the questions you asked, purely from a user's perspective, @cosmicBboy. I am not really familiar enough with pandera internals to give a verdict on how necessary including LazyFrames is; maybe only to be able to enforce structure upon a schema. Validating streaming data might be a bit of a headache.
Disclaimer: I might miss some points so please point out anything that I missed.
Thanks for the detailed explanation!
I think the order of operations would be:

1. implement support for `polars.DataFrame`
2. then `polars.LazyFrame`, which shouldn't be that heavy of a lift

From a pandera perspective, all it's doing is:

1. applying `Check` functions to various columns of the dataframe, or to the entire dataframe itself

As long as the LazyFrame API supports these things, the pandera implementation should be straightforward.
Pandera already supports pyspark.sql, which I believe at a high level also separates building a query plan from executing it, so I'm fairly optimistic this can be done with polars.
In the end, the pandera validation routine would be:

```python
actual_dataframe = (
    pl.scan_csv(...)
    ...  # a bunch of checks
    .collect()
)
```

If validation fails it'll either raise a `SchemaError(s)` exception or add a `pandera.errors` attribute to the dataframe object, as described here.
@cosmicBboy

- My impression as just a standard user is that LazyFrames are most used. Using a LazyFrame will often give you major performance improvements, and it's free to make a `DataFrame` into a `LazyFrame`: it's just a `df.lazy()` call. A lot of the methods on the `DataFrame` class itself cast to lazy and then collect after.
- The API is largely similar; some methods do not exist on LazyFrame and require collecting the LazyFrame into a DataFrame.

I'm not sure if Pandera already does this, but I think there needs to be a way to explicitly distinguish between checks that require a LazyFrame to be collected into a DataFrame and checks that do not. Any check that needs to look at the actual data of a column will need to collect first.

You could consider having the interface be identical for both LazyFrame and DataFrame. Then everything that applies to both can be implemented for LazyFrame, and DataFrames can just be cast to LazyFrames.
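The distinction described above could be sketched like this (hypothetical names, pure Python, not a pandera API): each check declares whether it needs materialized data, and collection happens at most once, and only when a data-level check is present.

```python
# Sketch of metadata checks vs. data checks (hypothetical names):
# `requires_data` marks checks that need the actual values, so a lazy
# plan is only ever collected when such a check is present.

from dataclasses import dataclass
from typing import Callable

@dataclass
class Check:
    fn: Callable
    requires_data: bool  # True -> must materialize the data first

def run_checks(checks, schema, collect):
    """`schema` is plan metadata; `collect` materializes the data lazily."""
    data = None
    for check in checks:
        if check.requires_data:
            if data is None:
                data = collect()  # collect once, only if needed
            assert check.fn(data), "data check failed"
        else:
            assert check.fn(schema), "metadata check failed"

checks = [
    Check(lambda s: "amount" in s, requires_data=False),          # metadata only
    Check(lambda d: all(v >= 0 for v in d["amount"]), requires_data=True),
]
run_checks(checks, {"amount": "Float64"}, lambda: {"amount": [1.0, 2.5]})
```

A schema containing only metadata checks would then validate a lazy plan without ever collecting it, which also sidesteps the larger-than-memory concern raised below.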
@StefanBRas for whatever my experience is worth, when doing data analysis and early-stage development I tend to use the non-lazy version for easy debugging and building out test cases. Then I refactor to the lazy version once I know how it all fits together. So it would be nice if at least a nontrivial common subset of Pandera worked seamlessly between both lazy and non-lazy versions.
> @cosmicBboy
>
> - My impression as just a standard user is that LazyFrames are most used. Using a LazyFrame will often give you major performance improvements, and it's free to make a `DataFrame` into a `LazyFrame`: it's just a `df.lazy()` call. A lot of the methods on the `DataFrame` class itself cast to lazy and then collect after.
> - The API is largely similar; some methods do not exist on LazyFrame and require collecting the LazyFrame into a DataFrame.
>
> I'm not sure if Pandera already does this, but I think there needs to be a way to explicitly distinguish between checks that require a LazyFrame to be collected into a DataFrame and checks that do not. Any check that needs to look at the actual data of a column will need to collect first.
>
> You could consider having the interface be identical for both LazyFrame and DataFrame. Then everything that applies to both can be implemented for LazyFrame, and DataFrames can just be cast to LazyFrames.
I agree, checks that require data need to explicitly state that the data will be collected. This way we avoid issues with larger than memory datasets.
So in terms of order of implementation, would it make sense to support `LazyFrame` first, and then simply call `.lazy()` on `DataFrame`s under the hood?
In my opinion yes, but with the caveat that I haven't worked with Pandera myself.
Important Note:
I think the DataFrameModel definition should be as agnostic as possible to the dataframe library used.
This would allow writing a schema once, and then one can seamlessly switch between pandas, polars, pyspark, dask, etc.

Note: of course, except for "edge cases" (indices, struct types, etc.).
> This would allow writing a schema once, and then one can seamlessly switch between pandas, polars, pyspark, dask, etc.
I think this is a worthy goal, barring a few technical challenges in making this all work nicely with multiple generic dataframe types; see this issue.

For now, though, each library can get its own `DataFrameModel` type; these can eventually all merge together into the one `DataFrameModel` to rule them all.
Haven't read through this whole conversation, but I wanted to drop a link to this DataFrame API standard in case it hadn't been mentioned and, if it hadn't, so that it might help in creating "one `DataFrameModel` to rule them all".
@rmorshea I've been keeping tabs on that project! How mature would you say it is, i.e. is it ready for prime time?
According to the README it's not out of the draft stage. This issue from 3 weeks ago seems to suggest that things haven't quite crystalized, but it'd probably be best to ask the folks driving the project forward what the status is. If people from Pandera feel like they have a vested interest in a standard like that, I'm sure it would benefit from more contributors.
Hi, thanks for the lib! I wonder if it can support type checking for `polars`?