unionai-oss / pandera

A light-weight, flexible, and expressive statistical data testing library
https://www.union.ai/pandera
MIT License
3.27k stars 305 forks source link

`to_yaml` silently drops global checks #419

Closed antonl closed 3 years ago

antonl commented 3 years ago

Describe the bug When attaching global dataframe checks to either SchemaModel or DataFrameSchema, the to_yaml call silently drops those checks.

Note: Please read this guide detailing how to provide the necessary information for us to reproduce your bug.

Code Sample, a copy-pastable example

import pandera
import pandera.extensions as extensions
import pandera.typing as pat

class GlobalSerdes(pandera.SchemaModel):
    """initial schema that includes a global check"""
    a: pat.Series[pat.Int64]
    b: pat.Series[pat.Int64]

    @pandera.dataframe_check()
    def _a_gt_b(self, df):
        """ensure that every `a` is greater than every `b`."""
        return (df["a"] > df["b"]).all()

class PostSerdes(pandera.SchemaModel):
    """after serdes, it looks like this"""
    a: pat.Series[pat.Int64]
    b: pat.Series[pat.Int64]

def test_schema_model_serdes():
    """demonstrate that to_yaml strips custom global checks"""
    initial = GlobalSerdes.to_schema().to_yaml()
    final = PostSerdes.to_schema().to_yaml()

    assert initial != final, "global check was stripped"

def test_data_frame_schema():
    """demonstrate that to_yaml fails to serialize registered global checks"""

    @extensions.register_check_method(statistics=["a_gt_b"])
    def is_a_gt_b(pandas_obj, *, a_gt_b: bool):
        return (pandas_obj["a"] > pandas_obj["b"]).all() == a_gt_b

    initial = PostSerdes.to_schema().to_yaml()

    _ = PostSerdes.to_schema()
    _.checks.append(pandera.Check.is_a_gt_b(a_gt_b=True))

    final = _.to_yaml()

    assert initial != final, "registered global custom check failed"

if __name__ == "__main__":
    test_schema_model_serdes()
    test_data_frame_schema()

Expected behavior

Ideally, I expected the registered check to be serialized. If that's not convenient, I expect a NotImplementedError to be raised.

Additional context

Reported in response to #383.

cosmicBboy commented 3 years ago

mathanks @antonl!

It looks like this behavior is also true for columnar/index checks:

@pandera.check("a"):
def is_positive(cls, series):
    return series > 0  # or something

In this method is_positive won't be serialized.

I think there are a couple of ideas to come out of this problem:

  1. support built-in dataframe-level checks as specified in https://github.com/pandera-dev/pandera/issues/383#issuecomment-759103344, which should also include support for serializing built-in or registered custom checks (via extension API) in to_yaml or to_script.
  2. support serialization of custom check methods (field-level or dataframe-level hecks) defined in a pa.SchemaModel subclass (I'm not sure if this is something that you'd want?). This sort of implies that custom check methods defined in pa.SchemaModel should also be translated into plain pandera.Checks when calling the to_schema method.

(1) would be fairly straightforward to implement, since we just need to allow for built-in/registered checks to be specified as Config attributes, and the existing io module should handle serializing any checks with a defined check.statistics attribute (see here).

(2) would require a little bit more thought, since these are currently meant to be unique to a particular SchemaModel (i.e. only that class or its subclasses have access to that check). Conceptually, this maps onto the object-based API (DataFrameSchema) in terms of in-line checks pandera.Check(lambda series: series > 0) or a function that isn't registered via the extensions API:

def is_positive(series):
    return series > 0

pandera.Check(is_positive)

(3) add a to_yaml and to_script method to the SchemaModel class? This would be orthogonal to (1) and (2). This also reminds the conversation in #393, which proposes a DataFrameSchema.to_model method. I think it makes sense to tackle these as separate issues.

I'm not yet sure if (2) is a good idea, but what do you think about supporting (1) first, and also perhaps throwing a UserWarning when converting a SchemaModel into a DataFrameSchema when there are custom checks that can't be serialized?

In Summary

antonl commented 3 years ago

I agree that (2) is not a good idea, so raising a UserWarning is a good approach. For the case that you'd want "reusable validators", the syntax along the lines of that discussed in #383 is great, where you'd register new validators using the extension API.

I should have perhaps split out the SchemaModel case. I use them purely as a convenience, improving the DataFrameSchema is definitely a higher priority. I only really care that registered extension checks get serialized in the DataFrameSchema.to_yaml() call.

cosmicBboy commented 3 years ago

Cool! I just added this issue to the 0.8.0 release milestone. Things are getting busy in the pandera project, and I won't have time personally to work on #383 and #419 for a few months. Let me know if you'd be down to contribute to these issues, and I can help guide you through env setup and parts of the codebase that need changing!

antonl commented 3 years ago

@cosmicBboy sounds good. I can help out next week.

cosmicBboy commented 3 years ago

fixed by #428