Closed antonl closed 3 years ago
thanks @antonl!
It looks like this behavior is also true for columnar/index checks:
    @pandera.check("a")
    def is_positive(cls, series):
        return series > 0  # or something
In this case the method is_positive won't be serialized.
I think there are a couple of ideas to come out of this problem:
(1) allow built-in/registered checks to be specified in Config so that they can be serialized via to_yaml or to_script.
(2) support serializing custom check methods defined in a pa.SchemaModel subclass (I'm not sure if this is something that you'd want?). This sort of implies that custom check methods defined in pa.SchemaModel should also be translated into plain pandera.Checks when calling the to_schema method.
(1) would be fairly straightforward to implement, since we just need to allow for built-in/registered checks to be specified as Config attributes, and the existing io module should handle serializing any checks with a defined check.statistics attribute (see here).
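To make idea (1) concrete, here is a minimal, pandera-free sketch of why a check that declares its statistics is easy to serialize while an in-line one is not. Every name in it (REGISTRY, register, ToyCheck, registered_check, serialize_checks, deserialize_checks) is hypothetical and invented for illustration; it only mimics the idea of a registered check carrying a statistics mapping and is not how pandera's io module is actually written.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List, Optional

# Hypothetical registry: check name -> implementation.
REGISTRY: Dict[str, Callable[..., bool]] = {}


def register(name: str):
    """Toy stand-in for an extensions API: store a check under a name."""
    def decorator(fn):
        REGISTRY[name] = fn
        return fn
    return decorator


@register("greater_than")
def greater_than(values, *, min_value):
    return all(v > min_value for v in values)


@dataclass
class ToyCheck:
    fn: Callable[..., bool]
    name: Optional[str] = None  # only set for registered checks
    statistics: Dict[str, Any] = field(default_factory=dict)

    def __call__(self, values) -> bool:
        return self.fn(values, **self.statistics)


def registered_check(name: str, **statistics: Any) -> ToyCheck:
    return ToyCheck(REGISTRY[name], name=name, statistics=statistics)


def serialize_checks(checks: List[ToyCheck]) -> List[Dict[str, Any]]:
    """Keep only checks that can be rebuilt from a name plus statistics."""
    return [
        {"check": c.name, "statistics": dict(c.statistics)}
        for c in checks
        if c.name is not None
    ]


def deserialize_checks(payload: List[Dict[str, Any]]) -> List[ToyCheck]:
    return [registered_check(p["check"], **p["statistics"]) for p in payload]


checks = [
    registered_check("greater_than", min_value=0),
    ToyCheck(lambda values: all(v > 0 for v in values)),  # in-line, unnamed
]
payload = serialize_checks(checks)
restored = deserialize_checks(payload)
```

In this toy model the in-line lambda is dropped because nothing but a name and its parameters can be written out and looked up again later, which is the same reason a registered check with statistics round-trips cleanly.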
(2) would require a little bit more thought, since these are currently meant to be unique to a particular SchemaModel (i.e. only that class or its subclasses have access to that check). Conceptually, this maps onto the object-based API (DataFrameSchema) in terms of in-line checks pandera.Check(lambda series: series > 0) or a function that isn't registered via the extensions API:
    def is_positive(series):
        return series > 0
    pandera.Check(is_positive)
(3) add a to_yaml and to_script method to the SchemaModel class? This would be orthogonal to (1) and (2). This also reminds me of the conversation in #393, which proposes a DataFrameSchema.to_model method. I think it makes sense to tackle these as separate issues.
I'm not yet sure if (2) is a good idea, but what do you think about supporting (1) first, and also perhaps throwing a UserWarning when converting a SchemaModel into a DataFrameSchema when there are custom checks that can't be serialized?
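The warning suggested here could look roughly like the sketch below. It is deliberately independent of pandera: CheckSpec and to_serializable are hypothetical names, and the field layout is an assumption chosen only to show the pattern of warning about custom checks instead of silently dropping them.

```python
import warnings
from typing import Any, Callable, Dict, List, NamedTuple, Optional


class CheckSpec(NamedTuple):
    """Hypothetical description of a check attached to a schema."""
    fn: Callable[..., Any]
    registered_name: Optional[str] = None  # None means "custom, unregistered"
    statistics: Optional[Dict[str, Any]] = None


def to_serializable(checks: List[CheckSpec]) -> List[Dict[str, Any]]:
    """Return serializable checks and warn about each one that is skipped."""
    out = []
    for check in checks:
        if check.registered_name is None:
            warnings.warn(
                f"Custom check {check.fn.__name__!r} cannot be serialized; "
                "register it via the extensions API to include it.",
                UserWarning,
                stacklevel=2,
            )
            continue
        out.append(
            {"check": check.registered_name, "statistics": check.statistics or {}}
        )
    return out


def is_positive(series):
    return all(v > 0 for v in series)


# Record warnings so the behavior can be inspected instead of printed.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    result = to_serializable(
        [
            CheckSpec(is_positive),  # custom: skipped with a warning
            CheckSpec(is_positive, "greater_than", {"min_value": 0}),
        ]
    )
```

A stricter variant of the same loop would raise NotImplementedError in place of the warnings.warn call, at the cost of making any schema with a custom check impossible to serialize at all.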
- support checks in Config and make sure they can be serialized in yaml format
- add a to_yaml convenience method to SchemaModel and raise a UserWarning to let the user know custom check methods can't be serialized and to register checks via the extensions API
- add DataFrameSchema.to_model() to dynamically create a schema model
- add a SchemaModel.to_script() method to write out an inferred schema model into a python script, since SchemaModels are 🔥 right now
I agree that (2) is not a good idea, so raising a UserWarning is a good approach. For the case that you'd want "reusable validators", the syntax along the lines of that discussed in #383 is great, where you'd register new validators using the extension API.
I should have perhaps split out the SchemaModel case. I use them purely as a convenience; improving the DataFrameSchema is definitely a higher priority. I only really care that registered extension checks get serialized in the DataFrameSchema.to_yaml() call.
Cool! I just added this issue to the 0.8.0 release milestone. Things are getting busy in the pandera project, and I won't have time personally to work on #383 and #419 for a few months. Let me know if you'd be down to contribute to these issues, and I can help guide you through env setup and parts of the codebase that need changing!
@cosmicBboy sounds good. I can help out next week.
fixed by #428
Describe the bug
When attaching global dataframe checks to either SchemaModel or DataFrameSchema, the to_yaml call silently drops those checks.
[ ] I have confirmed this bug exists on the latest version of pandas.
[ ] (optional) I have confirmed this bug exists on the master branch of pandas.
Note: Please read this guide detailing how to provide the necessary information for us to reproduce your bug.
Code Sample, a copy-pastable example
Expected behavior
Ideally, I expected the registered check to be serialized. If that's not convenient, I expect a NotImplementedError to be raised.
Additional context
Reported in response to #383.