unionai-oss / pandera

A light-weight, flexible, and expressive statistical data testing library
https://www.union.ai/pandera
MIT License
3.37k stars 310 forks source link

SchemaModel support for columns which are not valid python variable names #324

Closed Fesche closed 3 years ago

Fesche commented 3 years ago

The schema model syntax today does not allow for column names which make Python throw a SyntaxError. Clearly there's no getting around the SyntaxError, so some workaround should be implemented.

Pydantic allows for aliasing of Fields. A similar functionality should solve the problem.

The problem at hand:

image

Solution?

image

Thanks @cosmicBboy for answering my questions! Big fan of the project, Bjørn

cosmicBboy commented 3 years ago

Thanks for this feature request @Fesche, I think it'll be very useful :)

@jeffzi I was wondering if you have any additional thoughts about this, since we did talk about it here: https://github.com/pandera-dev/pandera/pull/258#issuecomment-693692157

I had this quick thought about custom check methods, where users can specify the class attribute name or the alias in a check, although I'm not sure what extra benefit the latter offers, so maybe scratch that :)

class Schema(SchemaModel):
    col1: Series[int]
    # attribute must be a valid python variable name, col_2020 could be anything
    col_2020: Series[int] = Field(alias=2020)

    # for custom validation methods, either reference the attribute name
    @check("col_2020")
    def custom_check_attr(cls, a: Series[int]) -> Series[bool]:
        return a > 0

    # or supply `alias=True` to reference the alias
    @check(2020, alias=True)
    def custom_check_alias(cls, a: Series[int]) -> Series[bool]:
        return a < 100
jeffzi commented 3 years ago

I agree with adding alias.

we did talk about it here: #258 (comment)

At that time we overlooked the use-case highlighted here 🙃

I had this quick thought about custom check methods, where users can specify the class attribute name or the alias in a check...

I think the alias should overwrite the variable name. That's what pydantic does. It simplifies the implementation and clarifies the user code since the final schema columns will always be referenced by their alias (except in the declaration obviously).

It's straightforward to add this feature since the underlying FieldInfo already asks for a name when it converts to column or index. See https://github.com/pandera-dev/pandera/blob/31fe07b30b267527ecea7f6b1435f49b6b963abf/pandera/model_components.py#L73-L80

We'll need to first collect the optional aliases from the fields, construct the final names and modify this line: https://github.com/pandera-dev/pandera/blob/31fe07b30b267527ecea7f6b1435f49b6b963abf/pandera/model.py#L102

cosmicBboy commented 3 years ago

I think the alias should overwrite the variable name. That's what pydantic does

cool, this makes sense to me!

@jeffzi or @Fesche let me know if either of you want to work on this, I'm still working on a fairly heavy-lift-of-a-PR https://github.com/pandera-dev/pandera/pull/314 and won't be able to get to this for a few weeks.

jeffzi commented 3 years ago

I can work on it once #326 has been merged. Modifications will need to happen in one piece of code modified by that issue.

cosmicBboy commented 3 years ago

cool, thanks @jeffzi, just merged #326

cosmicBboy commented 3 years ago

@fesche #329 should address this issue, coming out in the next 0.6.0 release (or you can clone master branch and install locally if you wanna try it out early 😃)

Fesche commented 3 years ago

Awesome, that was faster than I'd hoped for! Hats off to you guys! 🥇