unionai-oss / pandera

A light-weight, flexible, and expressive statistical data testing library
https://www.union.ai/pandera
MIT License

fix: add List, Dict, Tuple and NamedTuple to the GenericDType bound #1556

Open sam-goodwin opened 1 month ago

sam-goodwin commented 1 month ago

Closes https://github.com/unionai-oss/pandera/issues/1555

codecov[bot] commented 1 month ago

Codecov Report

All modified and coverable lines are covered by tests :white_check_mark:

Project coverage is 83.07%. Comparing base (4df61da) to head (8ab65a5). Report is 75 commits behind head on main.

:exclamation: Current head 8ab65a5 differs from pull request most recent head 9dc8ed5. Consider uploading reports for the commit 9dc8ed5 to get more accurate results

Additional details and impacted files

```diff
@@             Coverage Diff             @@
##             main    #1556       +/-   ##
===========================================
- Coverage   94.29%   83.07%   -11.22%
===========================================
  Files          91      111       +20
  Lines        7024     8191     +1167
===========================================
+ Hits         6623     6805      +182
- Misses        401     1386      +985
```


cosmicBboy commented 1 month ago

Thanks @sam-goodwin, see https://pandera.readthedocs.io/en/latest/CONTRIBUTING.html#set-up-pre-commit for steps to make sure linters and unit tests are passing. You'll also need to sign your commits: https://pandera.readthedocs.io/en/latest/CONTRIBUTING.html#dco-signing-commits

cosmicBboy commented 1 month ago

Mypy errors:

tests/core/test_typing.py:498: error: "list" is not subscriptable, use "typing.List" instead  [misc]
tests/core/test_typing.py:499: error: "dict" is not subscriptable, use "typing.Dict" instead  [misc]
tests/core/test_typing.py:500: error: "tuple" is not subscriptable, use "typing.Tuple" instead  [misc]

Note that pandera needs to support python 3.8 as well, so we need to use the generic types in the typing module.
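For illustration, here is a minimal sketch of the `typing`-module spellings that mypy is asking for. The alias names (`IntList`, `StrIntDict`, `IntStrPair`) are made up for this example and do not come from the PR; only standard-library calls are used.

```python
from typing import Dict, List, Tuple, get_args, get_origin

# Hypothetical aliases, written with typing generics so they are valid on
# Python 3.8, where subscripting the built-ins (list[int], dict[str, int],
# tuple[int, str]) is not supported.
IntList = List[int]
StrIntDict = Dict[str, int]
IntStrPair = Tuple[int, str]

# The typing generics still resolve to the plain built-in containers.
print(get_origin(IntList), get_args(IntList))
print(get_origin(StrIntDict), get_args(StrIntDict))
print(get_origin(IntStrPair), get_args(IntStrPair))
```

Since `get_origin` returns the built-in container in each case, switching the test annotations to `typing.List`, `typing.Dict`, and `typing.Tuple` should change only the spelling, not the types being described.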

Failing unit test:

FAILED tests/core/test_typing.py::test_complex_python_collection_types - pandera.errors.SchemaError: expected series 'list' to have type list[pandera.dtypes.Int32]:
failure cases:
   index failure_case
0      0       [1, 2]
1      1    [3, 4, 5]

Looks like you need to use the built-in int type? pandera.dtypes.Int32 translates to the numpy dtype for pandas columns.

sam-goodwin commented 1 month ago

> Looks like you need to use the built-in int type? pandera.dtypes.Int32 translates to the numpy dtype for pandas columns.

Do you mean we can't specify ints with specific precision in a List or Dict in pandera?

cosmicBboy commented 3 weeks ago

> Do you mean we can't specify ints with specific precision in a List or Dict in pandera?

This just follows the way pandas handles data. Columns containing list or dict objects have dtype object: each cell is a plain Python object, not a numpy array. This might be different for pyarrow data representations, but that'll be something to tackle when adding pyarrow support (https://github.com/unionai-oss/pandera/issues/1262).

In summary, pandera.dtypes.Int32 maps onto numpy.int32, and list[numpy.int32] isn't meaningful in the context of pandas. list[int] is, though: such a column just contains lists of Python ints.
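The point can be checked with pandas and numpy alone. The sketch below rebuilds the two failure-case rows from the test output as a Series (the variable names are illustrative, not taken from the PR's test) and inspects what the cells actually hold.

```python
import numpy as np
import pandas as pd

# Same values as the failure cases reported by the unit test.
s = pd.Series([[1, 2], [3, 4, 5]])

# pandas stores list-valued cells in an object-dtype column.
print(s.dtype)
print(type(s[0]), type(s[0][0]))

# Every element is a built-in int; none is a numpy.int32 scalar.
all_python_ints = all(isinstance(x, int) for cell in s for x in cell)
any_numpy_int32 = any(isinstance(x, np.int32) for cell in s for x in cell)
print(all_python_ints, any_numpy_int32)
```

This is consistent with the suggestion above to annotate the column with the built-in `int` element type rather than `pandera.dtypes.Int32`.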

cosmicBboy commented 1 week ago

@sam-goodwin friendly ping: one of the unit tests is still failing: https://github.com/unionai-oss/pandera/actions/runs/8861081819/job/24332580434?pr=1556