Default int type should be mapped to Int32 in Windows

probberechts commented 2 years ago

Describe the bug

Pandas handles the default int type differently on Windows and Linux. On Linuxint is interpreted as int64 but on Windows as int32. Since Pandera always maps int to int64, you get unexpected SchemaErrors on Windows. You can read more about it in these issues:

Code Sample

import pandas as pd
import pandera as pa

from pandera.typing import DataFrame, Series

class Schema(pa.SchemaModel):
    price: Series[int]

df = pd.DataFrame({ 'price': [8, 12, 10, 16] }).astype(int)

InputSchema.validate(df)

This is fine on Linux, but gives a SchemaError: expected series 'price' to have type int64, got int32 on Windows.

cosmicBboy commented 2 years ago

Thanks for pointing this out @probberechts!

The relevant part of the code is here: https://github.com/pandera-dev/pandera/blob/master/pandera/engines/pandas_engine.py#L211-L215

I think something like this will defer to the pandas default dtype for a particular python built-in type:

default_pd_dtype = pd.Series([1], dtype=builtin_name).dtype

assert np.dtype(int) == default_pd_dtype  # True
# Windows
assert np.dtype("int32") == default_pd_dtype
# non-Windows
assert np.dtype("int64") == default_pd_dtype

@jeffzi FYI, I forget why we decided to map int -> int64 regardless of OS, but I think we should defer to the behavior of the underlying platform for dtype defaults.

@probberechts would you be open to making a PR for this? Basically need to change that line of code and add an OS-specific unit test here: https://github.com/pandera-dev/pandera/blob/master/tests/core/test_dtypes.py

jeffzi commented 2 years ago

I forget why we decided to map int -> int64 regardless of OS

I can't remember either. It could be a workaround from when we were fixing the windows CI.

I agree the default int should follow pandas and pandas_engine should not hardcode the size.

That said, we do test that default pandera int matches pandas: https://github.com/pandera-dev/pandera/blob/9448d0a80b8dd02910f9cc553ce00349584b107f/tests/core/test_dtypes.py#L406-L411

The problem is that the "implicit" default for a pandas Series is int64 but astype(int) will apply numpy's default, i.e. int32 on windows.

import platform
import sys
import numpy as np
import pandas as pd

print(platform.system())
#> Windows
print(sys.version)
#> 3.8.12 (default, Oct 12 2021, 03:01:40) [MSC v.1916 64 bit (AMD64)]
print(pd.__version__)
#> 1.3.4
print(np.__version__)
#> 1.21.2

print(np.dtype(int))
#> int32
print(pd.Series([1]).dtype) # implicit dtype
#> int64
print(pd.Series([1], dtype=int).dtype) # explicit dtype
#> int32

If we have int32 as default on windows, then the validation will fail when the user does not explicitly cast the series to int (e.g.: pd.Dataframe(..., dtype=int) or astype(int)). I think that's fine but we should explain this inconsistency in the documentation.

Midnighter commented 1 year ago

This is quite annoying. I tried to fix it by explicitly defining

class Schema(pa.SchemaModel):
    price: Series[numpy.int64]

    class Config:
       coerce = True

but that didn't help either 🙁

joaoe commented 1 year ago

Howdy. In my project we have people working on Linux and a few on Windows. This causes problems for us. So far I worked around by monkeypatching Int

from pandera.dtypes import Int
Int.check = lambda self, pandera_dtype, data_container=None: isinstance(pandera_dtype, Int)

Having such work around in the project is ugly, but the fix itself is not too strange.

The default generic Int type shold allow any of int8, int16, int32 or int64. It is just Int. There are other IntN classes for the specific bid widths. But if I have a Schema with just column: int I'm not too interested in pandera checking for bid width. Perhaps, the rules in the Int class should be relaxed ?

EDIT: for reference, this is the numpy bug https://github.com/numpy/numpy/issues/9464

DrShushen commented 1 year ago

@joaoe I agree, I've had to write custom workarounds in some of my projects to essentially accomplish this on Windows: https://github.com/vanderschaarlab/temporai/blob/a6bb11d63341c4947f940e3d075e8a77be06b898/src/tempor/data/pandera_utils.py#L134-L140

cosmicBboy commented 1 year ago

So the problem here is stated by @jeffzi https://github.com/unionai-oss/pandera/issues/726#issuecomment-1008951497:

On Windows:

print(np.dtype(int))
#> int32
print(pd.Series([1]).dtype) # implicit dtype
#> int64
print(pd.Series([1], dtype=int).dtype) # explicit dtype
#> int32

It seems reasonable that pd.Series([1], dtype=int) should match pd.SeriesSchema(dtype=int). But it's not ideal that the implicit integer type of pd.Series([1]) is int64, so pd.SeriesSchema(dtype=int).validate(pd.Series([1])) will fail, which I'm sure will still be surprise many users. I think this will just have to be documented clearly somewhere.

There are 130 failing tests as a result of this PR fix: https://github.com/unionai-oss/pandera/actions/runs/4909424667/jobs/8765820508?pr=1179

If anyone on this thread so far has the bandwidth to fix all of these breaking tests on windows, that would be much appreciated! The code changes to fix this issue are already on #1179, just need to update the tests to explicitly use dtype=int in tests cases that rely on the assumption that pd.Series([1]) does the right thing.

unionai-oss / pandera

Default int type should be mapped to Int32 in Windows #726

Code Sample