Open probberechts opened 2 years ago
Thanks for pointing this out @probberechts!
The relevant part of the code is here: https://github.com/pandera-dev/pandera/blob/master/pandera/engines/pandas_engine.py#L211-L215
I think something like this will defer to the pandas default dtype for a particular python built-in type:
default_pd_dtype = pd.Series([1], dtype=builtin_name).dtype
assert np.dtype(int) == default_pd_dtype # True
# Windows
assert np.dtype("int32") == default_pd_dtype
# non-Windows
assert np.dtype("int64") == default_pd_dtype
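A minimal sketch of that idea as a standalone helper. The name `default_pandas_dtype` is hypothetical and not part of pandera or pandas; it only asks pandas which concrete dtype it resolves for a Python built-in type, instead of hardcoding a width.

```python
import numpy as np
import pandas as pd


def default_pandas_dtype(builtin_type):
    """Return the dtype pandas uses when a series is built with dtype=builtin_type.

    Hypothetical helper for illustration; not part of pandera or pandas.
    """
    # An empty series is enough: with an explicit dtype argument, pandas
    # resolves the dtype from the argument rather than from the data.
    return pd.Series([], dtype=builtin_type).dtype


# Whatever the platform, the result agrees with numpy's own default for int.
assert default_pandas_dtype(int) == np.dtype(int)
assert default_pandas_dtype(float) == np.dtype("float64")
```

Because the comparison is against `np.dtype(int)` rather than a fixed width, the same assertions should hold on both Windows and non-Windows machines.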
@jeffzi FYI, I forget why we decided to map int -> int64 regardless of OS, but I think we should defer to the behavior of the underlying platform for dtype defaults.
@probberechts would you be open to making a PR for this? Basically need to change that line of code and add an OS-specific unit test here: https://github.com/pandera-dev/pandera/blob/master/tests/core/test_dtypes.py
I forget why we decided to map int -> int64 regardless of OS
I can't remember either. It could be a workaround from when we were fixing the Windows CI.
I agree the default int should follow pandas, and pandas_engine should not hardcode the size.
That said, we do test that default pandera int matches pandas: https://github.com/pandera-dev/pandera/blob/9448d0a80b8dd02910f9cc553ce00349584b107f/tests/core/test_dtypes.py#L406-L411
The problem is that the "implicit" default for a pandas Series is int64 but astype(int) will apply numpy's default, i.e. int32 on Windows.
import platform
import sys
import numpy as np
import pandas as pd
print(platform.system())
#> Windows
print(sys.version)
#> 3.8.12 (default, Oct 12 2021, 03:01:40) [MSC v.1916 64 bit (AMD64)]
print(pd.__version__)
#> 1.3.4
print(np.__version__)
#> 1.21.2
print(np.dtype(int))
#> int32
print(pd.Series([1]).dtype) # implicit dtype
#> int64
print(pd.Series([1], dtype=int).dtype) # explicit dtype
#> int32
If we have int32 as default on Windows, then the validation will fail when the user does not explicitly cast the series to int (e.g. pd.DataFrame(..., dtype=int) or astype(int)). I think that's fine but we should explain this inconsistency in the documentation.
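To make the inconsistency concrete, here is a small sketch that compares the dtype pandas infers for integer data with the dtype it uses when `int` is passed explicitly. `int_defaults_disagree` is a made-up helper name, and the result depends on the platform and the numpy version, so no output is shown.

```python
import pandas as pd


def int_defaults_disagree():
    """True when the inferred int dtype differs from the explicit dtype=int one."""
    implicit = pd.Series([1]).dtype             # inferred from the data
    explicit = pd.Series([1], dtype=int).dtype  # follows numpy's default int
    return implicit != explicit


# Platform-dependent: on a machine like the one above, the two dtypes differ.
print(int_defaults_disagree())
```

A check like this could be used in documentation or a test suite to flag environments where a series built without an explicit dtype will not match a schema declared with `int`.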
This is quite annoying. I tried to fix it by explicitly defining
class Schema(pa.SchemaModel):
    price: Series[numpy.int64]

    class Config:
        coerce = True
but that didn't help either 🙁
Howdy.
In my project we have people working on Linux and a few on Windows. This causes problems for us.
So far I have worked around it by monkeypatching Int:
from pandera.dtypes import Int
Int.check = lambda self, pandera_dtype, data_container=None: isinstance(pandera_dtype, Int)
Having such a workaround in the project is ugly, but the fix itself is not too strange.
The default generic Int type should allow any of int8, int16, int32 or int64. It is just Int. There are other IntN classes for the specific bit widths. But if I have a Schema with just column: int, I'm not too interested in pandera checking the bit width.
Perhaps the rules in the Int class should be relaxed?
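For illustration, a width-agnostic check could look like the sketch below. It uses only numpy and pandas, and `is_any_signed_int` is a hypothetical name; this is not how pandera's Int is implemented. It assumes plain numpy dtypes, not pandas extension dtypes such as the nullable Int64.

```python
import numpy as np
import pandas as pd


def is_any_signed_int(dtype):
    """True for int8, int16, int32 and int64, regardless of platform."""
    # np.signedinteger is the abstract parent of all signed integer widths.
    return np.issubdtype(np.dtype(dtype), np.signedinteger)


assert is_any_signed_int(pd.Series([1]).dtype)
assert is_any_signed_int(pd.Series([1], dtype="int32").dtype)
assert not is_any_signed_int(pd.Series([1.0]).dtype)
```

The trade-off is that such a check no longer distinguishes int32 from int64, which is the point of the proposal but may be too loose for schemas that care about overflow.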
EDIT: for reference, this is the numpy bug https://github.com/numpy/numpy/issues/9464
@joaoe I agree, I've had to write custom workarounds in some of my projects to essentially accomplish this on Windows: https://github.com/vanderschaarlab/temporai/blob/a6bb11d63341c4947f940e3d075e8a77be06b898/src/tempor/data/pandera_utils.py#L134-L140
So the problem here is stated by @jeffzi https://github.com/unionai-oss/pandera/issues/726#issuecomment-1008951497:
On Windows:
print(np.dtype(int))
#> int32
print(pd.Series([1]).dtype) # implicit dtype
#> int64
print(pd.Series([1], dtype=int).dtype) # explicit dtype
#> int32
It seems reasonable that pd.Series([1], dtype=int) should match pa.SeriesSchema(dtype=int). But it's not ideal that the implicit integer type of pd.Series([1]) is int64, so pa.SeriesSchema(dtype=int).validate(pd.Series([1])) will fail, which I'm sure will still surprise many users. I think this will just have to be documented clearly somewhere.
There are 130 failing tests as a result of this PR fix: https://github.com/unionai-oss/pandera/actions/runs/4909424667/jobs/8765820508?pr=1179
If anyone on this thread so far has the bandwidth to fix all of these breaking tests on Windows, that would be much appreciated! The code changes to fix this issue are already on #1179; we just need to update the tests to explicitly use dtype=int in test cases that rely on the assumption that pd.Series([1]) does the right thing.
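A sketch of the kind of test-data change this describes. `make_int_series` is a hypothetical helper, not something that exists in the pandera test suite.

```python
import numpy as np
import pandas as pd


def make_int_series(values):
    """Build test data with the platform-default int dtype (hypothetical helper)."""
    # Passing dtype=int avoids relying on inference, which this thread reports
    # as int64 even on platforms where the default int is int32.
    return pd.Series(values, dtype=int)


series = make_int_series([1, 2, 3])
assert series.dtype == np.dtype(int)
```

Test cases that build their data this way should agree with a schema declared with `dtype=int` on any platform, since both sides resolve `int` the same way.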
Describe the bug
Pandas handles the default int type differently on Windows and Linux. On Linux int is interpreted as int64 but on Windows as int32. Since Pandera always maps int to int64, you get unexpected SchemaErrors on Windows. You can read more about it in these issues:
Code Sample
This is fine on Linux, but gives a SchemaError: expected series 'price' to have type int64, got int32 on Windows.