pandas-dev / pandas

Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more
https://pandas.pydata.org
BSD 3-Clause "New" or "Revised" License

BUG: Inconsistent behavior while constructing a Series with large integers in an int64 masked array #56566

Open MridulS opened 8 months ago

MridulS commented 8 months ago

Reproducible Example

In [1]: import pandas as pd

In [2]: import numpy.ma as ma

In [3]: mx = ma.masked_array([4873214862074861312, 4875446630161458944, 4824652147895424384, 0, 3526420114272476800], mask=[0, 0, 0, 1, 0])

In [4]: pd.Series(mx, dtype='Int64')
Out[4]: 
0    4873214862074861568
1    4875446630161459200
2    4824652147895424000
3                   <NA>
4    3526420114272476672
dtype: Int64

In [5]: mx.data - pd.Series(mx, dtype='Int64')
Out[5]: 
0    -256
1    -256
2     384
3    <NA>
4     128
dtype: Int64

Issue Description

When creating a Series from a masked array of large integers (all below the Int64 maximum), the output doesn't match the input. This probably has something to do with a float downcast/upcast somewhere: the values are well above 2**53, beyond which float64 can no longer represent every integer exactly.
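
The float hypothesis can be checked without pandas. The following is a minimal sketch in plain Python, assuming the lossy step behaves like an int64 -> float64 -> int64 round trip. That is an assumption made for illustration; this report does not identify the actual code path inside pandas. The names `inputs`, `reported`, `round_tripped` and `spacing` are illustrative only.

```python
import math

# Unmasked values from the reproducible example (In [3]).
inputs = [
    4873214862074861312,
    4875446630161458944,
    4824652147895424384,
    3526420114272476800,
]

# Values pandas returned for the same positions (Out[4]).
reported = [
    4873214862074861568,
    4875446630161459200,
    4824652147895424000,
    3526420114272476672,
]

# Assumption: stand in for the suspected intermediate cast by
# round-tripping each integer through a 64-bit float.
round_tripped = [int(float(v)) for v in inputs]

# Gap between adjacent float64 values near each input.
spacing = [math.ulp(float(v)) for v in inputs]

for v, r, s in zip(inputs, round_tripped, spacing):
    print(f"{v} -> {r} (diff {v - r}, float64 spacing {s:.0f})")

# True if the round trip reproduces the values pandas returned.
print(round_tripped == reported)
```

If the round trip reproduces the reported values, the differences in Out[5] are float64 rounding error rather than arbitrary corruption, which would point to an intermediate float conversion.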

Possibly related: https://github.com/pandas-dev/pandas/issues/30268 and https://github.com/pandas-dev/pandas/pull/50757

Expected Behavior

mx.data - pd.Series(mx, dtype='Int64') should be zero for every unmasked element, with <NA> at the masked position.

Installed Versions

INSTALLED VERSIONS
------------------
commit : a671b5a8bf5dd13fb19f0e88edc679bc9e15c673
python : 3.11.6.final.0
python-bits : 64
OS : Darwin
OS-release : 23.2.0
Version : Darwin Kernel Version 23.2.0: Wed Nov 15 21:53:18 PST 2023; root:xnu-10002.61.3~2/RELEASE_ARM64_T6000
machine : arm64
processor : arm
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8
pandas : 2.1.4
numpy : 1.26.1
pytz : 2023.3.post1
dateutil : 2.8.2
setuptools : 68.2.2
pip : 23.3.1
Cython : 3.0.5
pytest : 7.4.3
hypothesis : 6.88.3
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : 1.1
pymysql : None
psycopg2 : None
jinja2 : 3.1.2
IPython : 8.17.2
pandas_datareader : None
bs4 : 4.12.2
bottleneck : None
dataframe-api-compat: None
fastparquet : None
fsspec : 2023.12.2
gcsfs : None
matplotlib : None
numba : None
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : 14.0.1
pyreadstat : None
pyxlsb : None
s3fs : None
scipy : None
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
zstandard : None
tzdata : 2023.3
qtpy : None
pyqt5 : None

rhshadrach commented 8 months ago

Thanks for the report, confirmed on main. Further investigation and PRs to fix are welcome!

parthi-siva commented 8 months ago

take

dontgoto commented 4 months ago

As mentioned in #58173, this issue extends beyond the Series constructor. Series.to_numpy() has the same problem, although it only becomes apparent once the constructor error described in this issue is fixed.