pandas-dev / pandas

Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more
https://pandas.pydata.org
BSD 3-Clause "New" or "Revised" License

BUG: Creation of UInt64 column with `18446744073709551615` throws RuntimeWarning #60050

Open Tishj opened 1 month ago

Tishj commented 1 month ago

Reproducible Example

import numpy.ma
import pandas

print(pandas.__version__)

data = {
    'a': numpy.ma.masked_array(
        mask=[
            True,
            True,
            False
        ],
        data=[
            0, 0, 18446744073709551615
        ],
        dtype='uint64'
    )
}

data['a'] = pandas.Series(data['a'], dtype='UInt64')

df = pandas.DataFrame.from_dict(data)
print(df['a'].dtype)
print(df)
2.2.3
/Users/thijs/.pyenv/versions/3.11.0/lib/python3.11/site-packages/pandas/core/arrays/integer.py:55: RuntimeWarning: invalid value encountered in cast
  casted = values.astype(dtype, copy=copy)
UInt64
                      a
0                  <NA>
1                  <NA>
2  18446744073709551615

Issue Description

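Creating a nullable UInt64 column from a masked array that contains `18446744073709551615` emits a RuntimeWarning ("invalid value encountered in cast").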

Expected Behavior

I would expect this to work without a warning. I don't understand why the data is interpreted as float64 when uint64 is explicitly set as the dtype on the masked array.

Installed Versions

INSTALLED VERSIONS
------------------
commit                : 0691c5cf90477d3503834d983f69350f250a6ff7
python                : 3.11.0
python-bits           : 64
OS                    : Darwin
OS-release            : 23.1.0
Version               : Darwin Kernel Version 23.1.0: Mon Oct  9 21:28:45 PDT 2023; root:xnu-10002.41.9~6/RELEASE_ARM64_T6020
machine               : arm64
processor             : arm
byteorder             : little
LC_ALL                : None
LANG                  : en_US.UTF-8
LOCALE                : en_US.UTF-8

pandas                : 2.2.3
numpy                 : 1.26.4
pytz                  : 2024.1
dateutil              : 2.8.2
pip                   : 24.0
Cython                : 3.0.11
sphinx                : None
IPython               : 8.22.0
adbc-driver-postgresql: None
adbc-driver-sqlite    : None
bs4                   : 4.12.3
blosc                 : None
bottleneck            : None
dataframe-api-compat  : None
fastparquet           : None
fsspec                : 2024.2.0
html5lib              : None
hypothesis            : None
gcsfs                 : None
jinja2                : 3.1.3
lxml.etree            : None
matplotlib            : None
numba                 : None
numexpr               : None
odfpy                 : None
openpyxl              : None
pandas_gbq            : None
psycopg2              : 2.9.9
pymysql               : None
pyarrow               : 17.0.0
pyreadstat            : None
pytest                : 8.0.0
python-calamine       : None
pyxlsb                : None
s3fs                  : None
scipy                 : None
sqlalchemy            : 2.0.35
tables                : None
tabulate              : None
xarray                : None
xlrd                  : None
xlsxwriter            : None
zstandard             : None
tzdata                : 2023.4
qtpy                  : None
pyqt5                 : None

rhshadrach commented 1 month ago

Thanks for the report. In the Series constructor, we call pandas.core.construction.sanitize_masked_array, which coerces the data to float because there are masked values. I'm guessing that if we passed along the requested dtype, and it is a masked extension array (EA) dtype, we could avoid this coercion.
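
To illustrate why the float coercion matters for this value, and how user code might sidestep it for now, here is a minimal sketch. It assumes that `pandas.arrays.IntegerArray` accepts a NumPy integer array plus a boolean mask, and that going through it bypasses the masked-array sanitization step. This is only a possible user-side workaround, not the fix proposed for pandas itself. The names `UINT64_MAX`, `masked`, `values`, and `mask` are illustrative.

```python
import numpy as np
import numpy.ma as ma
import pandas as pd

UINT64_MAX = 18446744073709551615  # 2**64 - 1

# float64 has a 53-bit significand, so 2**64 - 1 is not representable:
# it rounds up to 2**64, which no longer fits in uint64. Casting that
# float back to uint64 is what makes NumPy complain.
assert float(UINT64_MAX) == float(2**64)

masked = ma.masked_array(
    data=[0, 0, UINT64_MAX],
    mask=[True, True, False],
    dtype="uint64",
)

# Workaround sketch (an assumption, not the maintainers' fix): pass the
# raw integers and the boolean mask to IntegerArray directly, so the
# values never take a detour through float64.
values = np.asarray(masked.data, dtype="uint64")
mask = ma.getmaskarray(masked).copy()
series = pd.Series(pd.arrays.IntegerArray(values, mask))

print(series.dtype)
print(series)
```

Because the integers are never converted to float, the largest uint64 value should survive unchanged and the masked positions should show up as `<NA>`.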

Further investigations and PRs are welcome!

lfffkh commented 3 weeks ago

take

lfffkh commented 3 days ago

Hello everyone, I am stuck on how to solve this issue. In addition to the attempts in my PR, I have tried several other approaches, but none of them worked well, and I currently don't see how to fix this bug.