pandas-dev / pandas

Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more
https://pandas.pydata.org
BSD 3-Clause "New" or "Revised" License

BUG: `DataFrame.to_numpy()` unnecessarily upcasts to `object` dtype. #59054

Closed randolf-scholz closed 1 week ago

randolf-scholz commented 1 week ago

Pandas version checks

Reproducible Example

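import pandas as pd

# Two integer columns backed by pyarrow's int64 type
df = pd.DataFrame({"a": [1, 2], "b": [3, 4]}, dtype="int64[pyarrow]")

# Each column converts to a NumPy integer dtype on its own...
assert df["a"].to_numpy().dtype == int  # ✅
assert df["b"].to_numpy().dtype == int  # ✅
# ...but converting the whole frame does not
assert df.to_numpy().dtype == int  # ❌ object dtype instead of int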

Issue Description

All columns convert to int dtypes, but the full frame converts to object dtype.

Expected Behavior

One would expect column-wise and frame-wise conversion to agree: the dtype of DataFrame.to_numpy() should be identical to the dtype of

np.stack([df[col].to_numpy() for col in df], axis=-1)

(In fact, this provides a naive way of implementing DataFrame.to_numpy().)
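To make the naive approach concrete, here is a minimal sketch of it as a standalone helper. The name `stacked_to_numpy` is hypothetical and not part of pandas. It assumes a non-empty frame with unique column labels, and it is shown on a plain NumPy-backed frame so that it runs without pyarrow installed.

```python
import numpy as np
import pandas as pd


def stacked_to_numpy(df: pd.DataFrame) -> np.ndarray:
    """Hypothetical helper: convert each column on its own, then stack.

    Assumes at least one column and unique column labels; np.stack
    raises ValueError when given an empty list of arrays.
    """
    # Iterating over a DataFrame yields its column labels.
    columns = [df[col].to_numpy() for col in df]
    # axis=-1 places the columns along the last axis, giving shape
    # (n_rows, n_columns), the same layout as DataFrame.to_numpy().
    return np.stack(columns, axis=-1)


# Plain NumPy-backed frame, used here only to show the shape and dtype.
frame = pd.DataFrame({"a": [1, 2], "b": [3, 4]}, dtype="int64")
result = stacked_to_numpy(frame)
print(result.dtype, result.shape)
```

Because np.stack takes its result dtype from the per-column arrays, the stacked array is only upcast when the individual column conversions themselves disagree on a dtype.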

Installed Versions

INSTALLED VERSIONS
------------------
commit : d9cdd2ee5a58015ef6f4d15c7226110c9aab8140
python : 3.11.7.final.0
python-bits : 64
OS : Linux
OS-release : 6.5.0-41-generic
Version : #41~22.04.2-Ubuntu SMP PREEMPT_DYNAMIC Mon Jun 3 11:32:55 UTC 2
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8
pandas : 2.2.2
numpy : 2.0.0
pytz : 2024.1
dateutil : 2.9.0.post0
setuptools : 70.1.0
pip : 24.0
Cython : None
pytest : 8.2.2
hypothesis : 6.103.2
sphinx : 7.3.7
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 3.1.4
IPython : 8.25.0
pandas_datareader : None
adbc-driver-postgresql: None
adbc-driver-sqlite : None
bs4 : 4.12.3
bottleneck : None
dataframe-api-compat : None
fastparquet : None
fsspec : 2024.6.0
gcsfs : None
matplotlib : 3.9.0
numba : None
numexpr : None
odfpy : None
openpyxl : 3.1.4
pandas_gbq : None
pyarrow : 16.1.0
pyreadstat : None
python-calamine : None
pyxlsb : None
s3fs : None
scipy : 1.13.1
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
zstandard : None
tzdata : 2024.1
qtpy : None
pyqt5 : None

mroeschke commented 1 week ago

Thanks for the report. The underlying issue is tracked in https://github.com/pandas-dev/pandas/issues/22791, so closing this in favor of that issue.