pandas-dev / pandas

Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more
https://pandas.pydata.org
BSD 3-Clause "New" or "Revised" License

BUG: itemsize wrong for date32[day][pyarrow] dtype? #57948

Open MarcoGorelli opened 6 months ago

MarcoGorelli commented 6 months ago

Reproducible Example

import pyarrow as pa
import pandas as pd

pd.ArrowDtype(pa.date32())  # date32[day][pyarrow]
pd.ArrowDtype(pa.date32()).itemsize  # 8

Issue Description

I think it should show 4? pa.date32() is 32 bits, so 4 bytes

Expected Behavior

pd.ArrowDtype(pa.date32()).itemsize # 4

Installed Versions

INSTALLED VERSIONS
------------------
commit : b033ca94e7ae6e1320c9d65a8163bd0a6049f40a
python : 3.10.12.final.0
python-bits : 64
OS : Linux
OS-release : 5.15.146.1-microsoft-standard-WSL2
Version : #1 SMP Thu Jan 11 04:09:03 UTC 2024
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : C.UTF-8
LOCALE : en_US.UTF-8

pandas : 2.1.0.dev0+761.gb033ca94e7
numpy : 1.26.4
pytz : 2024.1
dateutil : 2.8.2
setuptools : 69.1.1
pip : 24.0
Cython : 3.0.8
pytest : 8.0.2
hypothesis : 6.98.15
sphinx : 7.2.6
blosc : None
feather : None
xlsxwriter : 3.2.0
lxml.etree : 5.1.0
html5lib : 1.1
pymysql : 1.4.6
psycopg2 : 2.9.9
jinja2 : 3.1.3
IPython : 8.22.1
pandas_datareader : None
adbc-driver-postgresql: None
adbc-driver-sqlite : None
bs4 : 4.12.3
bottleneck : 1.3.8
fastparquet : 2024.2.0
fsspec : 2024.2.0
gcsfs : 2024.2.0
matplotlib : 3.8.3
numba : 0.59.0
numexpr : 2.9.0
odfpy : None
openpyxl : 3.1.2
pyarrow : 15.0.2
pyreadstat : 1.2.6
python-calamine : None
pyxlsb : 1.0.10
s3fs : 2024.2.0
scipy : 1.12.0
sqlalchemy : 2.0.27
tables : 3.9.2
tabulate : 0.9.0
xarray : 2024.2.0
xlrd : 2.0.1
zstandard : 0.22.0
tzdata : 2024.1
qtpy : 2.4.1
pyqt5 : None

MarcoGorelli commented 6 months ago

Not sure what to do here, as there isn't a numpy dtype corresponding to date32

https://github.com/pandas-dev/pandas/blob/77f9d7abee14888447a1f9942f7f6f4cdbcd927b/pandas/core/dtypes/dtypes.py#L2215-L2218

mroeschke commented 6 months ago

I think it's OK to override ArrowDtype.itemsize to handle this case separately

jorisvandenbossche commented 5 months ago

PyArrow data types of fixed width have a bit_width attribute that could be used here. Accessing it does raise for nested types, though; at the moment we just return 8 for those, from the numpy object dtype, which doesn't necessarily make sense either.

In [13]: pd.ArrowDtype(pa.list_(pa.int32())).itemsize
Out[13]: 8
longovin commented 5 months ago

take

echerrin commented 5 months ago

take

Charlie-H7 commented 4 months ago

I'm having a hard time finding where the relationship between numpy_dtype and the itemsize method is defined. Going to the definition of the itemsize method does not show an implementation, so I'm not sure what it is doing.

pandas/pandas/core/dtypes/dtypes.py#L2213-L2216

Any help is appreciated