pandas-dev / pandas

Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more
https://pandas.pydata.org
BSD 3-Clause "New" or "Revised" License

BUG: `.dt.microsecond` for pyarrow-backed Series returns 0 #59154

Closed MarcoGorelli closed 2 months ago

MarcoGorelli commented 3 months ago

Reproducible Example

In [22]: s = pd.Series(pd.date_range('2000', periods=3, freq='15ms'))

In [23]: s.dt.microsecond
Out[23]:
0        0
1    15000
2    30000
dtype: int32

In [24]: s.convert_dtypes(dtype_backend='pyarrow')
Out[24]:
0           2000-01-01 00:00:00
1    2000-01-01 00:00:00.015000
2    2000-01-01 00:00:00.030000
dtype: timestamp[ns][pyarrow]

In [25]: s.convert_dtypes(dtype_backend='pyarrow').dt.microsecond
Out[25]:
0    0
1    0
2    0
dtype: int64[pyarrow]

Issue Description

For a non-pyarrow-backed dtype, `.dt.microsecond` returns the total number of microseconds since the last full second.

For the pyarrow-backed dtype, it returns 0 for every element in the example above.

Expected Behavior

In [25]: s.convert_dtypes(dtype_backend='pyarrow').dt.microsecond
Out[25]:
0        0
1    15000
2    30000
dtype: int64[pyarrow]

Installed Versions

INSTALLED VERSIONS
------------------
commit : d9cdd2ee5a58015ef6f4d15c7226110c9aab8140
python : 3.11.9.final.0
python-bits : 64
OS : Linux
OS-release : 5.15.153.1-microsoft-standard-WSL2
Version : #1 SMP Fri Mar 29 23:14:13 UTC 2024
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : C.UTF-8
LOCALE : en_US.UTF-8
pandas : 2.2.2
numpy : 2.0.0
pytz : 2024.1
dateutil : 2.9.0.post0
setuptools : 65.5.0
pip : 24.0
Cython : None
pytest : 8.1.1
hypothesis : 6.100.1
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 3.1.3
IPython : 8.23.0
pandas_datareader : None
adbc-driver-postgresql: None
adbc-driver-sqlite : None
bs4 : 4.12.3
bottleneck : None
dataframe-api-compat : None
fastparquet : None
fsspec : 2024.3.1
gcsfs : None
matplotlib : 3.8.4
numba : None
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : 16.1.0
pyreadstat : None
python-calamine : None
pyxlsb : None
s3fs : None
scipy : 1.13.0
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
zstandard : None
tzdata : 2024.1
qtpy : None
pyqt5 : None

Aloqeely commented 3 months ago

This looks like an issue with pyarrow.

In [2]: import pyarrow as pa

In [3]: s = pd.Series(pd.date_range('2000', periods=3, freq='15ms'))

In [4]: pa.Array.from_pandas(s)
Out[4]:
<pyarrow.lib.TimestampArray object at 0x00000183A8E793C0>
[
  2000-01-01 00:00:00.000000000,
  2000-01-01 00:00:00.015000000,
  2000-01-01 00:00:00.030000000
]

In [5]: pa.compute.microsecond(pa.Array.from_pandas(s))
Out[5]:
<pyarrow.lib.Int64Array object at 0x00000183A8E7BDC0>
[
  0,
  0,
  0
]
MarcoGorelli commented 3 months ago

thanks for checking - looks like it's not a bug in pyarrow, but just that pyarrow's method does something different

https://arrow.apache.org/docs/python/generated/pyarrow.compute.microsecond.html

Microsecond returns number of microseconds since the last full millisecond

So, I think this needs working around in pandas