pandas-dev / pandas

Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more
https://pandas.pydata.org
BSD 3-Clause "New" or "Revised" License
43.6k stars 17.9k forks source link

BUG: timedelta64[s] series constructor isn't equal with alternative constructor using to_timedelta unit='s' #48312

Open ntachukwu opened 2 years ago

ntachukwu commented 2 years ago

Pandas version checks

Reproducible Example

import pandas as pd
from pandas import Series
import pandas._testing as tm

result = Series([1000000, 200000, 3000000], dtype="timedelta64[s]")
expected = Series(pd.to_timedelta([1000000, 200000, 3000000], unit="s"))
tm.assert_series_equal(result, expected)

Issue Description

This code passes

 result = Series([1000000, 200000, 3000000], dtype="timedelta64[ns]")
 expected = Series(pd.to_timedelta([1000000, 200000, 3000000], unit="ns"))
 tm.assert_series_equal(result, expected)

But when dtype="timedelta64[s]" and unit="s" it returns

AssertionError: numpy array are different

numpy array values are different (100.0 %)
[index]: [0, 1, 2]
[left]:  [1000000, 200000, 3000000]
[right]: [1000000000000000, 200000000000000, 3000000000000000]

Expected Behavior

Both series should be equal.

Installed Versions

INSTALLED VERSIONS ------------------ commit : 201cbf6bc1e9f2a4cdd6ae19b2a42fa10effb0c8 python : 3.9.10.final.0 python-bits : 64 OS : Darwin OS-release : 21.3.0 Version : Darwin Kernel Version 21.3.0: Wed Jan 5 21:37:58 PST 2022; root:xnu-8019.80.24~20/RELEASE_ARM64_T8101 machine : arm64 processor : arm byteorder : little LC_ALL : None LANG : None LOCALE : None.UTF-8 pandas : 1.5.0.dev0+1364.g201cbf6bc1.dirty numpy : 1.22.3 pytz : 2022.1 dateutil : 2.8.2 setuptools : 60.9.3 pip : 22.1.1 Cython : 0.29.32 pytest : 7.1.2 hypothesis : 6.52.3 sphinx : None blosc : None feather : None xlsxwriter : None lxml.etree : 4.8.0 html5lib : None pymysql : None psycopg2 : None jinja2 : 3.1.2 IPython : 8.3.0 pandas_datareader: 0.10.0 bs4 : 4.11.1 bottleneck : None brotli : None fastparquet : None fsspec : None gcsfs : None matplotlib : 3.5.2 numba : None numexpr : None odfpy : None openpyxl : None pandas_gbq : None pyarrow : None pyreadstat : None pyxlsb : None s3fs : None scipy : 1.8.1 snappy : None sqlalchemy : None tables : None tabulate : None xarray : None xlrd : None xlwt : None zstandard : None tzdata : None
phofl commented 2 years ago

Simple reproducer:

result = np.array([1000000, 200000, 3000000], dtype="timedelta64[s]")
result_pandas = pd.Series([1000000, 200000, 3000000], dtype="timedelta64[s]")
tm.assert_numpy_array_equal(result, result_pandas.values)

This should pass, but we seem to ignore the seconds and interpret it as nanoseconds

phofl commented 2 years ago

This is currently not supported and should raise imo rather than returning buggy conversions

cc @jbrockmendel

jbrockmendel commented 2 years ago

This is currently not supported and should raise imo rather than returning buggy conversions

Agreed.

Also will be supported in 2.0, so just need a temporary patch for 1.4.x/1.5.x

jbrockmendel commented 2 years ago

cc @mroeschke @jreback this becomes more salient with non-nano support. pd.Series([1, 2, 3], dtype="m8[s]") i think ideally should interpret those integers as seconds, but without an API change it will interpret them as nanoseconds, then cast the result to m8[s]. Interpreting them as seconds would also be consistent with pd.Series([1, 2 , 3]).astype("m8[s]")

mroeschke commented 2 years ago

That sounds reasonable; it also make it effectively similar to to_timedelta([ints], unit="s") which in spirit mangles "unit" and "reso" but may not matter.

jbrockmendel commented 2 years ago

cc @jreback

jbrockmendel commented 2 years ago

possible deprecation cycles notwithstanding, my preferred behavior would be for pd.Series(some_ints, dtype="m8[unit]").to_numpy() to match np.array(some_ints, dtype="m8[unit]"). i'd do the same for dt64 dtypes.

jreback commented 2 years ago

proposal sounds good