pandas-dev / pandas

Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more
https://pandas.pydata.org
BSD 3-Clause "New" or "Revised" License

misrepresented fractional seconds in timestamps and timedeltas #23059

Open cbertinato opened 5 years ago

cbertinato commented 5 years ago

Code Sample

>>> import pandas as pd
>>> timestamp = 1490193630.8
>>> pd.to_timedelta(timestamp, unit='s')
Timedelta('17247 days 14:40:30.799999')

However, this works:

>>> pd.to_timedelta(timestamp*1e9, unit='ns')
Timedelta('17247 days 14:40:30.800000')

Problem description

Depending on the value of the float and on the unit, converting a float to a timestamp or timedelta occasionally misrepresents the fractional part of the input.

Expected Output

Timedelta('17247 days 14:40:30.8')

Output of pd.show_versions()

INSTALLED VERSIONS
------------------
commit: a4482db46535b018053b4351689ccc3c79257d8c
python: 3.6.5.final.0
python-bits: 64
OS: Darwin
OS-release: 17.7.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8
pandas: 0.24.0.dev0+716.ga4482db46
pytest: 3.5.1
pip: 9.0.3
setuptools: 39.0.1
Cython: 0.28.2
numpy: 1.14.3
scipy: 1.1.0
pyarrow: 0.9.0.post1
xarray: 0.10.3
IPython: 6.4.0
sphinx: 1.7.4
patsy: None
dateutil: 2.7.3
pytz: 2018.4
blosc: 1.5.1
bottleneck: 1.2.1
tables: 3.4.3
numexpr: 2.6.5
feather: 0.4.0
matplotlib: 2.2.2
openpyxl: 2.5.3
xlrd: 1.1.0
xlwt: 1.3.0
xlsxwriter: 1.0.4
lxml: 4.2.1
bs4: 4.6.0
html5lib: 1.0.1
sqlalchemy: 1.2.7
pymysql: 0.8.1
psycopg2: None
jinja2: 2.10
s3fs: 0.1.5
fastparquet: 0.1.5
pandas_gbq: None
pandas_datareader: None
gcsfs: None

cbertinato commented 5 years ago

The root of the issue goes back to cast_from_unit in https://github.com/pandas-dev/pandas/blob/master/pandas/_libs/tslibs/timedeltas.pyx#L275, so it could very well affect other types. This is a classic case of floating-point representation error, but I think we should represent the fractional second that was intended.
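
To make the representation error concrete, here is a minimal pure-Python sketch using only the standard library. It assumes the conversion splits the float into an integer base and a fractional remainder, then truncates when scaling to nanoseconds. That split is an assumption for illustration, not a copy of cast_from_unit.

```python
from decimal import Decimal

timestamp = 1490193630.8

# The double closest to 1490193630.8 lies slightly below it.
exact = Decimal(timestamp)
print(exact)  # 1490193630.7999999523162841796875

# Assumed conversion path (illustrative): split, scale, truncate.
base = int(timestamp)
frac = timestamp - base
nanos_from_seconds = base * 10**9 + int(frac * 1e9)
print(nanos_from_seconds)  # 1490193630799999952

# Scaling the whole float first rounds the product to the nearest double,
# which for this input happens to be exactly the intended value.
nanos_scaled_first = int(timestamp * 1e9)
print(nanos_scaled_first)  # 1490193630800000000
```

The second result matches the unit='ns' workaround above: at that magnitude adjacent doubles are 256 ns apart, and the intended value is one of them.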

One thought is to use the decimal module here:

timestamp = Decimal(str(ts))
base = Decimal(<int64_t>ts)
frac = timestamp - base
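
For readers who want to try the idea outside Cython, a rough pure-Python equivalent might look like the following. The function name seconds_to_nanos and the hard-coded seconds unit are hypothetical, chosen only for this sketch; it relies on str() of a float returning the shortest string that round-trips, and it does not handle NaN or infinity.

```python
from decimal import Decimal

NANOS_PER_SECOND = 10**9


def seconds_to_nanos(ts: float) -> int:
    """Convert float seconds to integer nanoseconds via Decimal (sketch)."""
    # str(ts) yields the shortest repr that round-trips, e.g. '1490193630.8',
    # so Decimal sees the digits the user most likely intended.
    intended = Decimal(str(ts))
    base = int(ts)  # integer seconds, truncated toward zero
    frac = intended - Decimal(base)  # exact decimal fraction, e.g. 0.8
    return base * NANOS_PER_SECOND + int(frac * NANOS_PER_SECOND)


print(seconds_to_nanos(1490193630.8))  # 1490193630800000000
```

The extra string formatting and Decimal arithmetic per value is where the additional cost comes from.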

It does the job, but there is a performance penalty. Currently:

%timeit pd.to_timedelta(pd.Series([ts]*10000), unit='s')
28.1 ms ± 873 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

versus using Decimal:

%timeit pd.to_timedelta(pd.Series([ts]*10000), unit='s')
44.9 ms ± 1.9 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
mroeschke commented 5 years ago

This may be an artifact of https://github.com/pandas-dev/pandas/pull/19732, which was chalked up to floating-point precision errors. We have a related warning here: https://pandas.pydata.org/pandas-docs/stable/timeseries.html#epoch-timestamps

It would be great to have a more precise calculation here, but the performance impact of using Decimal is not too attractive.