pandas-dev / pandas


Sub-nanosecond datetime support #30823

Open wmayner opened 4 years ago

wmayner commented 4 years ago

This is a place to discuss whether supporting datetime logic with sub-nanosecond precision is desirable and feasible.

Motivation

I'm using pandas to interact with neuroscientific data, specifically electrophysiology data. These data are sampled at very precise frequencies whose periods cannot be expressed as an integer number of nanoseconds. The datasets often span a long enough duration that rounding the sampling period to the nearest nanosecond would result in an unacceptable accumulation of rounding error, so it's important to represent the sampling frequency as accurately as possible. This precludes using pandas' datetime logic to represent timestamps. It would be useful to be able to leverage that logic, though, because it makes things like down- and up-sampling very easy.
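For a sense of scale, here is a rough back-of-the-envelope check of how fast that rounding error accumulates, using the sampling rate from the example further down in this thread:

```python
fs = 256.9901428222656                  # sampling rate (Hz)
period_ns = 1e9 / fs                    # ~3_891_199.83 ns; not an integer
err_per_sample = abs(round(period_ns) - period_ns)  # ~0.17 ns per sample
samples_per_day = int(86400 * fs)       # ~22.2 million samples
print(err_per_sample * samples_per_day / 1e6)       # ~3.8 ms of drift per day
```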

The attotime library appears to offer support for arbitrary-precision timestamps with a datetime interface, so it might be a good starting point for this. I just learned of it now though, so I'm not sure whether it's mature enough, etc.

If this is out of scope for the project, or if there's an obvious workaround or best practice that I'm missing, sorry for the noise and please let me know!

WillAyd commented 4 years ago

@jbrockmendel

jbrockmendel commented 4 years ago

@wmayner pandas' datetime support is built on top of numpy's datetime64, which does support picosecond and attosecond units (and would be much more performant than a decimal-based implementation)

Would being limited to 64 bits be a problem for your use case? i.e. for attoseconds np.timedelta64 only supports timedeltas of about ±9.2 seconds (or timestamps within 9.2 seconds of 1970-01-01 00:00:00)
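That limit comes from datetime64/timedelta64 storing a single 64-bit integer count of the chosen unit, so finer units shrink the representable span:

```python
import numpy as np

max_as = np.iinfo(np.int64).max      # 9223372036854775807 attoseconds
print(max_as / 1e18)                 # ~9.22 seconds on either side of the epoch
print(np.datetime64(max_as, 'as'))   # 1970-01-01T00:00:09.223372036854775807
```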

Non-nanosecond support is a pretty common request, and I think we'd be open to it if someone stepped up to implement it.

jreback commented 4 years ago

An extension array that implemented unit support for datetimes / timedeltas would actually be somewhat straightforward in the current framework.
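A minimal sketch of what that could look like, storing int64 attosecond counts behind the documented ExtensionArray interface (the dtype name, class names, and NA sentinel here are all hypothetical, not anything pandas ships; real unit support would also need arithmetic, resampling, etc. on top):

```python
import numpy as np
import pandas as pd
from pandas.api.extensions import (
    ExtensionArray,
    ExtensionDtype,
    register_extension_dtype,
    take,
)


@register_extension_dtype
class AttosecondDtype(ExtensionDtype):
    """Hypothetical dtype: int64 count of attoseconds since the Unix epoch."""
    name = "attosecond"
    type = np.int64
    na_value = np.iinfo(np.int64).min  # sentinel, by analogy with NaT

    @classmethod
    def construct_array_type(cls):
        return AttosecondArray


class AttosecondArray(ExtensionArray):
    """Bare-bones ExtensionArray backed by an int64 ndarray of attoseconds."""

    def __init__(self, values):
        self._data = np.asarray(values, dtype="int64")

    @classmethod
    def _from_sequence(cls, scalars, dtype=None, copy=False):
        data = np.asarray(scalars, dtype="int64")
        return cls(data.copy() if copy else data)

    @classmethod
    def _from_factorized(cls, values, original):
        return cls(values)

    def __getitem__(self, item):
        result = self._data[item]
        return result if np.ndim(result) == 0 else type(self)(result)

    def __len__(self):
        return len(self._data)

    @property
    def dtype(self):
        return AttosecondDtype()

    @property
    def nbytes(self):
        return self._data.nbytes

    def isna(self):
        return self._data == self.dtype.na_value

    def take(self, indices, allow_fill=False, fill_value=None):
        if fill_value is None:
            fill_value = self.dtype.na_value
        result = take(self._data, indices, allow_fill=allow_fill, fill_value=fill_value)
        return type(self)(result)

    def copy(self):
        return type(self)(self._data.copy())

    @classmethod
    def _concat_same_type(cls, to_concat):
        return cls(np.concatenate([arr._data for arr in to_concat]))


# Usage: a Series of sub-nanosecond timestamps
s = pd.Series(AttosecondArray([3_891_199_829_760_007, 7_782_399_659_520_014]))
print(s.dtype)  # attosecond
```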

jreback commented 4 years ago

There is another issue about this (arbitrary unit support in datetimes) if someone can link it.

wmayner commented 4 years ago

@jbrockmendel Thanks. 64 bits is enough, but I'm confused by the following behavior, which seems to show a loss of precision when casting to pd.Timestamp and then back to np.datetime64:

```python
import numpy as np
import pandas as pd

# Sampling frequency (Hz)
fs = 256.9901428222656
# Sampling period (seconds, = 1 / fs)
t = 0.0038911998297600074
# Convert to an integer number of attoseconds
t = int(t * 1e18)

t = np.datetime64(t, 'as')
print('np.datetime64:'.rjust(22), t)
t = pd.Timestamp(t)
print('pd.Timestamp:'.rjust(22), t)
t = np.datetime64(t)
print('back to np.datetime64:'.rjust(22), t)
```

Output:

```
        np.datetime64: 1970-01-01T00:00:00.003891199829760007
         pd.Timestamp: 1970-01-01 00:00:00.003891199
back to np.datetime64: 1970-01-01T00:00:00.003891
```

Edit: Rereading your comment, I realized that you're not saying that pandas currently supports sub-nanosecond units, just that the underlying np.datetime64 does 😄 I'm still confused that casting back to np.datetime64 appears to lose precision as well.


Output of pd.show_versions()

```
INSTALLED VERSIONS
------------------
commit : None
python : 3.7.4.final.0
python-bits : 64
OS : Linux
OS-release : 4.15.0-72-generic
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : en_US.UTF-8
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8
pandas : 0.25.3
numpy : 1.18.1
pytz : 2019.3
dateutil : 2.8.0
pip : 19.3.1
setuptools : 42.0.2
Cython : 0.29.14
pytest : 5.0.1
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : 2.8.4 (dt dec pq3 ext lo64)
jinja2 : 2.10.3
IPython : 7.10.2
pandas_datareader: None
bs4 : None
bottleneck : None
fastparquet : None
gcsfs : None
lxml.etree : None
matplotlib : 3.1.1
numexpr : 2.7.1
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pytables : None
s3fs : None
scipy : 1.3.2
sqlalchemy : None
tables : 3.5.1
xarray : 0.13.0
xlrd : 1.2.0
xlwt : None
xlsxwriter : None
```
jbrockmendel commented 4 years ago

> I'm still confused that casting back to np.datetime64 appears to lose precision as well.

When pd.Timestamp gets a non-nano np.datetime64 object, it casts it to nanoseconds, which is lossy in this use case. We could probably issue a warning about loss of precision. A PR to do so would be welcome.
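A minimal sketch of what such a check could look like, as a standalone helper rather than the actual pandas internals (the function name is made up):

```python
import warnings

import numpy as np
import pandas as pd


def timestamp_with_precision_check(value: np.datetime64) -> pd.Timestamp:
    """Hypothetical helper: warn if casting to nanoseconds loses information."""
    as_ns = value.astype("datetime64[ns]")
    if as_ns.astype(value.dtype) != value:  # round-trip changed the value
        warnings.warn("Converting to Timestamp lost sub-nanosecond precision")
    return pd.Timestamp(as_ns)


t = np.datetime64(3_891_199_829_760_007, "as")
print(timestamp_with_precision_check(t))  # warns; 1970-01-01 00:00:00.003891199
```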

wmayner commented 4 years ago

Yes, I think a warning would be helpful. I can try to put that together if I have the time!

But my confusion is about the cast from pd.Timestamp to np.datetime64: that seems to lose precision too. In the example above the pd.Timestamp object has nanosecond precision, so why does the np.datetime64 seem to have only microsecond precision?

jbrockmendel commented 4 years ago

> so why does the np.datetime64 seem to have only microsecond precision?

That is going on inside the np.datetime64 constructor, and I can only speculate that it is treating the Timestamp as a stdlib datetime, which we would expect to have microsecond precision. To retain precision when converting from a Timestamp, try ts.to_datetime64().
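A quick check of the difference, matching the behavior described in this thread:

```python
import numpy as np
import pandas as pd

ts = pd.Timestamp("1970-01-01 00:00:00.003891199")

# The generic constructor goes through the datetime interface (µs precision):
print(np.datetime64(ts))    # 1970-01-01T00:00:00.003891
# The dedicated method keeps the full nanosecond value:
print(ts.to_datetime64())   # 1970-01-01T00:00:00.003891199
```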

jbrockmendel commented 1 year ago

In 2.0 we support the non-nano resolutions "s", "ms", and "us", but not sub-nano resolutions. If someone wants to implement sub-nano support, it wouldn't be that difficult. I do think warning about precision loss when constructing a Timestamp from a sub-nano datetime64 might be worthwhile.
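A quick illustration with pandas 2.x (the last line reflects my understanding that sub-nano input is still cast to nanoseconds):

```python
import numpy as np
import pandas as pd  # assumes pandas >= 2.0

# Non-nano resolutions are now preserved on Timestamp:
ts = pd.Timestamp(np.datetime64("2005-02-25T03:30:00", "s"))
print(ts.unit)                # 's'
print(ts.as_unit("ms").unit)  # 'ms'

# Sub-nano input is still converted to nanoseconds, losing precision:
print(pd.Timestamp(np.datetime64(3_891_199_829_760_007, "as")))
# 1970-01-01 00:00:00.003891199
```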