wmayner opened this issue 4 years ago
@jbrockmendel
@wmayner pandas' datetime support is built on top of numpy's datetime64, which does support picosecond and attosecond units (and would be much more performant than a decimal-based implementation)
Would being limited to 64 bits be a problem for your use case? i.e. for attoseconds np.timedelta64 only supports timedeltas of about ±9.2 seconds (or timestamps within 9.2 seconds of 1970-01-01 00:00:00)
non-nanosecond support is a pretty common request, and I think we'd be open to it if someone stepped up to implement it.
an extension array that implemented unit support for datetimes / timedeltas would actually be somewhat straightforward in the current framework
there is another issue about this (arbitrary unit support in datetimes) if someone can link it
@jbrockmendel Thanks. 64 bits is enough, but I'm confused by the following behavior, which seems to show a loss of precision when casting to pd.Timestamp and then back to np.datetime64:
```python
import numpy as np
import pandas as pd

# Sampling frequency (Hz)
fs = 256.9901428222656
# Sampling period (seconds)
t = 0.0038911998297600074
# Convert to an integer number of attoseconds
t = int(t * 1e18)
t = np.datetime64(t, 'as')
print('np.datetime64:'.rjust(22), t)
t = pd.Timestamp(t)
print('pd.Timestamp:'.rjust(22), t)
t = np.datetime64(t)
print('back to np.datetime64:'.rjust(22), t)
```
Output:

```
        np.datetime64: 1970-01-01T00:00:00.003891199829760007
         pd.Timestamp: 1970-01-01 00:00:00.003891199
back to np.datetime64: 1970-01-01T00:00:00.003891
```
Edit: Rereading your comment, I realized that you're not saying that pandas currently supports sub-nanosecond units, just that the underlying np.datetime64 does 😄 I'm still confused that casting back to np.datetime64 appears to lose precision as well.
> I'm still confused that casting back to np.datetime64 appears to lose precision as well.
When pd.Timestamp gets a non-nano np.datetime64 object, it casts it to nanoseconds, which is lossy in this use case. We could probably issue a warning about loss of precision. A PR to do so would be welcome.
Yes, I think a warning would be helpful. I can try to put that together if I have the time!
But my confusion is about the cast from pd.Timestamp to np.datetime64: that seems to lose precision too. In the example above the pd.Timestamp object has nanosecond precision, so why does the np.datetime64 seem to have only microsecond precision?
> so why does the np.datetime64 seem to have only microsecond precision?
That is going on inside the np.datetime64 constructor, and I can only speculate that it is treating the Timestamp as a datetime, which we would expect to have microsecond precision. To retain precision when converting from a Timestamp, try ts.to_datetime64().
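To illustrate the difference, here is a quick sketch (the nanosecond value is taken from the example above):

```python
import numpy as np
import pandas as pd

ts = pd.Timestamp("1970-01-01 00:00:00.003891199")

# Passing the Timestamp through the np.datetime64 constructor treats it as a
# plain datetime.datetime, which only carries microsecond precision:
lossy = np.datetime64(ts)
print(lossy, lossy.dtype)  # microsecond ('us') unit; nanoseconds are dropped

# to_datetime64() preserves the Timestamp's full nanosecond precision:
exact = ts.to_datetime64()
print(exact, exact.dtype)  # nanosecond ('ns') unit
```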
In 2.0 we support the non-nano resolutions "s", "ms", and "us", but not sub-nano resolutions. If someone wants to implement sub-nano support, it wouldn't be that difficult. I do think warning of precision loss when constructing a Timestamp from a sub-nano datetime64 might be worthwhile.
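For example, with pandas 2.0+ a Timestamp carries an explicit unit that can be converted with as_unit (a minimal sketch; the timestamp value is arbitrary):

```python
import pandas as pd

# A string with nine fractional digits parses at nanosecond resolution:
ts = pd.Timestamp("2020-01-01 00:00:00.123456789")
print(ts.unit)  # 'ns'

# Convert to a coarser, non-nanosecond resolution (pandas >= 2.0).
# Sub-nanosecond fields are rounded away, so this is lossy here:
ts_s = ts.as_unit("s")
print(ts_s.unit)  # 's'
print(ts_s)
```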
This is a place to discuss whether supporting datetime logic with sub-nanosecond precision is desirable and feasible.
Motivation
I'm using pandas to interact with neuroscientific data—specifically, electrophysiology data. These data are sampled at very precise frequencies that cannot be expressed as an integer number of nanoseconds. Often the datasets span a large enough duration that rounding the sampling frequency to the nearest nanosecond would result in an unacceptable accumulation of rounding errors over the duration, so it's important to represent the sampling frequency as accurately as possible. This precludes using pandas' datetime logic to represent timestamps. It would be useful to be able to leverage that logic, though, because it makes things like down/up-sampling very easy.
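To make the accumulation concrete, here is a rough back-of-the-envelope sketch using the sampling frequency from the example above (the 24-hour recording duration is an assumption for illustration):

```python
# Illustrative arithmetic only; fs is the example sampling frequency.
fs = 256.9901428222656           # sampling frequency, Hz
period_ns = 1e9 / fs             # exact sampling period, ~3_891_199.83 ns
rounded_ns = round(period_ns)    # nearest-nanosecond approximation
err_per_sample_ns = rounded_ns - period_ns  # ~0.17 ns per sample

# Over a hypothetical 24-hour recording the per-sample error compounds:
n_samples = int(fs * 24 * 3600)
accumulated_err_ns = err_per_sample_ns * n_samples
print(accumulated_err_ns)  # drift on the order of milliseconds per day
```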
The attotime library appears to offer support for arbitrary-precision timestamps with a datetime interface, so it might be a good starting point for this. I just learned of it now though, so I'm not sure whether it's mature enough, etc. If this is out of scope for the project, or if there's an obvious workaround or best practice that I'm missing, sorry for the noise and please let me know!