pandas-dev / pandas

Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more
https://pandas.pydata.org
BSD 3-Clause "New" or "Revised" License

Support or default to less detailed datetime64 #7307

Closed CarstVaartjes closed 1 year ago

CarstVaartjes commented 10 years ago

Hi,

I regularly run into issues where I have dates that fall outside of pandas's datetime bounds. Quite a few data sources have defaults such as "9999-12-31", and values like that lead to issues in pandas.

This is because pandas defaults to nanoseconds, where the time span is quite limited. See: http://docs.scipy.org/doc/numpy/reference/arrays.datetime.html

Code  Meaning      Time span (relative)  Time span (absolute)
s     second       +/- 2.9e12 years      [2.9e9 BC, 2.9e9 AD]
ms    millisecond  +/- 2.9e9 years       [2.9e6 BC, 2.9e6 AD]
us    microsecond  +/- 2.9e6 years       [290301 BC, 294241 AD]
ns    nanosecond   +/- 292 years         [1678 AD, 2262 AD]

I first thought the unit='s' parameter of to_datetime would work (see: http://pandas.pydata.org/pandas-docs/version/0.14.0/generated/pandas.tseries.tools.to_datetime.html), but that is only for interpreting numeric input before converting to nanoseconds (I think), and the "ns" detail level seems to be rather hard-coded.
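
For illustration, a minimal sketch of the failure mode (a hypothetical session, not from my actual data; the exact error and keyword spelling vary by pandas version):

import pandas as pd

pd.to_datetime("9999-12-31")
# raises OutOfBoundsDatetime: Out of bounds nanosecond timestamp: 9999-12-31

# errors='coerce' (coerce=True in older versions) "works", but silently replaces the value
# with NaT, i.e. the data is lost:
pd.to_datetime(["2015-01-01", "9999-12-31"], errors="coerce")
# DatetimeIndex(['2015-01-01', 'NaT'], dtype='datetime64[ns]', freq=None)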

I cannot imagine the majority of use cases needing nanoseconds; even going to microseconds extends the date range to something that, in my experience, should always work. The nanosecond upper bound of 2262 AD is really limiting.

Imho, ideally one should be able to choose the detail level. Is this just me or is this a common issue?

jreback commented 10 years ago

http://pandas-docs.github.io/pandas-docs-travis/gotchas.html#timestamp-limitations

simply use periods and use NaT as appropriate for missing values
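
A minimal sketch of that suggestion (dtype inference for a mix of Period and NaT varies by pandas version):

import pandas as pd

# Periods cover a far wider range than datetime64[ns]; NaT marks missing values.
s = pd.Series([pd.Period("9999-12-31", freq="D"), pd.Period("2014-06-01", freq="D"), pd.NaT])
s
# 0    9999-12-31
# 1    2014-06-01
# 2           NaT
# dtype: period[D]   (object dtype in older versions)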

ifmihai commented 10 years ago

it's a more common issue than believed

I bump into this issue more often than not

my guess is that at least 99% of pandas users don't need nanoseconds

if this is true, then it would be common sense to use microseconds as the default, and to have something like df = P.DataFrame(index=nano_range(start, end)) for those who do need nanoseconds

nanosecond usage is a particular case, not the rule

I really don't understand the rationale behind all this nano planning

But I don't understand how hard it would be to change to microseconds, either

jreback commented 10 years ago

@ifmihai

In [84]: period_range('201-1-1 10:20','9000-1-1 12:00',freq='H')
Out[84]: 
<class 'pandas.tseries.period.PeriodIndex'>
[201-01-01 10:00, ..., 9000-01-01 12:00]
Length: 77130459, Freq: H

you can simply use a PeriodIndex to represent any time span you want. DatetimeIndex and PeriodIndex have different use cases. DatetimeIndex provides maximum resolution within the most common timespans and is thus pretty general. If you need more range but less resolution, then use a PeriodIndex. You CAN have it both ways. No need to change the defaults and try to conform everyone to one representation.

ifmihai commented 10 years ago

@jreback "no need to change the defaults and try to conform everyone to one representation"

what about nanoseconds? isn't it the same?

I feel forced to conform to whoever decided nanoseconds were best (which is really not practical for most use cases). I would surely want to hear the opinion of whoever decided on the nanosecond unit, and listen to what they have to say. I doubt it will be convincing, except that they were probably forced into it by numpy, but that's out of my league.

anyway, the nanosecond unit receives a big -1 from me (I already started to hate nanoseconds btw :D)

jreback commented 10 years ago

@ifmihai this was decided quite a long time ago, by @wesm (and I think it was numpy that drove it, not really sure). As I said before, that was one reason Periods were created. Changing the default is a non-starter.

CarstVaartjes commented 10 years ago

But how can we read a csv file, for instance, and convert dates to periods? Sincere question, as I'm really struggling with this and the documentation is really unclear here. Because there is:

1) a "to_datetime", where I can give a date format, but it will refuse to convert out-of-nanosecond-bound datetimes, so I cannot use it (I will lose data); the units do not work (as in: "unit='s'" does not solve the out-of-bounds issue, while in numpy it does!)
2) a "to_timestamp", but that doesn't really do any string-to-date conversion
3) a "to_period", also not a string-to-date conversion but "Convert TimeSeries from DatetimeIndex to PeriodIndex"; however, I cannot create a DatetimeIndex with my values in the first place because I lose all out-of-bound values

See also this example:

import datetime

import numpy as np
from pandas import DataFrame

x_df = DataFrame([[20120101, 20121231], [20130101, 20131231],
                  [20140101, 20141231], [20150101, 99991231]])
x_df.columns = ['date_from', 'date_to']
date_def = '%Y%m%d'
x_df['date_to_2'] = [datetime.datetime.strptime(str(date_val), date_def) for date_val in x_df['date_to']]
x_df['date_to_3'] = [np.datetime64(date_val, 's') for date_val in x_df['date_to_2']]
# both new columns end up as object dtype, even though date_to_3 holds proper np.datetime64 values
x_df.dtypes
list(x_df['date_to_3'])

I cannot make date_to_3 into a Period (as far as I know), and while it is a perfectly nice np.datetime64 (with unit 's' instead of 'ns'), pandas refuses to see it as such. It's really a pandas limitation afaik; this really is an issue, and Periods do not solve it (in any way that I can see at the moment).

jreback commented 10 years ago

Read them in as ints (you don't really need to do anything special to do this), just don't specify 'parse_dates'

In [27]: x_df.to_csv('test.csv',mode='w')

In [28]: !cat test.csv
,date_from,date_to
0,20120101,20121231
1,20130101,20131231
2,20140101,20141231
3,20150101,99991231

In [29]: df = read_csv('test.csv',index_col=0)

In [30]: df
Out[30]: 
   date_from   date_to
0   20120101  20121231
1   20130101  20131231
2   20140101  20141231
3   20150101  99991231

In [31]: df.dtypes
Out[31]: 
date_from    int64
date_to      int64
dtype: object

Define a converter that creates a period from an int

In [32]: def conv(x):
   ....:     return Period(year=x // 10000, month=(x // 100) % 100, day=x % 100, freq='D')
   ....: 

In [33]: converted = df.applymap(conv)

In [34]: converted
Out[34]: 
    date_from     date_to
0  2012-01-01  2012-12-31
1  2013-01-01  2013-12-31
2  2014-01-01  2014-12-31
3  2015-01-01  9999-12-31

In [35]: converted.iloc[3,0]
Out[35]: Period('2015-01-01', 'D')

In [36]: converted.iloc[3,1]
Out[36]: Period('9999-12-31', 'D')

Of course this could be vectorized / made more natural; e.g. a to_periods function is probably a good idea.
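
For example, one possible vectorized sketch (to_periods here is a hypothetical helper, not an existing pandas function; it relies on the field-based PeriodIndex constructor, which newer pandas exposes as PeriodIndex.from_fields):

import pandas as pd

def to_periods(ints, freq='D'):
    # vectorized field arithmetic on YYYYMMDD integers, one Period per element
    ints = pd.Series(ints)
    return pd.PeriodIndex(year=ints // 10000,
                          month=(ints // 100) % 100,
                          day=ints % 100,
                          freq=freq)

to_periods([20150101, 99991231])
# PeriodIndex(['2015-01-01', '9999-12-31'], dtype='period[D]')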

Give it a go!

CarstVaartjes commented 10 years ago

Thanks! It works perfectly like this, I will test HDF5 saving & queries this week too.

I would really add this snippet to http://pandas.pydata.org/pandas-docs/stable/timeseries.html (and yes, to_periods would be a really great idea :)

jreback commented 10 years ago

ok, why don't you open a feature request for to_periods (with as complete a doc-string as you can)?

CarstVaartjes commented 10 years ago

I've put it on my to do list for later this week (and will unscrupulously copy from to_datetime with the additional Period frequency formats)!

shoyer commented 10 years ago

@jreback I know PeriodIndex is the suggested workaround for dates that don't fit in ns precision, but there is a difference in meaning between periods and datetimes as well -- periods refer to spans of time rather than points in time.

For what it's worth, I've played a bit with np.datetime64 and I don't think the nanosecond decision was driven by numpy. Numpy seems to prefer us/microsecond precision (there are a few more bugs with ns precision).

I do recognize that this may be too late to change now but I do think this is something worth considering. I suspect there are a lot more users who would be happy with a fixed choice of "us" rather than "ns" precision. Of course, it would also be a lot of work to actually do the implementation (and it's not the top priority for me).

jreback commented 10 years ago

I think this would have to be a hybrid, e.g. carry the units around on the Timestamp/DatetimeIndex. Not a big deal to add, BUT it would need some validation.

I think it IS possible, but prob a bit of work.

shoyer commented 8 years ago

Reopening this -- this is still a recurrent issue, despite the workaround of using PeriodIndex.

rabernat commented 8 years ago

👍 From me. The date range limitation is a HUGE issue for many scientific communities. I find it very hard to believe that there are more pandas users who want nanoseconds than who want to use dates before 1678.

jreback commented 8 years ago

As I have stated many times, this would require some major effort. It IS certainly possible, and the extension dtypes are able to support this. But it would really need to be spearheaded by someone for whom this would be very useful.

CarstVaartjes commented 8 years ago

Hi @jreback! I definitely understand, as this https://github.com/pydata/pandas/search?q=ns gives 150 matches! But the desired solution matters a lot, I think. Replacing nanoseconds with microseconds would not be that difficult (thorough testing needed, of course), but it might break things (do we know whether people really use nanosecond precision atm?). Making it dynamic would be more difficult and still raises a series of questions (just off the top of my head):

Can you give some guidance as to what should be the best route for this and what kind of requirements you would have?

P.S. Because of HDF5 and bcolz we couldn't switch to time periods, so we still have this issue and have lots of catch procedures in place to work around it; solving this would be great for me personally.

jreback commented 8 years ago

@CarstVaartjes oh, I think you misunderstand. This would NOT be a replacement of the existing M8[ns]; that would be WAY too disruptive.

Let me re-iterate, this is a bit non-trivial as it touches most of pandas.

ifmihai commented 8 years ago

Trivial or non-trivial, if lots of users are affected (me included), then it should be given priority.

It would be very interesting to have a poll on this, to see the actual reality

jreback commented 8 years ago

@ifmihai are you volunteering? Priority is determined by who does things. Even if lots of people want it, that wouldn't determine priority; sure, it might influence it, but it still comes down to what contributors (all volunteers) actually want to work on / have time for.

ifmihai commented 8 years ago

@jreback I agree with what you say. English is not my first language; I hope I didn't upset anyone. I guess it would have been better to say "it should be given more priority". That's why I added the poll idea. I didn't mean to sound awkward or disrespectful to you guys, who actually implement features in pandas.

as for volunteering, the issue is completely over my head, completely and beyond

shoyer commented 8 years ago

Of course this adds complexity, so maybe it makes sense to just add a single new supported dtype (at least at first, e.g. maybe M8[ms]), which I think gives appropriate ranges.

Agreed, I would consider adding only one new resolution, or at the very least constraining ourselves to units that divide a second, which keeps all the math changes very straightforward (just factors of 1000).

What datetime64 resolutions are actually necessary? For climate purposes, I think us resolution would suffice, and it conveniently matches the resolution of Python's builtin datetime. See this page for the time spans of the various datetime units: http://docs.scipy.org/doc/numpy-1.10.1/reference/arrays.datetime.html#datetime-units
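
A small illustration of both points, using numpy directly (a sketch; exact reprs vary by numpy version):

import numpy as np

# Converting between units that divide a second is just integer scaling by 1000:
t = np.datetime64('2000-01-01T00:00:01', 's')
t.astype('int64')                            # 946684801         seconds since the epoch
t.astype('datetime64[ms]').astype('int64')   # 946684801000      x 1000
t.astype('datetime64[us]').astype('int64')   # 946684801000000   x 1_000_000

# The range trade-off: 'us' comfortably covers Python's datetime range,
# while the same date is out of range for 'ns' (which spans roughly 1678-2262 AD):
np.datetime64('9999-12-31', 'us')
# np.datetime64('9999-12-31', 'ns')  # out of the representable ns range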

jreback commented 8 years ago

@ifmihai no worries.

I think what might be helpful is for an e-mail to go out to:

https://groups.google.com/forum/#!forum/pydata to see if we can get some feedback (put this issue in there), have the feedback posted here, and consolidate.

E.g. what are the various use cases / examples / pseudo-code of where this would be used (and the current workarounds)? I know that people want it, but I agree some more voices would be helpful to shed light on how it would be used. E.g. is implementing, say, M8[ms] good enough (for now)? Do people tend to have data in, say, M8[ms] and THEN add in data at a lower frequency?

ifmihai commented 8 years ago

@jreback

do people tend to have data in say M8[ms] THEN add in data at a lower frequency

I don't understand what you mean. Can you reformulate it in the form of a question, for me to post on the pydata google group? (I just have to add this question, and then I will post the poll there.)

jreback commented 8 years ago

So the main issue is casting.

In [1]: s = Series(pd.date_range('20130101',periods=3,freq='s'))

In [2]: s
Out[2]: 
0   2013-01-01 00:00:00
1   2013-01-01 00:00:01
2   2013-01-01 00:00:02
dtype: datetime64[ns]

In [3]: s[2] = pd.Timestamp('2013-01-01 00:00:02.123456789')

In [4]: s
Out[4]: 
0   2013-01-01 00:00:00.000000000
1   2013-01-01 00:00:01.000000000
2   2013-01-01 00:00:02.123456789
dtype: datetime64[ns]

If, for example, the Series in [2] were actually of datetime64[s] freq, then the assignment in [3] would have to change the freq (as seen in [4]). This is not that big of a deal (we tend to do this as needed for int/float conversions and such things), but it can get somewhat tricky.

Further, it is possible that something can be represented in one freq but not another (e.g. in 'M8[s]' you can represent Timestamp('3000-01-01'), while that is not possible in 'M8[ns]'). So these would again have to convert (or even convert to object).

So this just adds more complexity and more cases.
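
For a sense of the casting rules involved, here is what numpy itself does when units are mixed (a sketch; pandas would have to define analogous promotion rules):

import numpy as np

a = np.array(['2013-01-01T00:00:00'], dtype='datetime64[s]')
b = np.array(['2013-01-01T00:00:02.123456789'], dtype='datetime64[ns]')

# Mixing resolutions promotes to the finer unit...
(b - a).dtype                   # dtype('<m8[ns]')

# ...which is exactly where out-of-ns-bounds values become a problem:
far = np.array(['3000-01-01'], dtype='datetime64[s]')
# far.astype('datetime64[ns]') # overflows (silently or with an error, depending on the numpy version)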

michaelaye commented 8 years ago

I think the argument for having the new default frequency (or an additional one) match the datetime and numpy defaults, i.e. us, sounds like a good one to me. As another datapoint: in spacecraft data analysis, while the spacecraft clock kernel often carries very high precision (microseconds or nanoseconds), instrument data times, like exposure start times and the like, are often only recorded with millisecond precision. So us would definitely be enough for those use cases. This might be different for the gravitational-wave mission LISA Pathfinder, but they are an exception. ;)

jreback commented 8 years ago

don't all of these spacecraft actually use JavaScript? internally? or is that just the telescopes?

shoyer commented 8 years ago

I'm pretty sure that if you proposed running a non-statically typed language on a spacecraft you would be laughed at. Data analysis is a different story, though.


michaelaye commented 8 years ago

What he said. ;)

jreback commented 8 years ago

http://ieeexplore.ieee.org/xpl/login.jsp?tp=&arnumber=6187392&url=http%3A%2F%2Fieeexplore.ieee.org%2Fxpls%2Fabs_all.jsp%3Farnumber%3D6187392

michaelaye commented 8 years ago

Wow, interesting, that explains all the delays! ;) But I still think that's the telescope operation control and planning. I found documents describing the JavaScript engine and how it interfaces with the flight software. In those documents one can see references to C++, and how they use UML and complexity analyses to check their software. That would also explain some of the delays, as all the spacecraft I worked on (Dawn, MRO, BepiColombo) used a simple C-based spacecraft control, and even our instruments had to compile their interfaces to the spacecraft memories and housekeeping in C. I would consider C++ already quite a jump in complexity, certainly not normal for a mission, but a complete JavaScript engine on top of that? Wow, this mission is gonna be complex...

jreback commented 8 years ago

yeah, I heard it's an 'older' JS engine, but JS just the same. I know ~0 about this, just heard it from some people I know.

ifmihai commented 8 years ago

I forgot to put the link to the poll on the pydata google group here: Poll

Up to now: against nanoseconds, 8 users; not interested, 2 users. Some of the answers are very nice, showing use cases.

I hope more people will vote, as people seem to really feel frustrated over this limitation. Microseconds seem to be the common ground.

wesm commented 8 years ago

I've just skimmed this thread for the first time.

I do not think that this issue can be reasonably placed in scope until we come up with a development plan to overhaul the handling of type metadata more generally, as we discussed at some length in January on the mailing list. The technical debt that would be created by shoehorning multiple datetime resolutions into pandas's current implementation would be, to me, unacceptable. As is, having the dichotomy between Period and datetime64 is bad enough.

This is a hard enough problem that affects enough people that it cannot be resolved by the kinds of arguments that have been presented. I was effectively responsible for all of the technical decisions and a significant portion of the time series development in pandas 0.8 which forms the semantics of how time series currently work in pandas.

To give context, many commercial time series databases such as KDB+ have moved to nanoseconds as the default resolution (https://kx.com/press-releases/kx-delivers-nanosecond-timestamps-and-more-in-new-version-of-kdb.php). In early 2012, having just observed the development quagmire (see NumPy 1.6 and 1.7) surrounding the datetime64 dtype in NumPy, I felt (based on my biased experiences working in the financial world) it was a good idea (from a development simplification point of view) to have a single data type, rather than a parametric data type, modeling occurrences at points in time (rather than associated with some time interval). The concession to support more use cases was to adapt the code from scikits.timeseries for other timeperiods. I stand by the decision -- people have reported using pandas for nanosecond-level analysis of data that has been recorded during our lifetimes, and in my opinion this is an important use case to support. Sub-nanosecond data is very much in the domain of nuclear experiments and so forth. As for other scientific use cases of pandas: realize that pandas's plurality use case (if we were truly omniscient) nowadays is in the analysis of data generated by business processes occurring since the advent of computers.

To be clear: I am not dismissing your needs, which are valid. I understand that it is an annoyance, but in addition to the complexity of the augmentation, it would almost certainly have disruptive consequences on users which have years' worth of (potentially untested) code assuming nanoseconds. To really do this properly, we would need to introduce a transitional period in which users have the opportunity to fix their code that assumes nanoseconds (and may well treat the timestamps as integers for, say, truncation purposes) before we change the default unit, if that's what we decide to do.

I'm supportive of adding multiple timestamp resolutions, but there needs to be 1) a significant commitment of developer time, 2) preferably financial support, and 3) it needs to accompany a more significant retooling of pandas's type metadata (specifically: a unification of pandas custom type metadata and numpy type metadata to have a single consistent logical type metadata description for pandas -- the physical representation can continue to be NumPy arrays with physical NumPy dtypes). I had hoped that my schedule this year would turn out more favorably for me to do this work myself on a volunteer basis, but given that it's June it's not looking good -- my spare cycles are going into more-urgently-needed book revisions.

I get that this is hard and potentially frustrating -- pandas is now an 8-year-old codebase, which makes such changes even more difficult as time goes on. All the more reason we may benefit from a pandas 1.0 breaking release (similar to the Python 3 break), with a pandas 0.X.Y LTS bugfix branch, at some point in the future.

shoyer commented 8 years ago

Thanks for sharing your perspective, @wesm. I'm pretty much in complete agreement with you.

jreback commented 8 years ago

As a followup to @wesm's comments: Periods are a nice alternative for representing out-of-bounds timestamp ranges (see the docs here), so you can pretty easily use these.

Recently (thanks to @sinhrks , @MaximilianR ) these are becoming more and more of a first class type.

Certainly, this requires more mental effort from users than a single less-detailed Timestamp type would (and it glosses over the fact that Timestamps are points in time while Periods are time spans), but it is a practical solution (and has been since 0.8).

and in fact these can represent really long ranges

In [15]: pd.Period(datetime.datetime(1,1,1),freq='us')
Out[15]: Period('0001-01-01 00:00:00.000000', 'U')

In [17]: pd.Period(datetime.datetime(3000,1,1),freq='us')
Out[17]: Period('3000-01-01 00:00:00.000000', 'U')

CarstVaartjes commented 8 years ago

Hi,

I understand that inside pandas, Periods work really well. But in terms of IO to/from pandas I'm less sure; typical use cases for me are:

I think all of these default to numpy datetimes (which in themselves do support it, but are overruled by the standard ns casting), so there's a lot of uncertainty there for me. Mind you, I very much understand the headache involved here, and I love pandas and you guys for all the effort you are putting into it. I just wish I had a time machine to go back and bribe Wes with beers over two things (us instead of ns datetimes + grouping not dropping nan values) :)

ifmihai commented 8 years ago

Following what @wesm said

All the more reason we may benefit from a pandas 1.0 breaking release (similar to the Python 3 break), with a pandas 0.X.Y LTS bugfix branch, at some point in the future.

This has been on my mind since the beginning of this issue. I also see it as the most common-sense approach.

Too bad that only a few people voted on the pydata google group. 10 users is not enough.

jzwinck commented 8 years ago

Probably the most likely people to find this page (or the poll) and speak up are those who are dissatisfied with nanoseconds. But there are those of us who are happy with nanoseconds. I am.

Here are some uses of precision finer than 1 microsecond:

In the unlikely event that pandas switched from always nanoseconds to always microseconds, we would have to stop using its time features and store nanoseconds as raw int64. Even if we stipulate that nanosecond precision has no practical use to humans, we need to be able to convey timestamps with full accuracy between systems (e.g. to do database queries).

I do sympathize with those whose timestamp needs exceed the life expectancy of people or nations, but systems are generating more sub-microsecond timestamps, and this trend will not reverse. Adding support for longer horizons is good, but we shouldn't lose our nanos.
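
For instance, a sketch of the int64 round trip that keeps full nanosecond accuracy (e.g. for handing timestamps to another system; behaviour as of current pandas):

import pandas as pd

ts = pd.Series(pd.to_datetime(['2016-05-21 14:34:56.123456789']))

raw = ts.astype('int64')    # nanoseconds since the epoch, lossless
back = pd.to_datetime(raw)  # to_datetime's default unit is 'ns', so this round-trips exactly
assert (back == ts).all()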

jbrockmendel commented 1 year ago

In 2.0 we support "ns", "us", "ms", and "s". Closing as complete.

Gabriel-Kissin commented 1 month ago

In 2.0 we support "ns", "us", "ms", and "s". Closing as complete.

In case anyone else gets to the bottom of this thread:
Here are links to the timeseries user guide and to the "What's new in pandas 2.0" entry, which describe this.
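
For anyone who wants a quick taste of the 2.0 behaviour, a minimal sketch (assuming pandas >= 2.0; the dates are the ones from earlier in this thread):

import numpy as np
import pandas as pd

# Non-nanosecond numpy data keeps its unit instead of being forced to 'ns',
# so out-of-ns-bounds dates like 9999-12-31 are now representable:
s = pd.Series(np.array(['2015-01-01', '9999-12-31'], dtype='datetime64[s]'))
s.dtype                            # datetime64[s]

# and the supported resolutions ('s', 'ms', 'us', 'ns') convert between each other:
s.astype('datetime64[ms]').dtype   # datetime64[ms]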