Extremely inefficient instantiating a distant local date

autarch commented 7 years ago

Migrated from rt.cpan.org #47671 (status was 'open')

Requestors:

mschwern@cpan.org (@schwern)

From mschwern@cpan.org (@schwern) on 2009-07-08 02:01:30:

$ time perl -wle 'use DateTime; print DateTime->new( year => 3058, time_zone => "local" )' 3058-01-01T00:00:00

real 0m9.558s user 0m8.619s sys 0m0.165s

This occurs whether time_zone is "local" or set to a zone like "America/Chicago".

Profiling reveals that this makes tens of thousands of method calls. The big hog is 10000 calls to DateTime::set_time_zone which is being called by add_duration() which all traces back to DateTime::TimeZone::_generate_spans_until_match looping over OlsenDB calls 2000+ times.

This all traces back to DateTime->offset called by DateTime::_handle_offset_modifier and DateTime::_calc_local_rd

Its possible one could hold off doing whatever it is that calc_local_rd does, considering that information is not necessary just to instantiate an object, but the root problem is _generate_spans_until_match's algorithm. I'm not sure what its purpose is, but it appears to be iterative by year resulting in Y iterations where Y is the year being search for (minus some arbitrary lower limit).

If the whole purpose is to determine the time zone offset, this algorithm is much more efficiently done as a binary search. Even if we assume a 64 bit year it will take at most 64 iterations to zero in on the right year. And so on for the month, day, hour, minute and second.

If its just searching the Olsen DB for applicable zone changes I don't pretend to understand how that works but there's got to be a better way to do it. C libraries can do it in milliseconds and Perl is just not that slow. Perhaps with some sort of... database?

autarch commented 7 years ago

From autarch@urth.org (@autarch) on 2009-07-08 02:08:38:

On Tue, 7 Jul 2009, Michael G Schwern via RT wrote:

Its possible one could hold off doing whatever it is that calc_local_rd does, considering that information is not necessary just to instantiate an object, but the root problem is _generate_spans_until_match's algorithm. I'm not sure what its purpose is, but it appears to be iterative by year resulting in Y iterations where Y is the year being search for (minus some arbitrary lower limit).

It's generating all the time zone changes from X years out (by default, each time zone ships with 10 years of pre-calculated data. The rest is generated as needed.

Note that this only affects zones with DST rules in effect.

If the whole purpose is to determine the time zone offset, this algorithm is much more efficiently done as a binary search. Even if we assume a 64 bit year it will take at most 64 iterations to zero in on the right year. And so on for the month, day, hour, minute and second.

The problem being that there's nothing to search until it's generated.

While I'm sure this could be made faster, I'm not sure it's worth too much effort. Project Olson time zones that far in the future is pretty silly. Time zones are political entities, and it's almost certain that rules in effect now won't be in effect in the year 3058.

Just use UTC.

-dave

/============================================================ http://VegGuide.org http://blog.urth.org Your guide to all that's veg House Absolute(ly Pointless) ============================================================/

autarch commented 7 years ago

From mschwern@cpan.org (@schwern) on 2009-07-08 03:02:01:

On Tue Jul 07 22:08:38 2009, autarch@urth.org wrote:

The problem being that there's nothing to search until it's generated.

Are you saying it regenerates the database at run time every time? How about doing it once when DateTime::TimeZone is installed?

While I'm sure this could be made faster, I'm not sure it's worth too much effort. Project Olson time zones that far in the future is pretty silly. Time zones are political entities, and it's almost certain that rules in effect now won't be in effect in the year 3058.

Just use UTC.

What the hell kind of answer is that?! I'll take "yeah, that sucks but I don't have the tuits to optimize it. Here's some thoughts on how to do it..." but not "use UTC". Come on. While time zones are annoying, you can't just blow them off! Especially as the maintainer of the most heavily used date module in Perl!

While I agree worrying about the niggling details of historical time zone changes 1000 years from now is silly, THE SUN WILL STILL BE IN THE SKY and that's what a time zone offset basically reflects. And while optimizing the code might be hard we're not talking milliseconds here. That's 10 seconds to go forward a mere 1000 years. Even calculating 2109 takes a full second. 2150 is 1.3 seconds. That's unacceptable.

Another simple optimization would be to cache what the latest Olsen entry is and then just not bother looking beyond that.

Somewhat apropos, for dates far in the past, before time zones, I've seen 64 bit Linuxen use the lat/long for the city in question which is pretty clever.

autarch commented 7 years ago

From autarch@urth.org (@autarch) on 2009-07-08 03:37:39:

On Tue, 7 Jul 2009, Michael G Schwern via RT wrote:

  Queue: DateTime
Ticket <URL: https://rt.cpan.org/Ticket/Display.html?id=47671 >

On Tue Jul 07 22:08:38 2009, autarch@urth.org wrote:

The problem being that there's nothing to search until it's generated.

Are you saying it regenerates the database at run time every time? How about doing it once when DateTime::TimeZone is installed?

Storing the changes in memory uses up a fair bit of memory.

Generating 1000+ years worth would make loading each time zone take much much more memory. It does it at runtime to optimize for the common case of not needing to look too far in the future.

Just use UTC.

What the hell kind of answer is that?! I'll take "yeah, that sucks but I don't have the tuits to optimize it. Here's some thoughts on how to do it..." but not "use UTC". Come on. While time zones are annoying, you can't just blow them off! Especially as the maintainer of the most heavily used date module in Perl!

It's the right answer. Using Olson time zones with far future dates just doesn't make any sense. These time zones exist for the purpose of dealing with the dictates of modern political entities.

While I agree worrying about the niggling details of historical time zone changes 1000 years from now is silly, THE SUN WILL STILL BE IN THE SKY and that's what a time zone offset basically reflects. And while

Yes, that's what the offset reflects. However, the DST changes reflect the random whims of idiot politicians.

If you just want an offset, use a static offset like '-0500'. That will be fast, since there's no DST rules to deal with.

Another simple optimization would be to cache what the latest Olsen entry is and then just not bother looking beyond that.

That doesn't make any sense. The last Olson entry is purely arbitrary based on how many years of pre-calculation the module ships with, and when I happen to generate it.

Or do you mean the offset without DST? Then someone else will complain that it's broken at some arbitrary date (like 11 years out).

For people who need to calculate far future dates, there are other solutions (UTC or some other fixed offset) that work. But either way, you as the user need to make a decision about what you want. I can't make that decision for you, I just provide the options.

If you want to take a stab at optimizing the changes generation, that's fine with me, but personally I don't think it's wortwhile, since there's no sane use case for Olson time zones in the far future.

Somewhat apropos, for dates far in the past, before time zones, I've seen 64 bit Linuxen use the lat/long for the city in question which is pretty clever.

A DT::TZ::LatLong would be very cool.

Actually, the Olson database approximates this, since pre-modern times it uses local mean time, which is an offset based off the lat/long of the named city.

-dave

/============================================================ http://VegGuide.org http://blog.urth.org Your guide to all that's veg House Absolute(ly Pointless) ============================================================/

autarch commented 7 years ago

From schwern@pobox.com on 2009-07-08 09:57:18:

autarch@urth.org via RT wrote:

On Tue Jul 07 22:08:38 2009, autarch@urth.org wrote:

The problem being that there's nothing to search until it's generated. Are you saying it regenerates the database at run time every time? How about doing it once when DateTime::TimeZone is installed?

Storing the changes in memory uses up a fair bit of memory.

Generating 1000+ years worth would make loading each time zone take much much more memory. It does it at runtime to optimize for the common case of not needing to look too far in the future.

Generating 1000+ years of what? The Olsen database doesn't go forward that far. And who says it has to all be in memory? There's any number of techniques to leave things on disk.

The C code figured this out 20 years ago!

Just use UTC. What the hell kind of answer is that?! I'll take "yeah, that sucks but I don't have the tuits to optimize it. Here's some thoughts on how to do it..." but not "use UTC". Come on. While time zones are annoying, you can't just blow them off! Especially as the maintainer of the most heavily used date module in Perl!

It's the right answer. Using Olson time zones with far future dates just doesn't make any sense. These time zones exist for the purpose of dealing with the dictates of modern political entities.

While I agree worrying about the niggling details of historical time zone changes 1000 years from now is silly, THE SUN WILL STILL BE IN THE SKY and that's what a time zone offset basically reflects. And while

Yes, that's what the offset reflects. However, the DST changes reflect the random whims of idiot politicians.

You're absolutely correct, it sucks and its broken. However, you maintain a date/time library which deals with time zones. You have to deal with the reality. Full stop.

If you just want an offset, use a static offset like '-0500'. That will be fast, since there's no DST rules to deal with.

I, in fact, don't want an offset at all. I just want a DateTime object representing the date I gave it. Why is it calculating the offset just to instantiate? Can't it wait until I ask for it?

Another simple optimization would be to cache what the latest Olsen entry is and then just not bother looking beyond that.

That doesn't make any sense. The last Olson entry is purely arbitrary based on how many years of pre-calculation the module ships with, and when I happen to generate it.

Or do you mean the offset without DST? Then someone else will complain that it's broken at some arbitrary date (like 11 years out).

Ok, maybe I'm missing something here. If I hand you something like "July 9th, 2009 America/Chicago" and ask what the offset is taking into account DST you look at the last DST rule before that date and apply it. Let's say the last rule change was 2007 (which I think it was) so you look at that and apply it and that's that, right?

Why would the algorithm be any different for 3009? What's all this extra calculation you're talking about? Are you generating rules for every year up to 3009 or something? Wouldn't that just be repeating the same calculation over and over?

For people who need to calculate far future dates, there are other solutions (UTC or some other fixed offset) that work. But either way, you as the user need to make a decision about what you want. I can't make that decision for you, I just provide the options.

If you want to take a stab at optimizing the changes generation, that's fine with me, but personally I don't think it's wortwhile, since there's no sane use case for Olson time zones in the far future.

Look, all I know is every other datetime library that goes past 2038 doesn't take 10 seconds to make this calculation. Something is wrong here.

Somewhat apropos, for dates far in the past, before time zones, I've seen 64 bit Linuxen use the lat/long for the city in question which is pretty clever.

A DT::TZ::LatLong would be very cool.

Actually, the Olson database approximates this, since pre-modern times it uses local mean time, which is an offset based off the lat/long of the named city.

That sounds like it.

The past has a vote, but not a veto. -- Mordecai M. Kaplan

autarch commented 7 years ago

From mschwern@cpan.org (@schwern) on 2009-07-08 19:57:25:

I've dug into the DateTime::TimeZone code and now I understand why we're talking past each other. I'm talking in terms of Olsen rules database (after this year this time zone changes to DST at 2am on the 1st Sunday) and you're talking in terms of these "span" things which cache the calculation, one for every DST change. Now I see why you'd think it has to be expensive and why you'd think it was untenable to store.

Since they're linear, and since the only way to search through them is binary, I can see why TimeZone is locked into having to generate them all and why it can't just leave them on disk. What are they indexed by? Julian days? What's "rd" as in "ymd2rd"?

I understand the approach, but there's got to be a better way to do it that doesn't involve calculating each DST change up to the current year. Cached perhaps. I also think its possible to make use of the fact that a DateTime object already knows what year it is to perhaps just cache this all in a DBM file.

autarch commented 7 years ago

From autarch@urth.org (@autarch) on 2009-07-08 21:21:33:

On Wed, 8 Jul 2009, Michael G Schwern via RT wrote:

  Queue: DateTime
Ticket <URL: https://rt.cpan.org/Ticket/Display.html?id=47671 >

I've dug into the DateTime::TimeZone code and now I understand why we're talking past each other. I'm talking in terms of Olsen rules database (after this year this time zone changes to DST at 2am on the 1st Sunday) and you're talking in terms of these "span" things which cache the calculation, one for every DST change. Now I see why you'd think it has to be expensive and why you'd think it was untenable to store.

Since they're linear, and since the only way to search through them is binary, I can see why TimeZone is locked into having to generate them all and why it can't just leave them on disk. What are they indexed by? Julian days? What's "rd" as in "ymd2rd"?

They're indexed by UTC days.

I understand the approach, but there's got to be a better way to do it that doesn't involve calculating each DST change up to the current year. Cached perhaps. I also think its possible to make use of the fact that a DateTime object already knows what year it is to perhaps just cache this all in a DBM file.

Probably the quickest, most efficient way would be to have the time zone generator spit out some code that encapsulates the decisions that need to be made, one when given UTC values, another for local. This would require some changes to DateTime as well, since right now it passes in just the rata die day and seconds.

-dave

/============================================================ http://VegGuide.org http://blog.urth.org Your guide to all that's veg House Absolute(ly Pointless) ============================================================/

autarch commented 7 years ago

From schwern@pobox.com on 2009-07-08 23:43:42:

autarch@urth.org via RT wrote:

<URL: https://rt.cpan.org/Ticket/Display.html?id=47671 >

On Wed, 8 Jul 2009, Michael G Schwern via RT wrote:
  Queue: DateTime
Ticket <URL: https://rt.cpan.org/Ticket/Display.html?id=47671 >

I've dug into the DateTime::TimeZone code and now I understand why we're talking past each other. I'm talking in terms of Olsen rules database (after this year this time zone changes to DST at 2am on the 1st Sunday) and you're talking in terms of these "span" things which cache the calculation, one for every DST change. Now I see why you'd think it has to be expensive and why you'd think it was untenable to store.

Since they're linear, and since the only way to search through them is binary, I can see why TimeZone is locked into having to generate them all and why it can't just leave them on disk. What are they indexed by? Julian days? What's "rd" as in "ymd2rd"?
They're indexed by UTC days.

Presumably this is the number of UTC days since some epoch? What's the epoch?

I understand the approach, but there's got to be a better way to do it that doesn't involve calculating each DST change up to the current year. Cached perhaps. I also think its possible to make use of the fact that a DateTime object already knows what year it is to perhaps just cache this all in a DBM file.

Probably the quickest, most efficient way would be to have the time zone generator spit out some code that encapsulates the decisions that need to be made, one when given UTC values, another for local. This would require some changes to DateTime as well, since right now it passes in just the rata die day and seconds.

Ah HA! That's the mysterious "rd".

Alligator sandwich, and make it snappy!

autarch commented 7 years ago

From autarch@urth.org (@autarch) on 2009-07-09 01:32:56:

Can we take this conversation to the datetime@perl.org list?

/============================================================ http://VegGuide.org http://blog.urth.org Your guide to all that's veg House Absolute(ly Pointless) ============================================================/

houseabsolute / DateTime-TimeZone

Extremely inefficient instantiating a distant local date #12