leap year calculation - Githubissues

magnetizer commented 12 years ago

Hello everybody,

hroptatyr, thanks for the great work on dateutils. I am playing around with it for a few minutes now and I like it already...;-)

Although, I think I found a bug that should be squished, the determination whether a year is a leap year or not.

If you run the following line from bash:

for i in {2000..2100}; do echo dadd $i-02-28 1d; done | bash

you will get the next day after each 28th of february for every year in the range from 2000 to 2100.

What I see from this, is that your code determines that a year is a leap year if it is evenly divisible by the number 4.

As far as I know, this is not the correct definition of a leap year.

To find out whether a year is a leap year sequentially follow the following steps:

0) precondition: the year is in the Gregorian calendar which started October 15, 1582 1) if the year is is evenly divisible by 4, it IS a leap year 2) if the year is is evenly divisible by 100, it IS NOT a leap year (although it is evenly divisible by 4) 3) if the year is is evenly divisible by 400, it IS a leap year (although it is evenly divisible by 100)

Here are some sources to check this: http://en.wikipedia.org/wiki/Gregorian_calendar http://en.wikipedia.org/wiki/Leap_year

Kind regards, André Kaldenhoven

hroptatyr commented 12 years ago

Hey Andre, thanks for the input :) Yes, I'm well aware of this 'problem', what I can offer is a configure option to turn on proper dates, maybe even make it the default, i.e. --enable-compliance or so, but consider the following:

dateutils primary focus is to aid financial applications, in this context it's very rare that dates beyond 2099 actually matter
we're the lucky buggers whose life-spans never touch any of the more complicated rules, 1901 to 2099 behave as if only rule 1 existed
saving some extra time by doing sloppy date arithmetics comes in handy when you have to process files in the order of many hundreds of billions of lines

Having said that, maybe to make users more wary of why this is, I should call my proposed configure option --enable-fast-arithmetic. What do you think?

Just a side note, to save you some pids the above example could be written

 for i in {2000..2100}; do echo ${i}-02-28; done | dadd 1d

magnetizer commented 12 years ago

Hey Sebastian,

ok, I can see your special use case when I was wondering about algorithmic correctness. I didn't think about using your software for billions of calculations where performance issues could come to life.

Being one of the "lucky buggers" I am still convinced it would be nice to make the software year 2.1k proof. What if we survive the collapse of peak oil?? Then there will still be people around in 2100 who will like to manipulate some dates for their financial calculations...;-) (Whether they will be using your code is another matter...;-))

I think the option "--enable-fast-arithmetic" would be a good solution! It could even stand for several different methods to improve calculation speed (one being the fast leap year calculation). If we are only talking about the leap year 'problem' something like '--enable-fast-leap-year-calculation' would be somewhat clearer, I suppose.

I will certainly follow the development of your code but please only put in the effort into this matter if you think it is worth it.

Greetz, André

PS: Thanks for the bash oneliner, I didn't think about this. Indeed this will save some pids, but then still I am not doing billions of calculations...;-)

PS2: Perhaps something smart can be done to minimize the number of calculations to determine leap years by making use of facts like: If the year is divisible by 400 it is most certainly divisible by 4 and by 100, so it is a leap year. Or perhaps calculation speed will improve by making use of things like dividing by 4 is the same as binary shifting right twice? I haven't thought this through (perhaps the C compiler already does this), but I am certain someone already looked into it...;-) If I think of something fast and correct I will let you know.

magnetizer commented 12 years ago

Hey Sebastian,

I just did 1 billion leap year calculations with a simple test script (range: 2,000<=year<10,000, 100,000 repetitions). Indeed, you need some extra time (37 seconds) using the exact calculation:

if((year % 4 == 0 && year % 100 != 0) || year % 400 == 0) return 1; /* leap / else return 0; / no leap */

instead of the 'sloppy' one:

if(year % 4 == 0) return 1; /* leap / else return 0; / no leap */

Here are the test results:

time ./exact_test real 7m53.707s user 1m38.358s sys 1m0.320s

time ./sloppy_test real 7m16.501s user 1m9.848s sys 0m47.031s

The difference of 37 seconds is about 8% performance gain. If this is a lot or not really depends on what you are about to do with it. For many purposes the difference will not be measurable (<1,000 calculations) and even for those huge numbers as 1 billion calculations the extra delay of 37 seconds on more than 7 minutes could be bearable for some people.

As your software is specially developed for the case of great numbers of calculations, I reaffirm that the optional use of "fast calculations" is a good idea.

Greetz, André

PS: I'm still looking to find a faster way than if((year % 4 == 0 && year % 100 != 0) || year % 400 == 0)

If I find something we can compare the results.

hroptatyr commented 12 years ago

Hey André, I've sorted this issue now, I agree with you, just leaving the correct code out to squeeze some microseconds out of my CPU is daft actually.

The fix is in: 0f8081f1 and 1fdff9fb

I've gone for --enable-fast-arith and disabled it by default. I don't think it's very helpful to overwhelm users with gazillions of --enable' and--disable' options. Users with more specific needs, like myself, can be expected to dig a bit deeper to find their gold, can't they? :)

hroptatyr commented 12 years ago

Hehe, well I doubt it's going to be much faster than that in that form.

The problem (according to my profiling) is NOT the actual computation it's the additional branches you introduce, which may cause pipeline stalling or polluted level1 caches.

magnetizer commented 12 years ago

Hey Sebastian,

wow, that is a fast respond. I will check out your updated code.

I really did not mean to set you up with extra work but thanks for the fix. In the meantime (just before your fix) I tested different ways to determine leap years and measure their performances. I computed all methods 1 billion times.

The execution time is somehow slower than the first time (7 min 50 sec) which is strange because this time I restarted my computer and deactivated unnecessary processes. So perhaps, it is a glitch because my machine is not a dedicated, thought-through testing machine. Nevertheless, I find the results interesting and I happily share them with you.

So, just for the record (do not feel obliged to implement anything of this), here are some test results. The differences are small and I am convinced we are looking at results which need further testing for confirmation.

I used different methods to define whether a year is a leap year (The code is changed a bit by copying everything to one line in order to save space...I can provide the complete original test scripts):

1) the 'sloppy' one: if (year % 400 == 0) return 1; else return 0; real 10m52.898s

2) the exact one that can be found everywhere on the internet: if ((year % 4 == 0 && year % 100 != 0) || year % 400 == 0) return 1; else return 0; real 11m15.363s

As expected, this is a bit slower than the 'sloppy' method.

3) first optimization attempt (reordering operands could help (a bit) since the second part (the complicated one) does not have to be evaluated if the first part is already true. It only helps if the year range is huge so that the first operand is 'true' sometime) : if (year % 400 == 0 || (year % 4 == 0 && year % 100 != 0)) return 1; else return 0; real 11m2.759s

Yes indeed, it seems to be faster than method 2) ;-)

4) second optimization attempt: if (year % 400 == 0) return 1; if (year % 100 == 0) return 0; if (year % 4 == 0) return 1; return 0; real 11m3.122s

Ouch, I thought this to be an optimization of method 3) but it shows the execution time is a little higher.

5) third optimization attempt (based on the idea that shifting + multiplying could be faster than a division) if (year % 400 == 0 || (year == (year >> 2) * 4 && year % 100 != 0)) return 1; else return 0; real 10m40.427s

Wow, this method even seems to be faster than the sloppy method...:-)

Clearly, this result needs confirmation (even if it is not as daunting as the 'faster than light neutrinos' CERN thinks to have found recently...;-) ). If this is a real optimization, I am sure some more tricks can be done. But then it is only a theoretical optimization because nobody cares about this seconds. And besides, you could be right that effects like pipeline stalling or polluted level1 caches could be the real bottleneck...;-)

Thus, Sebastian, never mind for my post, thanks for your quick reply and action and for closing this 'bug'.

Kind regards, André

hroptatyr / dateutils

leap year calculation #3