unicode-org / icu4x

Solving i18n for client-side and resource-constrained environments.
https://icu4x.unicode.org
Other
1.34k stars 174 forks source link

Persian calendar conversion uses wrong algorithm #4713

Closed roozbehp closed 5 months ago

roozbehp commented 6 months ago

Looking at https://github.com/unicode-org/icu4x/blob/main/components/calendar/src/persian.rs, it looks like the “proposed” 2820-year cycle arithmetic calendar has been implemented.

Unfortunately, that algorithm is wrong and fails in exactly a year from now, on March 20, 2025. While it should return 30 Esfand 1403 AP, it mistakenly returns 1 Farvardin 1404 AP.

The correct algorithm, as specified in Calendrical Calculations book, is the astronomical algorithm. That's what is followed on the ground (although with an unspecified location). Since that is too complex to implement, a simple arithmetic algorithm that works at least until ~2090 CE has been implemented in ICU4C and ICU4J. (See the comments at top of https://github.com/unicode-org/icu/blob/main/icu4j/main/core/src/main/java/com/ibm/icu/util/PersianCalendar.java). That simple algorithm should be implemented in ICU4X as soon as possible, since we just entered the year 1403 AP, which is leap on the ground, but ICU4X miscalculates as non-leap.

Here's a quote from an email I sent to @echeran, @sffc, @Manishearth, @mihnita, and others on June 21, 2023:

The Persian calendar doesn't need a complex astronomical implementation for the foreseeable future. I don't recommend implementing an astronomical Persian calendar, except for research purposes. Instead, implement the Persian calendar the same way ICU and ICU4J implement it, which is a simple 33-year cycle. It matches the practice for many years into the future.

sffc commented 5 months ago

Thank you @roozbehp for highlighting this.

As you noted, Calendrical Calculations specifies two Persian calendar methods. The algorithm in section 15.2 uses astronomical calculations:

persian-new-year-on-or-before(date) = solar-longitude (midday-in-tehran (day)) <= spring + 2deg

The other algorithm in 15.3 uses the 2820 cycle.

I don't see one in the book that uses a pure 33-year cycle. Is the ICU implementation the standard for that? What are the other pros and cons of that approximation versus the 2820 approximation?

When we implemented the calendrical calculations in ICU4X, we endeavored to follow the Reingold algorithms in order to not be the source of truth of any approximations.

sffc commented 5 months ago

My understanding of the situation is: the Reingold text published an approximation for the Persian calendar, which we are calling the 2820 approximation, and that approximation is wrong in 2025 CE. @roozbehp suggests a different approximation, which works between 1925 and 2090 CE.

Here's one possible approach we could take. In the Islamic and Chinese calendars, we implement a table-based fast path, where we ship precomputed year/month data for a certain range (I think +/- 100ish years by default), and fall back to the astronomical calculations for dates outside that range. We could move Persian over to using that framework, and throw out the 2820 approximation.

What do you think @roozbehp @Manishearth @echeran?

echeran commented 5 months ago

I ran a small test and confirmed that ICU4J does indeed give us Persian date 1403-12-30 for Gregorian date 2025-03-20, which matches the book's persian-from-fixed function that implements the astronomical calendar. Meanwhile, we saw that the 2820-based arithmetic algorithm, which is arithmetic-persian-from-fixed in the book, gives Persian date 1404-01-01. So in other words, a Persian date gets skipped when going from consecutive Gregorian days:

offset Gregorian ICU4J astronomical (persian-from-fixed) 2820-based (arithmetic-persian-from-fixed)
-1 2025-03-19 1403-12-29 1403-12-29 1403-12-29
0 2025-03-20 1403-12-30 1403-12-30 1404-01-01
1 2025-03-21 1404-01-01 1404-01-01 1404-01-02
40 2025-04-29 1404-02-09 1404-02-09 1404-02-10
400 2026-04-24 1405-02-04 1405-02-04 1405-02-04
roozbehp commented 5 months ago

Trying to answer the questions raised, as well as adding a few points:

  1. "I don't see one in the book that uses a pure 33-year cycle." That is actually briefly mentioned on top of page 265: "this 33-year cycle agrees with the astronomical calendar over an even longer period, 1046–1468 AP (1621–2421 Gregorian) [sic?]." It's unfortunate that the book doesn't spend more pages on this simpler and more accurate arithmetic calendar instead of the 2820-year cycle one, but the 33-year cycle one is very well-known in Iran and is widely implemented in software.

  2. "Is the ICU implementation the standard for that?" Unfortunately I don't understand the question here.

  3. "What are the other pros and cons of that approximation versus the 2820 approximation?" The pros are that a) it's accurate for a longer period of time; b) it's simpler and slightly faster; and c) It's the same algorithm implemented in ICU and ICU4J. I don't know of any cons compared to the 2820-year cycle.

  4. "When we implemented the calendrical calculations in ICU4X, we endeavored to follow the Reingold algorithms in order to not be the source of truth of any approximations." Unfortunately we can't close our eyes to the actual usage on the ground. All the paper calendars published in Iran have 1403 AP as a leap year, which doesn't match the calculations of the 2820-year cycle.

  5. "@roozbehp suggests a different approximation, which works between 1925 and 2090 CE." That is correct, with the caveat that it may work for even a longer period of time, more on that below.

  6. "we ship precomputed year/month data for a certain range (I think +/- 100ish years by default), and fall back to the astronomical calculations for dates outside that range. We could move Persian over to using that framework, and throw out the 2820 approximation. What do you think?" That will work for me, with one caveat: the astronomical algorithm from the book will need to be slightly modified. More on that below.

  7. The Calendrical Calculations book, in the footnote on page 259, mentions that someone wrote to them and claimed that the 52.5 East meridian should be used instead of Tehran as the locale for the astronomical calculations, but that the authors' reading of the law doesn't specify a locale. While it's correct that the law doesn't specify a locale, I can confirm from conversations with calendar authorities in Iran as well as several Persian publications that the Iranian calendar authorities use the 52.5 E meridian in their calculations as a "neutral" location. The book mentions that the differences are not too many, but that the next such difference happens in 2091 CE. Indeed, according to the astronomical calculations in the book, 1469 AP would be non-leap and 1470 AP would be leap (this matches the tables published by the same authors in "Calendrical Tabulations 1900–2200"). But according to the Iranian calendar authorities, 1469 AP is leap and 1470 AP is not: https://calendar.ut.ac.ir/Fa/News/Data/Doc/KabiseShamsi1206-1498-new.pdf. If you make that locale change to the astronomical algorithm in the book, it will match the table published by the Iranian calendar authority linked above. (A good test date is 2091-03-20. According to the Iranian calendar authority tables and the modified astronomical algorithm using the 52.5 E meridian, it will be 1469/12/30 AP. According to the original astronomical algorithm and the Calendrical Tabulations book, it would be 1470/1/1 AP.)

  8. I have a partial Python port of the Lisp code from Calendrical Calculations at https://github.com/roozbehp/persiancalendar, which I have modified according to the facts on the ground (which is just changing the locale used for calculating the true noon to 52.5 E meridian). The output of that code matches the table published by the Iranian calendar authority. It also matches the 33-year cycle algorithm implemented in ICU/ICU4J until early 2124 CE (the first day when they don't match is 2124-03-20, which should be 1503/1/1 AP, but the 33-year cycle incorrectly calculates as 1502/12/30 AP.)

  9. I think the quick solution going forward is simply replacing the algorithm used in ICU4X by the one used in ICU/ICU4J. That covers us until early 2124 CE, about a hundred years from now. (For the years before 1925 CE it really doesn't matter what we do, because it's proleptic anyway: this calendar didn't exist in this form until 1925 CE.)

  10. When we find the time, we can add (a modified) astronomical code for dates on or after 2124-03-20. We have about a hundred years to do that! 😉

Manishearth commented 5 months ago

Is the ICU implementation the standard for that?

IMO this is a misunderstanding of how these calendar algorithms work: For most of them the standard is whatever the calendar authorities[^2] say (often a proxy for what happens in the actual skies), because that's actually what people use on the ground. Aside from the Hebrew calendar[^1] most of these things have no "standard" algorithm and rather have a general set of rules. It is rare that the relevant calendar authorities will actually publish an algorithm, it is sometimes even possible that the authorities will use different algorithms at different times.

This is also a reason as to why I'm not too bothered with ICU4X's Chinese calendar implementation diverging from ICU4C in a couple centuries. There's no right answer for those dates.

Basically, whenever there is ground truth to follow, we should follow it.

I think the quick solution going forward is simply replacing the algorithm used in ICU4X by the one used in ICU/ICU4J. That covers us until early 2124 CE, about a hundred years from now. (For the years before 1925 CE it really doesn't matter what we do, because it's proleptic anyway: this calendar didn't exist in this form until 1925 CE.)

This seems good.

[^2]: I say "authorities" loosely, in some cases it's more just "whoever is in the business of publishing calendars" like कालनिर्णय in India. They probably do some amount of coordination for future years and will notice when there is potential for mismatch. [^1]: Which follows a very rigorous algorithm with well-defined approximations, where the approximations are intentional and expected. Put differently, that calendar drifts out of synch with the moon and it's okay with it (though if it gets too bad I'd expect them to tweak it eventually)

sffc commented 5 months ago

I apologize for sloppily using the word "standard." What I intended to ask was whether the ICU code is the source of truth for this approximation, as opposed to some textbook or wikipedia page or blog post Roozbeh made or something. I.e., what is the source that goes into the documentation.

sffc commented 5 months ago

In general I prefer not implementing approximations in the ICU4X code and instead using speedups like we've done with the Chinese calendar. I consider it a bug that the Persian code currently uses an approximation, and the bug is worse given that the approximation fails next year, but any approximation is still a bug. We have a system for implementing fast, non-approximations for calendrical calculations, so why not use it.

sffc commented 5 months ago

ICU4X Working Group discussion:

Conclusion:

Shane's version:

LGTM: @sffc @manishearth

Manish's version:

Agreed: @Manishearth, @younies, @echeran

roozbehp commented 5 months ago

Thanks a lot for the discussion and the detailed notes. I have a question: what is the theoretical date range we care about here? I have been thinking about compressed tables generated using the (slightly modified) astronomical algorithm from the Calendrical Calculations book (Persian calendar being relatively regular, we just need to keep leap year meta patterns). But I don't think the astronomical methods in the book return meaningful results beyond, say, 9999 CE (which is the end of ISO 8601), if that. Creating tables for all of i32 will take a lot of space and is probably noise anyways.

roozbehp commented 5 months ago

BTW, since the Iranian calendar authority website is not always accessible from outside Iran (and it's in Persian anyways), I just created a data file based on it and put it at: https://github.com/roozbehp/persiancalendar/blob/main/kabise.txt

Manishearth commented 5 months ago

I have a question: what is the theoretical date range we care about here?

I think as far in the past as that version of the calendar has been in use (so e.g. even though the Japanese calendar is old, the specific gregorian-based version they use now is relatively new), and my general take is 100 years into the future.

roozbehp commented 5 months ago

I still don't know Rust, so I did a prototype for the replacement algorithm in Python. The code is very closely based on the Calendrical Calculations code, but with the 2820-year rule replaced by the 33-year rule and an override table. Its results match the (modified) astronomical algorithm from 1178 to 3000 AP (about 1799 to 3621 CE). If you don't need to go that much into the future, you can just drop some rows from the override table.

The code is very short and fast. It just needs a set membership lookup, which I hope Rust can do cheaply.

Here's the code (Apache-licensed, since it's based on the Calendrical Calculations code): https://github.com/roozbehp/persiancalendar/blob/main/persiancalendar_fast.py

roozbehp commented 5 months ago

Ok, I couldn't sleep so I taught myself some Rust and created a pull request to replace the existing arithmetic rule with the one in ICU and ICU4J. This should fix the immediate concern about March 20, 2025 and will work correctly until the end of 1501 AP (~2122 CE). I can work on getting the override table in after that.

sffc commented 5 months ago

I have a question. For distant dates, say 1000 or more years in the future, is there a drift between the 33-year and 2820-year cycles relative to what we predict to be the ground truth (the astronomical calculations)? For example, the 33-year cycle has a 100% accuracy rate for the next 100 years, and >90% for at least a few hundred years after that. We know that the 2820-cycle has a ~90% accuracy rate for the next 100 years. But 1000 years in the future, would the 2820 cycle hold its 90% accuracy rate while the 33 cycle drifts away?

What I'm getting at is that we could perhaps keep both models and switch to 2820 over a certain bound.

roozbehp commented 5 months ago

For distant dates, say 1000 or more years in the future, is there a drift between the 33-year and 2820-year cycles relative to what we predict to be the ground truth (the astronomical calculations)? For example, the 33-year cycle has a 100% accuracy rate for the next 100 years, and >90% for at least a few hundred years after that. We know that the 2820-cycle has a ~90% accuracy rate for the next 100 years. But 1000 years in the future, would the 2820 cycle hold its 90% accuracy rate while the 33 cycle drifts away?

I did a quick comparison for the period 2403 AP (exactly 1000 years into the future) to 9378 AP (9999 CE, end of ISO 8601). The number of leap years that was wrong for the 33-year algorithm was 2511, compared to 2996 for the 2820-year algorithm. I also compared the period 1304 AP (introduction of the caldendar in 1925 CE) to 2403 AP (exactly a thousand years into the future). That came up with 70 wrong leap years for the 33-year rule and 205 wrong leap years for the 2820-year rule.

What I'm getting at is that we could perhaps keep both models and switch to 2820 over a certain bound.

Looks like the 2820-year is just simply worse over any period we may care about, even for years far into the future.

sffc commented 5 months ago

Thanks for the comparison @roozbehp!

The number of leap years that was wrong for the 33-year algorithm was 2511, compared to 2996 for the 2820-year algorithm.

By "wrong" I assume you mean relative to the astronomical calculation?

If the calendar correctly determines a leap year, does it imply that the dates in that year (new year and month lengths) will be correct?

roozbehp commented 5 months ago

By "wrong" I assume you mean relative to the astronomical calculation?

Yes. The “modified” astronomical calculation (using the meridian the Iranian calendar authority uses). That's the only source of truth we have outside of the period that the Iranian calendar authority publishes.

If the calendar correctly determines a leap year, does it imply that the dates in that year (new year and month lengths) will be correct?

Almost, but not exactly. If a calendar correctly determines a leap year, it would imply that the last day in that year and all the dates in the next year will be correct. It's a very good test for the Persian calendar, since by knowing all the leap years, you can quickly and unambiguously deduce everything else. In the Persian literature about the Persian calendar, certain years being leap or not is the shorthand for correctly calculating the calendar.

roozbehp commented 5 months ago

Thanks everybody. I just filed https://unicode-org.atlassian.net/browse/ICU-22736 to sync ICU4C and ICU4J with this.