unicode-org / icu4x

Solving i18n for client-side and resource-constrained environments.
https://icu4x.unicode.org
Other
1.37k stars 176 forks source link

Loading cache data for lunar calendars #3933

Closed sffc closed 6 months ago

sffc commented 1 year ago

@Manishearth - This is 1.3 blocking because the API of a calendar that has no data versus one that does is different. We could theoretically remove constructors from these types for 1.3 and only do things through AnyCalendar. But then these APIs are functionally useless outside of AnyCalendar.

Parts:

Manishearth commented 1 year ago

There's a PR open for Chinese (https://github.com/unicode-org/icu4x/pulls), but islamic/hebrew also need similar work done.

sffc commented 1 year ago

Name of the constructor that does not load data:

Discussion:

Conclusion: new_always_calculating() for the no-data constructor. new() is the constructor with compiled data.

Next steps:

It would be nice to have Chinese ready with the pre-calculated caches in 1.3 but does not need to block.

LGTM: @Manishearth @sffc @robertbastian

Manishearth commented 1 year ago

4051 solves this for the milestone, kicking the can to 1.4

Manishearth commented 11 months ago

The plan is:

One thing I'm not sure about is if YearInfo should ever differ from what is stored in precomputed data. Currently the YearInfo equivalent in tree for chinese is:

pub(crate) struct ChineseBasedCache {
    pub(crate) new_year: RataDie,
    pub(crate) next_new_year: RataDie,
    pub(crate) leap_month: Option<NonZeroU8>,
}

whereas the precomputed cache is:

pub(crate) struct ChineseBasedCompiledData {
    pub(crate) new_year: RataDie,
    /// last_day_of_month[12] = last_day_of_month[11] in non-leap years
    /// These days are 1-indexed: so the last day of month for a 30-day 一月 is 30
    /// The array itself is zero-indexed, be careful passing it self.0.month!
    last_day_of_month: [u16; 13],
    ///
    pub(crate) leap_month: Option<NonZeroU8>,
}

(This packs down to three bytes in PackedChineseBasedCompiledData)

If we are going to store such data live, we might as well store PackedChineseBasedCompiledData everywhere, which simplifies the architecture down to "load PackedChineseBasedCompiledData into YearInfo, if you can't load from precomputed data then compute on your own". In the long run this may mean we don't need to pass in PrecomputedData everywhere, though in the short run I'd like to grant us that freedom.

However, if we make that call now, we can avoid having to pass in the compiled data for everything other than the until() and offset_days() which is nice.

Manishearth commented 11 months ago

Discussed a bit with @sffc:

In general we think that YearInfo and PrecomputedData will not need to differ on the data stored per-year (they may differ on the data storage format).

This means that ArithmeticDate only needs to worry about PrecomputedData for the offset/until functions, which is a far lower footprint.

The design becomes:

The main downside here is that when working without loaded data, it may be slower to calculate the YearInfo for an entire year since we will have to calculate things for multiple months at a time. This is not a cost we ever have to pay right now. On the other hand, it will probably amortize nicely.

Another downside is that this will increase the size of Date whenever our largest calendar decides to cache more things. This is a price we think is ok to pay since Date is mostly ephemeral.

Manishearth commented 10 months ago

ugh, found some gnarly stuff that will not be easy to optimize

https://github.com/unicode-org/icu4x/blob/c755786f6482daf0fb277ffdad001a53950082b3/components/calendar/src/chinese.rs#L311

This needs the number of days in the previous year. We might just also cache that but it's suboptimal, it's only needed for the rather rare case of week-of-year formatting, but it needs to be computed mandatorily when formatting

Manishearth commented 9 months ago

@sffc We need to make a decision based on the benchmarking results:

Consider adding a new() and deprecating new_calculating() (and the try_new...with_calendar(), if we are definitely not precomputing (otherwise add precomputing)

For background, currently Hebrew must be constructed with new_always_calculating()

Proposal: We commit to never precomputing, add ::new(), make Hebrew constructable as a singleton struct, and deprecate the old APIs. For Hebrew only.

Approval:

sffc commented 9 months ago

Does keviyah work over all dates or are there cases where we would need to fall back to the book calculations? If keviyah is a true drop-in replacement, then yeah, we can just use it all the time and make Hebrew independent of data.

Manishearth commented 9 months ago

All dates (in range, which is a range of ~i32 years I think: check the code)

Prior versions of the Hebrew calendar used slightly different four gates tables, Adjler documents them all, but the book does not handle that either, it implements the modern set of calculations.

The only question here is if it is fast enough for us to never want to add data loading. The benchmarks convince me but they are definitely at a level where I would not be surprised if others were not convinced.

Manishearth commented 6 months ago

I've started working on the islamic ones. My current data model is to store month length booleans and then use the remaining 4 bits to store a new years offset from the mean synodic new year (year * 12 * MEAN_SYNODIC_MONTH). It's unlikely that that number will be super large.