unicode-org / icu4x

Solving i18n for client-side and resource-constrained environments.
https://icu4x.unicode.org
Other
1.35k stars 174 forks source link

Remove allocating datetime skeletons (DateSkeletonPatternsV1, datetime/skeletons@1) #1678

Closed Manishearth closed 3 days ago

Manishearth commented 2 years ago

part of https://github.com/unicode-org/icu4x/issues/876, part of https://github.com/unicode-org/icu4x/issues/856

DateSkeletonPatternsV1 is one of our final remaining non-zero-copy types. It's a rather complicated tree:

pub struct DateSkeletonPatternsV1<'data>(
    pub LiteMap<SkeletonV1, PatternPlurals<'data>>,
);

pub struct SkeletonV1(pub Skeleton);
pub struct Skeleton(pub(crate) SmallVec<[fields::Field; 5]>);

// Has a ULE type already
pub struct Field {
    pub symbol: FieldSymbol,
    pub length: FieldLength,
}

pub enum PatternPlurals<'data> {
    /// A collection of pattern variants for when plurals differ.
    MultipleVariants(PluralPattern<'data>),
    /// A single pattern.
    SinglePattern(Pattern<'data>),
}

pub struct PluralPattern<'data> {
    /// The field that 'variants' are predicated on.
    pivot_field: Week,
    pub(crate) zero: Option<Pattern<'data>>,
    pub(crate) one: Option<Pattern<'data>>,
    pub(crate) two: Option<Pattern<'data>>,
    pub(crate) few: Option<Pattern<'data>>,
    pub(crate) many: Option<Pattern<'data>>,
    pub(crate) other: Pattern<'data>,
}

pub struct Pattern<'data> {
    pub items: ZeroVec<'data, PatternItem>,
    pub(crate) time_granularity: TimeGranularity,
}

There are two parts to this: firstly, the skeleton needs to be made zero-copy, and then PatternPlurals. Both need to be VarULE or ULE to work inside a ZeroMap.

Skeleton

Skeleton seems easy, we replace Skeleton with #[make_varule] struct Skeleton(ZeroVec<'data, Fields>). I'm a bit worried that this will make lookup rather slow (since Ord will be slower): but also as far as I can tell, we never use the LiteMap as an actual map at use time, we only iterate the map in order.

Another option is to replace Skeleton with the unparsed skeleton string.

PatternPlurals

I'd rather not write a custom ULE type here, but ultimately, we can. I do think, however, we can get most of the benefits by restructuring this type a bit.

Basically, this type can be structured as

#[make_varule]
struct  PatternPlurals<'data> {
    pivot_field: Option<Week>, // this needs a ULE impl for Option, which can be done.
    patterns: VarZeroVec<'data, VarZeroSlice<Option<Pattern>>>,
}

#[make_varule]
pub struct Pattern<'data> {
    pub(crate) time_granularity: TimeGranularity,
    pub items: ZeroVec<'data, PatternItem>,
}

We will need an AsULE implementation on Option<T: AsULE> as well as a VarULE implementation on Option<T: VarULE> in cases where we can guarantee that T never has length zero (this can be done with an additional trait).

Then, PatternPlurals::patterns becomes a vector that must have at least one non-None element in the beginning, and the rest of the elements can be nonexistent or None (or we can guarantee that it either has length 1 or length 6).

Thoughts?

Feedback requested:

sffc commented 2 years ago

I wish we could avoid making PatternPlurals a ULE.

This part of the datetime provider may be changing as part of #1317. In that issue, the skeletons will no longer be an unlimited map; there may be a limited number of possible skeletons, such that we can just make a big struct and let Postcard do the work instead of ZeroVec.

As a general rule, if we start having very deeply nested ZeroMaps, we should rethink the architecture of the data payload. Putting more into the Postcard part of the deserialization is also better for CrabBake, since CrabBake can bypass Postcard but it can't bypass ZeroVec.

Manishearth commented 2 years ago

As a general rule, if we start having very deeply nested ZeroMaps, we should rethink the architecture of the data payload. Putting more into the Postcard part of the deserialization is also better for CrabBake, since CrabBake can bypass Postcard but it can't bypass ZeroVec.

I strongly agree with this. Though one hazard with making it a big struct is that adding fields in the future is a breaking change. We could still make that work with an array but we're back to nesting zerovecs and needing ULEs.

I accidentally posted this issue before I was done but I was considering advocating for a case where we do more of the parsing at formatter construction time instead of mandating it be zero-copy. But if this is changing in https://github.com/unicode-org/icu4x/issues/1317, that works too.

If we're going for a big struct, we can block this on #1317 and move on. @gregtatum , do you have an idea of how long #1317 will take? Moving everything to zero-copy does block CrabBake, and while we can have components opt out of CrabBake I think it would be nice if DTF didn't need to.

sffc commented 2 years ago

Moving everything to zero-copy does block CrabBake, and while we can have components opt out of CrabBake I think it would be nice if DTF didn't need to.

My preference for a short-term solution, if we need it, would be to put skeletons behind a feature flag, so that DTF with default features is zero-copy.

Manishearth commented 2 years ago

Yeah I think I can just make it so that CrabBake's provider is unable to provide certain keys

gregtatum commented 2 years ago

Another option is to replace Skeleton with the unparsed skeleton string.

I'm not sure with Zero copy if this adds anything, as you would then have to parse the skeleton, and do similar validation work. I'm not as familiar with the implementation details of zero copy to really speak with expertise on it though.

Manishearth commented 2 years ago

I'm not sure with Zero copy if this adds anything, as you would then have to parse the skeleton, and do similar validation work. I'm not as familiar with the implementation details of zero copy to really speak with expertise on it though.

Yep; we don't get any zc perf benefit, it just enables crab-bake (which requires everything to be zerocopy)

Manishearth commented 2 years ago

I would also prefer to not go down the "data zero-copy but we allocate afterwards" route though

zbraniecki commented 2 years ago

that looks good to me. I'd suggest making sure that the PatternPlurals encapsulation is really good so that even internal consumers never think of it as an array and never have to reason about which plural category 3rd element is.

gregtatum commented 2 years ago

it just enables crab-bake

In that case that seems fine to me.

nordzilla commented 2 years ago

LGTM after reading the comments. I don't think I have anything else to add.

robertbastian commented 3 months ago

This needs to be removed

sffc commented 5 days ago

The data has been removed, and I'm removing the marker in https://github.com/unicode-org/icu4x/pull/5612

To track a longer-term home for these things, I created https://github.com/unicode-org/icu4x/issues/5611