unicode-org / icu4x

Solving i18n for client-side and resource-constrained environments.
https://icu4x.unicode.org
Other
1.33k stars 174 forks source link

Polish UnicodeSet builder API #3550

Open robertbastian opened 1 year ago

robertbastian commented 1 year ago

cc @skius

skius commented 1 year ago

Context: The reason why our usual load-data-in-constructor pattern doesn't work here is that currently, the internal functions I'm using (load_for_ecma262_unstable, load_general_category, ..) accept the DataProvider and not the payload.

skius commented 1 year ago

We also need to decide how "overescaped" inputs should be handled, i.e., the kinds of inputs that exemplar characters is dealing with.

Thought: I believe allowing users to supply their own iterators would solve this issue with minimal effort on their and our side

skius commented 1 year ago

An alternative API to the free parse function:

let builder = UnicodeSetBuilder::new(provider);
let set = builder.parse(source, options)?;

Furthermore, the transliterator would benefit from a function that either takes an Iterator, or returns how much of the source string was actually parsed. That allows the transliterator parser to hand over to the UnicodeSet parser as soon as it reaches a [, and then continue from where the UnicodeSet parser left off after a successful UnicodeSet parse.

sffc commented 1 year ago

Another option (not necessarily recommending it) would be to do like we do in TimeZoneFormatter, where we have a bunch of Option<DataPayload<_>> for the different keys we may or may not need, and then a bunch of little load functions that fill in each key. In our case, the parse function could fail with a list of keys that need to be loaded or something like that. We could also have a parse function that takes &mut self and fills in the payloads as needed.

skius commented 1 year ago

That sounds a bit like a caching provider, or is a caching layer too complex if each provider individually is "optional"? (I'm not too sure, but I feel like such an implementation might need a type-based map (eg BidiControlV1Marker -> Option<DataPayload<BidiControlV1Marker>>))

sffc commented 1 year ago

I'd characterize it less as "caching" and more as "lazy loading" but you're correct.

skius commented 1 year ago

If we had a struct LazyProvider<P: DataProvider<M: DataMarker>> that impls DataProvider<M> for LazyProvider<P> that does the lazy-loading/caching you describe, we could keep all APIs the same but would avoid multiple calls to the underlying (potentially expensive) DataProvider::load function.

Do you think that's more expensive than TimeZoneFormatter's explicit approach? I feel like it just boils down to being able to generically replace TimeZoneDataPayloads

sffc commented 1 year ago

Discuss with:

Optional:

sffc commented 1 year ago
  1. Eagerly load all of the property data when constructing a UnicodeSet Parser
  2. Have slots in the UnicodeSet parser that are lazily loaded / cached
  3. Always load the properties on demand when they are needed
pub struct PropertyNameToSetMapper { ... }
impl PropertyNameToSetMapper {
    pub fn try_new_for_ecma262(...) -> Self { ... }
    pub fn try_new_all(...) -> Self { ... }

    pub fn load_for_property(&self, name: &str) -> CodePointSetDataBorrowed { ... }
    pub fn load_for_property_with_value(&self, name: &str, value: &str) -> CodePointSetDataBorrowed { ... }
    // also strict/loose versions of those functions
}
pub trait PropertyNameToSetMapperLike {
    type Error: Debug;
    fn load_for_property(&self, name: &str) -> Result<CodePointSetDataBorrowed, Error> { ... }
    fn load_for_property_with_value(&self, name: &str, value: &str) -> Result<CodePointSetDataBorrowed, Error> { ... }
   // with the strict and loose
}

LGTM: @Manishearth @sffc @robertbastian @skius @eggrobin

Next: free function?

LGTM: @Manishearth @skius @younies @robertbastian @eggrobin @sffc

Any additional feedback on this, @echeran?