unicode-org / icu4x

Solving i18n for client-side and resource-constrained environments.
https://icu4x.unicode.org
Other
1.37k stars 176 forks source link

Allow CPT get_range() accept a value transformation fn as arg #1532

Open echeran opened 2 years ago

echeran commented 2 years ago

In ICU's implementation of getRange() for a CodePointTrie, it allows an extra arg that is a function that transforms the trie values before computing ranges of code points sharing the same value (called ValueFilter in ICU4J, called UCPMapValueFilter in ICU4C). ICU uses this in various ways, such as for pulling out data for a single property out of a trie that efficiently stores data for multiple properties at once.

The port of getRange() in #1529 omits this argument, which returns the trie values as-is. Adding this functionality could have multiple benefits.

One thing that might need to be considered is the type of the transformation function arg (Fn trait?) and whether that has any implications (ex: FFI) that may have follow-on considerations.

@markusicu

sffc commented 2 years ago

What is the concrete benefit? You can already write

cpt.iter_ranges(v).map(|range| /* transform here */)

What is the advantage of changing it to

cpt.map_values(|v| /* transform here */)

Is there a performance benefit? What is it?

markusicu commented 2 years ago

Each time you enter the get_range() function it has to do one-time setup, like an expanded version of a trie lookup. Then it continues visiting code points individually and by blocks, taking measures to minimize lookup indexing and reading of trie data values.

If you supply it with a ValueFilter (really a mapper I guess but this is what it's called), then get_range() itself will coalesce adjacent ranges that end up with the same effective value, and will continue with the inner loop. If you filter & map on the outside, you get more get_range() calls, losing some of those optimizations.

sffc commented 2 years ago

Blocked on adding benchmarks. @echeran can you link an issue for the benchmarks?

sffc commented 2 years ago

Blocked on #1946

Manishearth commented 1 year ago

I'm just adding a separate method in https://github.com/unicode-org/icu4x/pull/3198 that explicitly takes a predicate to filter by.

I have marked that as fixing this PR but it does not do value transformation, only filtering: I think that covers most use cases, but perhaps not all of them?

markusicu commented 1 year ago

It is useful for the callback to be able to do multiple things. Examples from ICU4C/J:

Manishearth commented 1 year ago
  • More complex predicates. "Get the ranges where gc=L" involves taking those bits, using them to create a single-bit mask (1<<gc), and intersecting that with the multi-bit mask for L.

  • Another lookup example: "Get the ranges for scx=Cyrl" either compares with the Script in the trie value, or uses the trie value as an index in a list of Script codes where it performs a contains() search.

So the predicate thing already works for these two since these two just need ranges (not ranges + value)

For the other two, that makes sense. I might switch my impl over to that?

I don't actually think we need benchmarks here; because there's an argument to be made without benchmarks that people may want and even expect their ranges to be maximal.