Open echeran opened 2 years ago
What is the concrete benefit? You can already write
cpt.iter_ranges(v).map(|range| /* transform here */)
What is the advantage of changing it to
cpt.map_values(|v| /* transform here */)
Is there a performance benefit? What is it?
Each time you enter the get_range() function it has to do one-time setup, like an expanded version of a trie lookup. Then it continues visiting code points individually and by blocks, taking measures to minimize lookup indexing and reading of trie data values.
If you supply it with a ValueFilter (really a mapper I guess but this is what it's called), then get_range() itself will coalesce adjacent ranges that end up with the same effective value, and will continue with the inner loop. If you filter & map on the outside, you get more get_range() calls, losing some of those optimizations.
Blocked on adding benchmarks. @echeran can you link an issue for the benchmarks?
Blocked on #1946
I'm just adding a separate method in https://github.com/unicode-org/icu4x/pull/3198 that explicitly takes a predicate to filter by.
I have marked that as fixing this PR but it does not do value transformation, only filtering: I think that covers most use cases, but perhaps not all of them?
It is useful for the callback to be able to do multiple things. Examples from ICU4C/J:
1<<gc
), and intersecting that with the multi-bit mask for L.
More complex predicates. "Get the ranges where gc=L" involves taking those bits, using them to create a single-bit mask (
1<<gc
), and intersecting that with the multi-bit mask for L.Another lookup example: "Get the ranges for scx=Cyrl" either compares with the Script in the trie value, or uses the trie value as an index in a list of Script codes where it performs a contains() search.
So the predicate thing already works for these two since these two just need ranges (not ranges + value)
For the other two, that makes sense. I might switch my impl over to that?
I don't actually think we need benchmarks here; because there's an argument to be made without benchmarks that people may want and even expect their ranges to be maximal.
In ICU's implementation of
getRange()
for a CodePointTrie, it allows an extra arg that is a function that transforms the trie values before computing ranges of code points sharing the same value (calledValueFilter
in ICU4J, calledUCPMapValueFilter
in ICU4C). ICU uses this in various ways, such as for pulling out data for a single property out of a trie that efficiently stores data for multiple properties at once.The port of
getRange()
in #1529 omits this argument, which returns the trie values as-is. Adding this functionality could have multiple benefits.One thing that might need to be considered is the type of the transformation function arg (
Fn
trait?) and whether that has any implications (ex: FFI) that may have follow-on considerations.@markusicu