For characters that are their own decomposition, the least significant bit signifies "can combine backwards". As of Unicode 16, this information is also needed for complex decompositions, but the same bit was already taken, so the second-least-significant bit is used (by #4860).
Investigate the performance impact of flipping around the two bit allocations for complex decompositions and unifying the "can combine backwards" bit check.
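To make the proposed unification concrete, here is a minimal sketch assuming the flag layout described above; the constant names, the `is_complex` parameter, and the function signatures are invented for illustration and are not the actual icu_normalizer internals.

```rust
// Hypothetical flag layout, following the description above: for characters
// that are their own decomposition, "can combine backwards" sits in the LSB;
// for complex decompositions (Unicode 16, #4860) it sits in the second LSB.
const CAN_COMBINE_BACKWARDS: u32 = 1 << 0;
const COMPLEX_CAN_COMBINE_BACKWARDS: u32 = 1 << 1;

/// Current shape: which bit to test depends on the kind of decomposition
/// the trie value describes, so the check has to branch.
fn can_combine_backwards_current(trie_value: u32, is_complex: bool) -> bool {
    if is_complex {
        (trie_value & COMPLEX_CAN_COMBINE_BACKWARDS) != 0
    } else {
        (trie_value & CAN_COMBINE_BACKWARDS) != 0
    }
}

/// Proposed shape: if the complex-decomposition allocation were flipped so
/// both kinds of trie value carry "can combine backwards" in the same bit,
/// the check collapses to a single mask test with no branch.
fn can_combine_backwards_unified(trie_value: u32) -> bool {
    (trie_value & CAN_COMBINE_BACKWARDS) != 0
}

fn main() {
    // Toy values only, to show the two checks express the same intent.
    assert!(can_combine_backwards_current(0b01, false));
    assert!(can_combine_backwards_current(0b10, true));
    assert!(can_combine_backwards_unified(0b01));
}
```

Flipping the allocations would remove the branch on the decomposition kind from this check; whether that matters in practice is what the benchmark comparison would need to show.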
Estimation of 2.0 status: is there time to land the normalization performance work?
@hsivonen Is there time to land the normalization data struct performance improvements?
@Manishearth Next two weeks... I think we're very close. We've been chipping away at the small things. I'm going by the plan for beta.
@sffc I think a data struct change could still land in 2.0 final. It doesn't need to be in beta.
@Manishearth Can you describe the nature of the changes?
@hsivonen (1) The decomposition doesn't have a bit to say if there is ... (2) the trivial bit. (3) The K normalizations are supplementary tries as opposed to duplicated data. (4) Checks eagerly for Hangul instead of having a trie value for it (see the sketch after this discussion). It would be nice to make these better.
@Manishearth For changing data for 2.0 final, we can mostly do that. 2.0 beta is more or less complete. We have until mid-December for 2.0 final.
@sffc If you land those things sometime in Q4, there is a high chance that we can get them in.
@hsivonen I only expect the controversy to be the change to the data struct. Other than that, I expect it to be, "benchmarks are improved".
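Regarding item (4) in the list above, here is a minimal sketch of what an eager Hangul check looks like, as opposed to reading a trie value. The function names are hypothetical; only the Hangul syllable constants (base U+AC00, 11172 precomposed syllables) come from Unicode itself.

```rust
// Hypothetical illustration of an "eager" Hangul syllable test performed
// before any trie lookup; not the actual icu_normalizer code.
const HANGUL_S_BASE: u32 = 0xAC00; // first precomposed Hangul syllable
const HANGUL_S_COUNT: u32 = 11172; // 19 * 21 * 28 precomposed syllables

/// Range check taken for every character before the trie is consulted.
fn is_hangul_syllable(c: char) -> bool {
    (c as u32).wrapping_sub(HANGUL_S_BASE) < HANGUL_S_COUNT
}

fn main() {
    assert!(is_hangul_syllable('\u{AC00}')); // 가
    assert!(!is_hangul_syllable('a'));
}
```

The alternative mentioned in the discussion would be to encode the Hangul case as a trie value so the hot path only performs the trie lookup; which of the two is faster is the kind of question the requested benchmarking would answer.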