Trie-based normalization passthrough is slower than in ICU4C

hsivonen commented 2 years ago

norm_bench compares the normalization performance of ICU4X against other implementations, including ICU4C. The test strings are relative long (multiple memory pages). To avoid testing rust_icu buffer sizing logic, you need the rawdec branch of my fork of rust_icu.

With #2378 applied, English UTF-16 NFC to NFC and NFD to NFD show ICU4X performing a bit better than ICU4C, so the Latin passthrough comparison works as intended. (Both ICU4X and ICU4C pass through based on comparing with a boundary value.)

However, since Hanzi is normalization-invariant, ICU4X and ICU4C should both stay on a fast-path loop with one trie lookup per character such that the trie lookup yields the default value for the trie. Yet, ICU4X is slower than ICU4C for Chinese (UTF-16 NFC to NFC, UTF-16 NFD to NFD). The difference is not about trie type fast vs. small. With #2410 applied, the difference shouldn't be about ICU4C using macros for the trie fast-path.

Possible differences:

Perhaps on this level of loop tightness, branching on the normalization supplement presence discriminant (for seeing if there's a UTS46 or K normalization supplement) is significant.
Perhaps on this level of loop tightness, branching on the ZeroVec borrow vs. owned discriminant is significant.
Perhaps on this level of loop tightness, branching on Hangul range check is significant.
Perhaps all the above combined?
Perhaps the CPU speculates into the weeds on the surrogate handling path instead of full CPU capability going to the BMP case? (I tried putting an inline(never) hurdle on the surrogate path to discourage speculation that way, but that didn't help.)
The trie uses two ZeroVecs such that the index for the reading from the "data" ZeroVec depends on what was read from the "index" ZeroVec. Perhaps unaligned-capable access from the "index" ZeroVec interferes with the CPU speculatively performing the "data" ZeroVec access.
ICU4X uses 32-bit trie values while ICU4C uses 16-bit trie values, so ICU4X might have a larger working set for the trie. However, since Hanzi should always resolve to the default value of the trie without reading all over the "data" ZeroVec, the actual hot working set difference seems implausible as a cause.
Upon inspecting the trie value received, ICU4C seems to have less branchy code than ICU4X for making the decision to keep looping in the fast-path loop. However, for Hanzi, ICU4X should make the decision in the first branch of the branchy code.
Optimizations happening in an unlucky way. (Getting the Latin passthrough in ICU4X to perform as well as ICU4C was surprisingly sensitive to compiler optimizations and memory copy working set sizes going the right way: Details of code seemingly outside the tightest loop had a notable impact on the performance of the tightest loop.)

sffc commented 2 years ago

Is this fixed by #2378?

hsivonen commented 2 years ago

Is this fixed by #2378?

No, this is a follow-up to that one.

sffc commented 2 years ago

OK, can we assign it to a milestone then? My suggestions:

If this is something you plan to work on this year or early next year, put it in 1.1 or 1.2
If this is something we should definitely fix, but which is currently unscheduled, put it in 1.x
If this requires API changes, put it in 2.0
If this is an enhancement that could be fixed at some point with lower priority, label it as "backlog"

hsivonen commented 2 years ago

Perhaps on this level of loop tightness, branching on the ZeroVec borrow vs. owned discriminant is significant.

This appear to be the case especially when the queried-for character doesn't have an entry in the trie. See #2554.

Perhaps the CPU speculates into the weeds on the surrogate handling path instead of full CPU capability going to the BMP case?

Non-BMP support's existence seems expensive when normalizing BMP content to NFD, but I don't see a similar effect for NFC. I don't understand why. I tested this by removing non-BMP handling from the fast-path loops. However, I wasn't creative enough to work non-BMP handing back in such a way that wouldn't have been even worse than the current form.

glandium commented 2 years ago

Stupid question: in your test, is icu4c compiled with GCC or clang? If clang, is it the same version as rustc's LLVM?

glandium commented 2 years ago

Also: if clang with the same version as rustc's LLVM, is it at the same optimization level?

hsivonen commented 2 years ago

I have tried distro-provided ICU4C 70, which is presumably compiled with GCC, ICU 72.1 with clang 14 from mach bootstrap, and ICU 72.1 with clang 15 from apt.llvm.org. In all cases, Chinese passthrough (which is the first item to investigate) is slower in ICU4X than in ICU4C. So I have tried the same LLVM major version (15) for both but not the same exact LLVM revision.

In the clang case, I used whatever ICU4C's build system does out-of-the-box for release mode and whatever cargo criterion does out-of-the-box. So, no, I have not verified the use of the same LLVM optimization level in both cases.

hsivonen commented 2 years ago

ICU4C build system used -O3 with clang. Last I checked, which was a long time ago, opt_level=3 was the cargo default, but I haven't checked whether cargo criterion overrides it.

hsivonen commented 2 years ago

It sure looks a lot like the Rust code should have gotten compiled with opt_level=3 as well.

hsivonen commented 2 years ago

However, for Hanzi, ICU4X should make the decision in the first branch of the branchy code.

It would be worthwhile to check that the branchy code is compiled with check1-return, check2-return, etc. and not compiled into combining all the checks and then returning.

hsivonen commented 2 years ago

It turns out that I was running benchmarks without Rust-side LTO. LTO improves perf for the other Rust crates, generally improves ICU4X perf in cases where the trie returns non-default values often, and regresses ICU4X performance for cases that stay on Latin passthrough across multiple characters but still go through the trie relatively often.

However, the case that should be investigated first, i.e. ICU4X UTF-16 Chinese passthrough based on the trie returning default values, regresses with LTO.

hsivonen commented 2 years ago

Updated the results to use Rust-side LTO. Moved and linked the old results.

unicode-org / icu4x

Trie-based normalization passthrough is slower than in ICU4C #2431