unicode-org / icu4x

Solving i18n for client-side and resource-constrained environments.
https://icu4x.unicode.org
Other
1.37k stars 176 forks source link

Trie-based normalization passthrough is slower than in ICU4C #2431

Open hsivonen opened 2 years ago

hsivonen commented 2 years ago

norm_bench compares the normalization performance of ICU4X against other implementations, including ICU4C. The test strings are relative long (multiple memory pages). To avoid testing rust_icu buffer sizing logic, you need the rawdec branch of my fork of rust_icu.

With #2378 applied, English UTF-16 NFC to NFC and NFD to NFD show ICU4X performing a bit better than ICU4C, so the Latin passthrough comparison works as intended. (Both ICU4X and ICU4C pass through based on comparing with a boundary value.)

However, since Hanzi is normalization-invariant, ICU4X and ICU4C should both stay on a fast-path loop with one trie lookup per character such that the trie lookup yields the default value for the trie. Yet, ICU4X is slower than ICU4C for Chinese (UTF-16 NFC to NFC, UTF-16 NFD to NFD). The difference is not about trie type fast vs. small. With #2410 applied, the difference shouldn't be about ICU4C using macros for the trie fast-path.

Possible differences:

sffc commented 2 years ago

Is this fixed by #2378?

hsivonen commented 2 years ago

Is this fixed by #2378?

No, this is a follow-up to that one.

sffc commented 2 years ago

OK, can we assign it to a milestone then? My suggestions:

hsivonen commented 2 years ago

Perhaps on this level of loop tightness, branching on the ZeroVec borrow vs. owned discriminant is significant.

This appear to be the case especially when the queried-for character doesn't have an entry in the trie. See #2554.

Perhaps the CPU speculates into the weeds on the surrogate handling path instead of full CPU capability going to the BMP case?

Non-BMP support's existence seems expensive when normalizing BMP content to NFD, but I don't see a similar effect for NFC. I don't understand why. I tested this by removing non-BMP handling from the fast-path loops. However, I wasn't creative enough to work non-BMP handing back in such a way that wouldn't have been even worse than the current form.

glandium commented 2 years ago

Stupid question: in your test, is icu4c compiled with GCC or clang? If clang, is it the same version as rustc's LLVM?

glandium commented 2 years ago

Also: if clang with the same version as rustc's LLVM, is it at the same optimization level?

hsivonen commented 2 years ago

I have tried distro-provided ICU4C 70, which is presumably compiled with GCC, ICU 72.1 with clang 14 from mach bootstrap, and ICU 72.1 with clang 15 from apt.llvm.org. In all cases, Chinese passthrough (which is the first item to investigate) is slower in ICU4X than in ICU4C. So I have tried the same LLVM major version (15) for both but not the same exact LLVM revision.

In the clang case, I used whatever ICU4C's build system does out-of-the-box for release mode and whatever cargo criterion does out-of-the-box. So, no, I have not verified the use of the same LLVM optimization level in both cases.

hsivonen commented 2 years ago

ICU4C build system used -O3 with clang. Last I checked, which was a long time ago, opt_level=3 was the cargo default, but I haven't checked whether cargo criterion overrides it.

hsivonen commented 2 years ago

It sure looks a lot like the Rust code should have gotten compiled with opt_level=3 as well.

hsivonen commented 2 years ago

However, for Hanzi, ICU4X should make the decision in the first branch of the branchy code.

It would be worthwhile to check that the branchy code is compiled with check1-return, check2-return, etc. and not compiled into combining all the checks and then returning.

hsivonen commented 2 years ago

It turns out that I was running benchmarks without Rust-side LTO. LTO improves perf for the other Rust crates, generally improves ICU4X perf in cases where the trie returns non-default values often, and regresses ICU4X performance for cases that stay on Latin passthrough across multiple characters but still go through the trie relatively often.

However, the case that should be investigated first, i.e. ICU4X UTF-16 Chinese passthrough based on the trie returning default values, regresses with LTO.

hsivonen commented 2 years ago

Updated the results to use Rust-side LTO. Moved and linked the old results.