Open hsivonen opened 2 years ago
Is this fixed by #2378?
Is this fixed by #2378?
No, this is a follow-up to that one.
OK, can we assign it to a milestone then? My suggestions:
Perhaps on this level of loop tightness, branching on the
ZeroVec
borrow vs. owned discriminant is significant.
This appear to be the case especially when the queried-for character doesn't have an entry in the trie. See #2554.
Perhaps the CPU speculates into the weeds on the surrogate handling path instead of full CPU capability going to the BMP case?
Non-BMP support's existence seems expensive when normalizing BMP content to NFD, but I don't see a similar effect for NFC. I don't understand why. I tested this by removing non-BMP handling from the fast-path loops. However, I wasn't creative enough to work non-BMP handing back in such a way that wouldn't have been even worse than the current form.
Stupid question: in your test, is icu4c compiled with GCC or clang? If clang, is it the same version as rustc's LLVM?
Also: if clang with the same version as rustc's LLVM, is it at the same optimization level?
I have tried distro-provided ICU4C 70, which is presumably compiled with GCC, ICU 72.1 with clang 14 from mach bootstrap
, and ICU 72.1 with clang 15 from apt.llvm.org. In all cases, Chinese passthrough (which is the first item to investigate) is slower in ICU4X than in ICU4C. So I have tried the same LLVM major version (15) for both but not the same exact LLVM revision.
In the clang case, I used whatever ICU4C's build system does out-of-the-box for release mode and whatever cargo criterion does out-of-the-box. So, no, I have not verified the use of the same LLVM optimization level in both cases.
ICU4C build system used -O3
with clang. Last I checked, which was a long time ago, opt_level=3
was the cargo default, but I haven't checked whether cargo criterion overrides it.
It sure looks a lot like the Rust code should have gotten compiled with opt_level=3
as well.
However, for Hanzi, ICU4X should make the decision in the first branch of the branchy code.
It would be worthwhile to check that the branchy code is compiled with check1-return, check2-return, etc. and not compiled into combining all the checks and then returning.
It turns out that I was running benchmarks without Rust-side LTO. LTO improves perf for the other Rust crates, generally improves ICU4X perf in cases where the trie returns non-default values often, and regresses ICU4X performance for cases that stay on Latin passthrough across multiple characters but still go through the trie relatively often.
However, the case that should be investigated first, i.e. ICU4X UTF-16 Chinese passthrough based on the trie returning default values, regresses with LTO.
Updated the results to use Rust-side LTO. Moved and linked the old results.
norm_bench
compares the normalization performance of ICU4X against other implementations, including ICU4C. The test strings are relative long (multiple memory pages). To avoid testingrust_icu
buffer sizing logic, you need therawdec
branch of my fork ofrust_icu
.With #2378 applied, English UTF-16 NFC to NFC and NFD to NFD show ICU4X performing a bit better than ICU4C, so the Latin passthrough comparison works as intended. (Both ICU4X and ICU4C pass through based on comparing with a boundary value.)
However, since Hanzi is normalization-invariant, ICU4X and ICU4C should both stay on a fast-path loop with one trie lookup per character such that the trie lookup yields the default value for the trie. Yet, ICU4X is slower than ICU4C for Chinese (UTF-16 NFC to NFC, UTF-16 NFD to NFD). The difference is not about trie type fast vs. small. With #2410 applied, the difference shouldn't be about ICU4C using macros for the trie fast-path.
Possible differences:
ZeroVec
borrow vs. owned discriminant is significant.inline(never)
hurdle on the surrogate path to discourage speculation that way, but that didn't help.)ZeroVec
s such that the index for the reading from the "data"ZeroVec
depends on what was read from the "index"ZeroVec
. Perhaps unaligned-capable access from the "index"ZeroVec
interferes with the CPU speculatively performing the "data"ZeroVec
access.ZeroVec
, the actual hot working set difference seems implausible as a cause.