Closed resistor closed 1 year ago
Interestingly, this appears to perform worse on enwik8
and jawiki
. Maybe something about the mix of code points?
I'm pretty sure this was a deliberate change to improve perf for non-ascii
https://github.com/unicode-rs/unicode-width/pull/28/files#diff-b3def89f695aabc248655f61e7a81110db7ad016387c55260ec579bdc3191e0cL41 (cc @mjguynn )
Here's a comparison between the compiled methods. https://godbolt.org/z/7dKKqxj91
After compilation, control flow for the fast path roughly looks like this:
if tree
// the majority of text is ascii, so this branch should be predicted most of the time
if c >= '\u{7F}' {
// these characters are pretty rare, so prediction should be very accurate
if c < '\u{A0}' {
return None;
}
return Some(lookup_width(c, is_cjk));
}
// this branch is highly unpredictable since spaces, tabs, newlines are so common
if c >= '\u{20}' {
return Some(1);
}
// in the generated assembly, this one is branchless
if c == '\u{0}' { return Some(0); }
else { return None; }
match
// null characters seem pretty rare in text, so I'd expect this to be almost always predicted
if c == '\u{0}' {
return Some(0);
}
// this branch is highly unpredictable since spaces, tabs, newlines are so common
if c < '\u{20}' {
return None;
}
// the majority of text is ascii, so this branch should be predicted most of the time
if c >= '\u{7F}' {
// these characters are pretty rare, so prediction should be very accurate
if c >= '\u{A0}' {
return Some(lookup_width(c, is_cjk));
}
return None;
}
return Some(1);
Using an if tree, the relatively predictable \u{7F}
branch occurs before the relatively unpredictable \u{20}
branch. Using a match statement, the situation is reversed, and we have a relatively predictable branch after a relatively unpredictable one. I suspect this is why the if tree performs better on the Wikipedia samples. However, (Edit: nope, because the branchless comparison isn't even reached on that test case. What was I thinking?) (Take this analysis with a grain of salt since I only checked the width function in isolation, inlining it into fold might change the performance characteristics.)tests::cargo
is just the same character 4096 times, so the branch predictors will be ~100% accurate. IIRC a branch can be predicted 100% of the time is faster than a branchless comparison, so that explains why the match statement is faster on the cargo test -- it uses a 100% predictable branch for checking whether the character is less than zero while the if tree uses a branchless method.
EDIT 2: Making the \u{20}
comparison branchess actually harms performance, so perhaps that branch is more predictable than I realize, or maybe the branch is better inlined into string width. Again, take my analysis with a grain of salt.
For what it's worth, my motivating use case was wc -L
(max line length counting) on uutils wc
, using a test case consisting of 1024 copies of the Odyssey in the original Greek, so definitely not an ASCII file. I am able to measure an approximate 10% runtime reduction from using the match statement in that scenario.
This improves the performance of tests::cargo benchmark from ~3800 ns/iter to ~3200 ns/iter on my laptop. It also improves the performance of uutils
wc -L
on a sample Unicode file by 10%.