Convert `width` to use match rather than manual if tree.

resistor commented 2 years ago

This improves the performance of tests::cargo benchmark from ~3800 ns/iter to ~3200 ns/iter on my laptop. It also improves the performance of uutils wc -L on a sample Unicode file by 10%.

resistor commented 2 years ago

Interestingly, this appears to perform worse on enwik8 and jawiki. Maybe something about the mix of code points?

Manishearth commented 2 years ago

I'm pretty sure this was a deliberate change to improve perf for non-ascii

https://github.com/unicode-rs/unicode-width/pull/28/files#diff-b3def89f695aabc248655f61e7a81110db7ad016387c55260ec579bdc3191e0cL41 (cc @mjguynn )

mjguynn commented 2 years ago

Here's a comparison between the compiled methods. https://godbolt.org/z/7dKKqxj91

After compilation, control flow for the fast path roughly looks like this:

if tree

// the majority of text is ascii, so this branch should be predicted most of the time
if c >= '\u{7F}' {
  // these characters are pretty rare, so prediction should be very accurate
  if c < '\u{A0}' {
    return None;
  }
  return Some(lookup_width(c, is_cjk));
}
// this branch is highly unpredictable since spaces, tabs, newlines are so common
if c >= '\u{20}' {
  return Some(1);
}
// in the generated assembly, this one is branchless
if c == '\u{0}' { return Some(0); }
else { return None; }

match

// null characters seem pretty rare in text, so I'd expect this to be almost always predicted
if c == '\u{0}' {
  return Some(0);
}
// this branch is highly unpredictable since spaces, tabs, newlines are so common
if c < '\u{20}' {
  return None;
}
// the majority of text is ascii, so this branch should be predicted most of the time
if c >= '\u{7F}' {
  // these characters are pretty rare, so prediction should be very accurate
  if c >= '\u{A0}' {
    return Some(lookup_width(c, is_cjk));
  }
  return None;
}
return Some(1);

Using an if tree, the relatively predictable \u{7F} branch occurs before the relatively unpredictable \u{20} branch. Using a match statement, the situation is reversed, and we have a relatively predictable branch after a relatively unpredictable one. I suspect this is why the if tree performs better on the Wikipedia samples. However, tests::cargo is just the same character 4096 times, so the branch predictors will be ~100% accurate. IIRC a branch can be predicted 100% of the time is faster than a branchless comparison, so that explains why the match statement is faster on the cargo test -- it uses a 100% predictable branch for checking whether the character is less than zero while the if tree uses a branchless method. (Edit: nope, because the branchless comparison isn't even reached on that test case. What was I thinking?) (Take this analysis with a grain of salt since I only checked the width function in isolation, inlining it into fold might change the performance characteristics.)

EDIT 2: Making the \u{20} comparison branchess actually harms performance, so perhaps that branch is more predictable than I realize, or maybe the branch is better inlined into string width. Again, take my analysis with a grain of salt.

resistor commented 2 years ago

For what it's worth, my motivating use case was wc -L (max line length counting) on uutils wc, using a test case consisting of 1024 copies of the Odyssey in the original Greek, so definitely not an ASCII file. I am able to measure an approximate 10% runtime reduction from using the match statement in that scenario.

unicode-rs / unicode-width

Convert `width` to use match rather than manual if tree. #30