harfbuzz / harfbuzz

HarfBuzz text shaping engine
http://harfbuzz.github.io/

Speedup using vectorization #566

Closed behdad closed 2 years ago

behdad commented 7 years ago

It feels like we've hit the limit of making HarfBuzz faster using traditional methods. The next round of optimizations will come from using generic or architecture-specific SIMD vectorization intrinsics.

I have two ideas that are ripe with potential; one affects heavy fonts, the other light ones. So, combined, they can probably provide a good speedup all around.

  1. For heavy fonts, i.e. those having dozens, or over a hundred, lookups (Amiri, IranNastaliq, heavy Indic fonts, etc.), the bottleneck is just looping over the lookups, and for each lookup over the buffer contents, checking whether the current lookup applies to the current glyph; normally that would be a binary search in the Coverage table. I made this much faster a few years ago by adding the hb_set_digest_t things, which are narrow (3*sizeof(long) bytes) structures that implement approximate matching, similar to Bloom filters or quotient filters. These digests use only bitwise operations, so they can be easily vectorized. Combined with a gather operation, or just multiple fetches in a row (which should parallelize), this will allow us to skip over 4 or 8 glyphs at a time.

  2. For fast fonts, like simple Latin fonts such as Roboto, the majority of time is spent binary-searching the Coverage tables of the kern and liga lookups. For this, any way to speed up the binary search using vectorization will be helpful, or possibly even just parallelism rather than vectorization. For example, implementing a quaternary search instead of binary, one that fetches the three mid values first and does the comparisons afterwards, might be much faster if memory access is the bottleneck.

behdad commented 7 years ago

@drott

behdad commented 7 years ago

For heavy fonts, i.e. those having dozens, or over a hundred, lookups (Amiri, IranNastaliq, heavy Indic fonts, etc.), the bottleneck is just looping over the lookups, and for each lookup over the buffer contents, checking whether the current lookup applies to the current glyph; normally that would be a binary search in the Coverage table. I made this much faster a few years ago by adding the hb_set_digest_t things, which are narrow (3*sizeof(long) bytes) structures that implement approximate matching, similar to Bloom filters or quotient filters. These digests use only bitwise operations, so they can be easily vectorized. Combined with a gather operation, or just multiple fetches in a row (which should parallelize), this will allow us to skip over 4 or 8 glyphs at a time.

This happens in apply_forward(), and the function to be vectorized is may_have().
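
For context, a minimal sketch of the digest idea (simplified; the real hb_set_digest_t combines three such filters with different shifts, hence the 3*sizeof(long) size mentioned above):

#include <stdint.h>

/* One level of the digest: a Bloom-filter-like bitmask over glyph-id
 * buckets.  add() only sets bits, so may_have() has no false negatives;
 * a false positive just means we fall through to the exact check. */
template <typename mask_t, unsigned int shift>
struct set_digest_t
{
  mask_t mask = 0;

  void add (uint32_t g)
  { mask |= mask_for (g); }

  bool may_have (uint32_t g) const
  { return (mask & mask_for (g)) != 0; }

  static mask_t mask_for (uint32_t g)
  { return (mask_t) 1 << ((g >> shift) & (sizeof (mask_t) * 8 - 1)); }
};

Since may_have() is just a shift, an AND, and a compare, mapping it over several glyph ids in SIMD lanes is straightforward; a vectorized sketch appears further down in this thread.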

behdad commented 7 years ago

For fast fonts, like simple Latin fonts such as Roboto, the majority of time is spent binary-searching the Coverage tables of the kern and liga lookups. For this, any way to speed up the binary search using vectorization will be helpful, or possibly even just parallelism rather than vectorization. For example, implementing a quaternary search instead of binary, one that fetches the three mid values first and does the comparisons afterwards, might be much faster if memory access is the bottleneck.

This will be SortedArrayOf::bsearch(), although hb-ot-cmap-table.hh also has a custom bsearch() that can be sped up as well.
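
For reference, a minimal scalar sketch of the quaternary idea (qsearch and the flat keys array are illustrative, not HarfBuzz code): the three probe loads are issued before any comparison depends on them, so the memory accesses can overlap.

#include <stdint.h>

static int
qsearch (const uint16_t *keys, unsigned int len, uint16_t g)
{
  unsigned int lo = 0, hi = len;
  while (hi - lo > 3)
  {
    unsigned int q = (hi - lo) / 4;
    unsigned int i1 = lo + q, i2 = lo + 2 * q, i3 = lo + 3 * q;
    /* Fetch all three mid values up front; the loads can overlap. */
    uint16_t k1 = keys[i1], k2 = keys[i2], k3 = keys[i3];
    if      (g < k1)  hi = i1;
    else if (g < k2) { lo = i1; hi = i2; }
    else if (g < k3) { lo = i2; hi = i3; }
    else              lo = i3;
  }
  for (unsigned int i = lo; i < hi; i++)
    if (keys[i] == g) return (int) i;
  return -1;
}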

Adenilson commented 7 years ago

Joining the discussion, as we talked about this subject at the WebEngines Hackfest.

roozbehp commented 7 years ago

/cc @nona-google

behdad commented 6 years ago

For speeding up bsearch, I tried going towards vectorization and mostly failed. Some research brought this up:

http://databasearchitects.blogspot.com/2015/09/trying-to-speed-up-binary-search.html

Looks like the branchless version might give us a nice quick boost. Going to try.

behdad commented 6 years ago

Looks like the branchless version might give us a nice quick boost. Going to try.

Nope. No matter what change I make to my main bsearch, it becomes slower!

behdad commented 6 years ago

Looks like the branchless version might give us a nice quick boost. Going to try. Nope. No matter what change I make to my main bsearch, it becomes slower!

Here's the patch for future reference:

diff --git a/src/hb-open-type-private.hh b/src/hb-open-type-private.hh
index a12c4d23..c01719e6 100644
--- a/src/hb-open-type-private.hh
+++ b/src/hb-open-type-private.hh
@@ -1033,20 +1033,16 @@ struct SortedArrayOf : ArrayOf<Type, LenType>
   inline int bsearch (const SearchType &x) const
   {
     /* Hand-coded bsearch here since this is in the hot inner loop. */
-    const Type *array = this->array;
-    int min = 0, max = (int) this->len - 1;
-    while (min <= max)
+    /* http://databasearchitects.blogspot.com/2015/09/trying-to-speed-up-binary-search.html */
+    const Type *lower = this->array;
+    unsigned int n = this->len;
+    while (unsigned int half = n / 2)
     {
-      int mid = (min + max) / 2;
-      int c = array[mid].cmp (x);
-      if (c < 0)
-        max = mid - 1;
-      else if (c > 0)
-        min = mid + 1;
-      else
-        return mid;
+      const Type *middle = lower + half;
+      lower = middle->cmp (x) >= 0 ? middle : lower;
+      n -= half;
     }
-    return -1;
+    return lower->cmp (x) == 0 ? lower - this->array : -1;
   }
 };

Adenilson commented 5 years ago

I think I was finally able to complete the zlib optimization work (https://chromium-review.googlesource.com/1173262 and https://bugs.chromium.org/p/chromium/issues/detail?id=873725) and resume the work on HarfBuzz (HB).

Main issues I identified on HB were:

a) bsearch is crazy hot. I mean really, really hot! Just adding a NOP at the function start is enough to add 5 to 10% to processing time (depending on the platform) while running hb-shape.

b) Branch predictors are like cats: they may look the same but behave differently. An example is that on Intel the cost of branch mispredictions was around 0.5 to 1% when running hb-shape with a long Latin document, which explains why the branchless patch above didn't help.

Adenilson commented 5 years ago

So variations around branching, or trying to unroll the loop, didn't really produce consistent results between x86 and ARM.

That being said, one line struck me as odd: https://github.com/harfbuzz/harfbuzz/blob/master/src/hb-open-type.hh#L729

I wondered why not check for this->len == 1? The first question to be answered was how frequent that was, so adding a patch (https://gist.github.com/Adenilson/23c3d9b85e6e6e17425e541778d7388d) to count the occurrences produced these results:

a) Latin (0.71% of iterations):

./hb-shape Roboto-Regular.ttf --text-file=thelittleprince.txt --font-funcs=ot --output-format= --output-file=/dev/null --num-iterations=1

... running...
bsearch array[len]: 0 Total: 182942 CornerCase: 1304
bsearch array[len]: 0 Total: 182943 CornerCase: 1305

b) Complex (3.9% of iterations):

./hb-shape IranNastaliq2.ttf --text-file=test-long-paragraph.txt --output-file=/dev/null --num-iterations=1

... running...
bsearch array[len]: 0 Total: 334747 CornerCase: 13152
bsearch array[len]: 0 Total: 334749 CornerCase: 13153

Adenilson commented 5 years ago

Still, I decided to special-case len == 1 (see https://github.com/harfbuzz/harfbuzz/pull/1386) and got these results:

a) Vanilla: 1000x runs (Latin and Complex)

adenilson@chrome-monster:~/temp/harfbuzz/build$ ./run.sh

real 0m13.343s user 0m13.343s sys 0m0.001s

adenilson@chrome-monster:~/temp/harfbuzz/build$ ./complex.sh

real 0m25.025s user 0m25.020s sys 0m0.004s

b) Patched: 1000x runs (Latin and Complex)

adenilson@chrome-monster:~/temp/harfbuzz/insane$ ./run.sh

real 0m4.094s user 0m4.094s sys 0m0.000s

adenilson@chrome-monster:~/temp/harfbuzz/insane$ ./complex.sh

real 0m24.629s user 0m24.624s sys 0m0.004s

Adenilson commented 5 years ago

Assuming I didn't make any major mistake (famous last words...), for x86 it was a 3.3x boost for Latin without regressions for complex scripts (actually a 2% improvement).

For ARM, the results were even better: a 3.8x boost on little cores (A53) and a 3.6x boost on big cores (A72).

Adenilson commented 5 years ago

What is next: I would really appreciate double verification from @behdad to make sure that all is fine (and the perf numbers are correct).

We also need to fix a few failing tests. After applying the patch, the results are: 87% tests passed, 21 tests failed out of 159

Total Test time (real) = 7.44 sec

The following tests did not run:
157 - tests/basics.tests (Skipped)
158 - tests/full-font.tests (Skipped)
159 - tests/japanese.tests (Skipped)

The following tests FAILED:
38 - tests/arabic-mark-order.tests (Failed)
44 - tests/context-matching.tests (Failed)
45 - tests/cursive-positioning.tests (Failed)
50 - tests/hyphens.tests (Failed)
52 - tests/indic-decompose.tests (Failed)
54 - tests/indic-joiner-candrabindu.tests (Failed)
55 - tests/indic-joiners.tests (Failed)
57 - tests/indic-pref-blocking.tests (Failed)
61 - tests/indic-vowel-letter-spoofing.tests (Failed)
62 - tests/khmer-mark-order.tests (Failed)
63 - tests/khmer-misc.tests (Failed)
64 - tests/language-tags.tests (Failed)
65 - tests/ligature-id.tests (Failed)
66 - tests/mark-attachment.tests (Failed)
67 - tests/mark-filtering-sets.tests (Failed)
68 - tests/mongolian-variation-selector.tests (Failed)
74 - tests/simple.tests (Failed)
83 - tests/variations-rvrn.tests (Failed)
85 - tests/zero-width-marks.tests (Failed)
98 - tests/GPOS-4.tests (Failed)
100 - tests/GSUB-1.tests (Failed)
Errors while running CTest

ebraminio commented 5 years ago

I guess templatizing bsearch, where possible, would introduce some boost; or has that been tested already?

behdad commented 5 years ago

I guess templatizing bsearch, where possible, would introduce some boost; or has that been tested already?

That's already the case, and why we don't use libc bsearch().
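
For illustration, a sketch of the difference (inline_bsearch is a hypothetical name mirroring the hand-coded SortedArrayOf::bsearch shown in the patch above): the templated cmp() call inlines into the loop, whereas libc bsearch() pays an indirect call through a comparator function pointer on every probe.

template <typename Type, typename SearchType>
static int
inline_bsearch (const Type *array, unsigned int len, const SearchType &x)
{
  int min = 0, max = (int) len - 1;
  while (min <= max)
  {
    int mid = (min + max) / 2;
    int c = array[mid].cmp (x);  /* resolved at compile time and inlined */
    if (c < 0) max = mid - 1;
    else if (c > 0) min = mid + 1;
    else return mid;
  }
  return -1;
}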

behdad commented 5 years ago

We also need to fix a few failing tests

Wait. All our tests should pass. Let's figure out what's wrong with your setup first.

behdad commented 5 years ago

a) bsearch is crazy hot. I mean really, really hot!

Right. For shaping Latin text, bsearch() is pretty much all we do. Everything else, where the data or algorithm is static, is already optimized. It's just looking up a glyph number in font data that we haven't been able to do better so far, so we end up doing that all the time.

ebraminio commented 5 years ago

That's already the case, and why we don't use libc bsearch().

Right. I see some corner cases now.

behdad commented 5 years ago

Right. I see some corner cases now.

The use of hb_bsearch is the exception. I see that the use in base-table should be replaced by SortedArrayOf::bsearch.

behdad commented 5 years ago

I thought about this more. Some starting points:

https://software.intel.com/sites/landingpage/IntrinsicsGuide/

The following header:

/usr/lib/gcc/x86_64-redhat-linux/8/include/x86intrin.h

And the following example intrinsics:

_mm512_i32extgather_epi32
_mm512_mask_i32extgather_epi32
_mm512_prefetch_i32extgather_ps
_mm512_mask_prefetch_i32extgather_ps
_mm512_cmp_epi16_mask

Back to my original report, here's how to apply this to the two issues I identified:

  • For heavy fonts, i.e. those having dozens, or over a hundred, lookups (Amiri, IranNastaliq, heavy Indic fonts, etc.), the bottleneck is just looping over the lookups, and for each lookup over the buffer contents, checking whether the current lookup applies to the current glyph; normally that would be a binary search in the Coverage table. I made this much faster a few years ago by adding the hb_set_digest_t things, which are narrow (3*sizeof(long) bytes) structures that implement approximate matching, similar to Bloom filters or quotient filters. These digests use only bitwise operations, so they can be easily vectorized. Combined with a gather operation, or just multiple fetches in a row (which should parallelize), this will allow us to skip over 4 or 8 glyphs at a time.

In the main per-lookup loop (apply_forward), we collect 16/32 glyph indices from the buffer at a time to check them against the lookup's set_digest accelerator. The set digest consists of bit operations, so those need to happen (via a template with overloaded operators, perhaps) on the whole group of glyph indices at the same time. The final operation will simply look for the first glyph index that passes the test and jump to it for applying the subtables; otherwise, the entire group is skipped over.
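
A rough AVX2 sketch of that group test (may_have_8 and the contiguous glyphs array are assumptions for illustration; buffer entries are really strided hb_glyph_info_t structs, which is where a gather, or several scalar fetches, would come in):

#include <immintrin.h>
#include <stdint.h>

/* Sketch: test 8 glyph ids at once against one 32-bit-mask digest level
 * (shift and mask as in the scalar digest sketch above).  Returns zero
 * when all 8 glyphs definitely miss and can be skipped. */
template <unsigned int shift>
static inline uint32_t
may_have_8 (uint32_t mask, const uint32_t glyphs[8])
{
  __m256i g    = _mm256_loadu_si256 ((const __m256i *) glyphs);
  __m256i idx  = _mm256_and_si256 (_mm256_srli_epi32 (g, shift),
                                   _mm256_set1_epi32 (31));
  __m256i bit  = _mm256_sllv_epi32 (_mm256_set1_epi32 (1), idx); /* AVX2 variable shift */
  __m256i hit  = _mm256_and_si256 (bit, _mm256_set1_epi32 ((int) mask));
  __m256i miss = _mm256_cmpeq_epi32 (hit, _mm256_setzero_si256 ());
  /* movemask sets 4 bits per missing lane; inverting yields the hit lanes. */
  return ~(uint32_t) _mm256_movemask_epi8 (miss);
}

apply_forward() would then advance 8 glyphs when this returns zero, or scan for the first nonzero nibble to find the first candidate glyph.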

  • For fast fonts, like simple Latin fonts such as Roboto, the majority of time is spent binary-searching the Coverage tables of the kern and liga lookups. For this, any way to speed up the binary search using vectorization will be helpful, or possibly even just parallelism rather than vectorization. For example, implementing a quaternary search instead of binary, one that fetches the three mid values first and does the comparisons afterwards, might be much faster if memory access is the bottleneck.

This one requires speeding up the binary search of a glyph index into Coverage format 1/2. This would perform a 16-ary search instead of binary, by loading multiple items, comparing them to the glyph index being searched, and jumping to the sub-range that matches.
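
A sketch of the per-node comparison step of such a 16-ary search (rank16 is a hypothetical helper; it assumes native-endian keys and GCC/Clang's __builtin_popcount, whereas real OpenType arrays are big-endian, so a byte swap would be needed first):

#include <immintrin.h>
#include <stdint.h>

/* Sketch: rank of glyph id g among 16 sorted 16-bit keys, i.e. how many
 * keys are <= g.  The caller would then descend into sub-range [rank]. */
static inline unsigned int
rank16 (uint16_t g, const uint16_t keys[16])
{
  const __m256i bias = _mm256_set1_epi16 ((short) 0x8000); /* signed -> unsigned compare */
  __m256i k  = _mm256_xor_si256 (_mm256_loadu_si256 ((const __m256i *) keys), bias);
  __m256i q  = _mm256_set1_epi16 ((short) (g ^ 0x8000));
  __m256i gt = _mm256_cmpgt_epi16 (k, q);      /* 0xFFFF where key > g */
  uint32_t m = (uint32_t) _mm256_movemask_epi8 (gt);
  return (unsigned int) __builtin_popcount (~m) / 2;  /* lanes with key <= g */
}

Each probe then narrows the range by a factor of 16 instead of 2, so searching a Coverage array needs roughly a quarter of the dependent memory accesses.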

Again, not sure if these will work, but they should be tried. Writing them using a fixed set of intrinsics is fine for testing; we can figure out portability if we end up pursuing this for real.

behdad commented 5 years ago

@ebraminio

behdad commented 4 years ago

I've started implementing designs sketched in this issue: https://github.com/harfbuzz/harfbuzz/pull/2065

I found that AVX2 allows doing exactly what I wanted, though with a wrinkle or two. Ongoing work. For ARM, we need to figure out how to load the data, since gather is missing.

behdad commented 2 years ago

I've now done enough experiments, and failed at enough of them, that I'm ready to close this issue.