Closed LaurenzV closed 3 months ago
@RazrFalcon Do you know if there is a particular reason we don't use the `unicode_normalization` crate for composing/decomposing characters? Or has just no one bothered to switch to it?
Looks like you consciously removed it: https://github.com/RazrFalcon/rustybuzz/commit/f0e5a766
However, there is a new crate from the icu4x folks; is there anything speaking against using it directly? The reason I'm asking is that there is something wrong with our current table.
I presume this could be fixed by improving the generation, but I don't see why we should do that if someone else already did it. It does depend on `tinyvec`, but has no other dependencies.
As you can guess, I do not remember; it was a long time ago. But I do remember that we had some issues with external crates: either they weren't low-level enough or they produced output different from HB's.
If you can replace embedded Unicode tables - I'm all for it.
In general, a rule of thumb when it comes to RB: if something is strange then it's because we had to match HB output.
Also remember that HB/RB has its own unicode normalization algorithm. We cannot use a third-party crate for that.
Yep, that I know. But perhaps I know the reason why now: it seems like harfbuzz always decomposes a character into 2 units, while the `unicode_normalization` crate always decomposes as far as possible, which can yield more than 2. So I'll have to see if I can figure it out.
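To illustrate the difference, here is a minimal sketch with a hardcoded two-entry decomposition table (the pairs are taken from UnicodeData.txt; the real shapers use generated tables). HB-style decomposition yields at most a pair per step, while full canonical decomposition, as produced by `unicode_normalization`, recurses until nothing decomposes further:

```rust
use std::collections::HashMap;

// Tiny excerpt of canonical decompositions, hardcoded for illustration only.
fn pair_table() -> HashMap<char, (char, char)> {
    let mut t = HashMap::new();
    t.insert('\u{01D5}', ('\u{00DC}', '\u{0304}')); // Ǖ → Ü + combining macron
    t.insert('\u{00DC}', ('\u{0055}', '\u{0308}')); // Ü → U + combining diaeresis
    t
}

// HB-style: a single decomposition step always yields at most 2 units.
fn decompose_step(c: char) -> Option<(char, char)> {
    pair_table().get(&c).copied()
}

// Full canonical decomposition: recurse on the first unit until nothing
// decomposes further, flattening the result.
fn decompose_full(c: char, out: &mut Vec<char>) {
    match decompose_step(c) {
        Some((a, b)) => {
            decompose_full(a, out);
            out.push(b);
        }
        None => out.push(c),
    }
}
```

So for Ǖ (U+01D5), a single HB-style step gives the pair (U+00DC, U+0304), while the fully flattened form is the three units U+0055 U+0308 U+0304.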
@behdad Is it expected that `HB_NO_OT_RULESETS_FAST_PATH` changes the shaping result? With the following font, when running
```
hb-shape NotoSerifGujarati-VariableFont_wght.ttf --no-glyph-names --unicodes U+0ABE,U+0AA8,U+0ACD,U+200D,U+0AA4,U+0ABF
```
I get
```
[414=0+596|60=0+251|61=1+251|186=1+293|3=1+0|38=1+543]
```
while if I enable `HB_NO_OT_RULESETS_FAST_PATH` I get
```
[414=0+596|60=0+251|102=1+251|186=1+293|3=1+0|38=1+543]
```
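For readers unfamiliar with the format: `hb-shape` serializes each glyph as `glyph_id=cluster+x_advance` (offsets are omitted here because they are zero). A small decoder, just a sketch, makes the diff easier to spot; the two runs above differ only in the third glyph (61 vs. 102):

```rust
// Parse hb-shape's serialized output (without offsets) into
// (glyph_id, cluster, x_advance) triples. Returns None on malformed input.
fn parse(s: &str) -> Option<Vec<(u32, u32, i32)>> {
    s.trim()
        .strip_prefix('[')?
        .strip_suffix(']')?
        .split('|')
        .map(|entry| {
            let (gid, rest) = entry.split_once('=')?;
            let (cluster, advance) = rest.split_once('+')?;
            Some((gid.parse().ok()?, cluster.parse().ok()?, advance.parse().ok()?))
        })
        .collect()
}
```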
> it seems like harfbuzz always decomposes a character into 2 units
Yes, this rings a bell.
@RazrFalcon See the description at the top for a more in-depth explanation. I think this first PR should be ready now (also, if possible, merge it unsquashed, as I tried my best to make each fix a separate commit).
Blocked by https://github.com/RazrFalcon/ttf-parser/pull/164.
Once again I cannot thank you enough for your work.
I completely agree with your methodology. I tried fuzzing RB a long time ago via AFL, but it was mostly useless. Simply throwing random data at a shaper doesn't work that well, and guided fuzzing is beyond my level.
If only we had something like resvg-test-suite, but for shaping. The HB test suite is close, but as you saw, it barely scratches the surface.
Also, some of the bugs you have fixed are very strange. No idea how I was able to mess up feature flags like `F_MANUAL_ZWJ`. This was mostly copy-pasted code with a Rust flavor; either it was changed later or I messed up badly.
And no, even a single fixed bug is more than enough. 8 is beyond good. After all, the goal of RB is to be 1:1 with HB.
> The disadvantage is that we are not including CFF fonts this way
Google Fonts do not use CFF? That's news to me.
On the other hand, `glyf`/`CFF` should not affect shaping in 99% of the cases.
> Overall, this still led to 1000+ (more or less) unique fonts to choose from.
macOS alone has like 800 fonts pre-installed and most of them are insane and worth testing against. You will not be able to include them into tests, aka subset, but it's still worth testing.
> You will not be able to include them into tests, aka subset, but it's still worth testing.
We can't include them in the repo, but since we have a macOS CI now, we can test them there. :) But one step at a time.
Okay, so what I've been doing is the following: the basic idea is to fuzz against as many fonts as possible and check whether the output from harfbuzz matches the output from rustybuzz. For that, we need two things: a collection of fonts, and text inputs to shape with them.
Regarding 1., the most obvious choice was to download the Google Fonts collection, since it's freely available. The disadvantage is that we are not including CFF fonts this way, but it's still a very solid starting point. So I basically downloaded the fonts and excluded any fonts that contain keywords such as "bold", "italic", etc., so as to not test the same variant of a font multiple times. Overall, this still led to 1000+ (more or less) unique fonts to choose from.
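The variant filtering could look something like the sketch below. Note that the exact keyword list here is my guess; only "bold" and "italic" are named above, and the others are common variant names I added for illustration:

```rust
// Keep only one variant per font family by skipping filenames that
// contain common style keywords (hypothetical list, case-insensitive).
fn is_base_variant(file_name: &str) -> bool {
    let lower = file_name.to_lowercase();
    let skip = ["bold", "italic", "light", "thin", "black", "medium", "semi"];
    lower.ends_with(".ttf") && !skip.iter().any(|k| lower.contains(k))
}
```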
Regarding 2., in the beginning my idea was to generate random sequences of Unicode inputs based on the `cmap` table in the font. However, after a while I realized that this probably wouldn't be very efficient, because the odds of a random sequence triggering specific lookups in GPOS/GSUB are rather low. This made me realize that we should probably base our inputs on real words, since this is what will be shaped in practice after all. Luckily for us, I chanced upon a test corpus by harfbuzz which contains extensive word lists in many different languages and scripts, scraped from Wikipedia. A perfect input for what we want to achieve.

I removed some Latin-based scripts because they are more or less the same, and I also truncated some Latin-based input files because they were just so long (the English one had 22 million lines). In the end, I ended up using the following languages with the following corresponding number of lines (=words):
The ones with 500,000 are the ones I truncated because they otherwise would just be too long. You might ask why I then kept some other languages with more than a million, and the simple answer is that
After I had this, the most challenging part was to figure out which font to use for which languages. Trying all combinations is not feasible time-wise and also a waste (e.g. if we tried to shape some Arabic text with a font that only covers Latin characters, we would only get `.notdef`s anyway).

My basic approach for this was: for each text file, I take a sample of 100 lines and collect all of the characters that appear in them. For each font, I check whether its `cmap` table covers more than 80% of those characters, and if so, I use this combination as a test case. Overall, that seemed to work pretty well, but a problem was that nearly all fonts contain Latin characters, so any Latin-based language would get a lot of fonts; that's another reason I excluded many Latin-based scripts. I also ensured that fonts are in general only matched with one language, excluding a number of languages that didn't have many fonts assigned to them. By doing this, I still had a lot of "garbage assignments" (e.g. NotoSansTaiTham being used for English), but at least I could ensure that every font that does support one of the smaller languages is also used for it.
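The coverage heuristic can be sketched as follows. Here a plain `HashSet<char>` stands in for the font's actual `cmap` lookup (which would go through ttf-parser in practice), and the sample size and threshold are the ones described above:

```rust
use std::collections::HashSet;

// Decide whether a font is worth pairing with a language: collect the
// characters from a sample of lines and require that the font's cmap
// (modeled here as a HashSet<char>) covers at least 80% of them.
fn covers_enough(sample_lines: &[&str], cmap: &HashSet<char>) -> bool {
    let chars: HashSet<char> = sample_lines
        .iter()
        .flat_map(|line| line.chars())
        .filter(|c| !c.is_whitespace())
        .collect();
    if chars.is_empty() {
        return false; // nothing to judge coverage on
    }
    let covered = chars.iter().filter(|c| cmap.contains(c)).count();
    covered as f64 / chars.len() as f64 >= 0.8
}
```

Counting distinct characters (rather than character occurrences) keeps very frequent letters from dominating the coverage score.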
You can find the resulting pairings here: https://gist.github.com/LaurenzV/1d528deabfe4e7d00d248e2f7281482a
And now the last step is to just go through all those combinations and compare the outputs! So far, I've already been able to find around 8 bugs in rustybuzz and (potentially?) 2 bugs in harfbuzz, which is not too impressive but not too bad either. Some of those bugs were really niche though (for example, one was caused by a single wrong letter in the Indic table!), so I do feel like this is a pretty effective method for testing the crate and should give us much more confidence about its correctness. And I'm still far from done, although the remaining languages are mostly "simple" ones where I don't expect too many bugs to be present, but we will see. I will probably split this up over multiple PRs, depending on how many bugs I can still find.

For each bug I find, I'm also adding a new test case. I try my best to always subset the fonts, but unfortunately, so far subsetting nearly always "destroyed" the bug, so I had to include the full one. But they are pretty small anyway, so I hope that's okay.
Future work (sorted by priority, although no promises when or even if I will work on it) includes: