Closed LaurenzV closed 3 months ago
@RazrFalcon Do you know if there is a particular reason we don't use the `unicode_normalization` crate for composing/decomposing characters? Or has just no one bothered to switch to it?
Looks like you consciously removed it: https://github.com/RazrFalcon/rustybuzz/commit/f0e5a766
However, there is a new crate from the icu4x folks; is there anything speaking against using it directly? The reason I'm asking is that there is something wrong with our current table.
I presume this could be fixed by improving the generation, but I don't see why we should do that if someone else already did it. It does depend on `tinyvec`, but has no other dependencies.
As you can guess, I do not remember; it was a long time ago. But I do remember that we had some issues with external crates: either they weren't low-level enough or they produced output different from HB's.
If you can replace embedded Unicode tables - I'm all for it.
In general, a rule of thumb when it comes to RB: if something is strange then it's because we had to match HB output.
Also remember that HB/RB has its own unicode normalization algorithm. We cannot use a third-party crate for that.
Yep, that I know. But perhaps I know the reason why now: it seems like harfbuzz always decomposes a character into 2 units, while the `unicode_normalization` crate always decomposes as far as possible, which can yield more than 2. So I'll have to see if I can figure it out.
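To illustrate the difference, here is a minimal sketch with a hardcoded two-entry decomposition table (the pairs are taken from UnicodeData.txt; the real shapers use generated tables). HB-style decomposition yields at most a pair per step, while full canonical decomposition, as produced by `unicode_normalization`, recurses until nothing decomposes further:

```rust
use std::collections::HashMap;

// Tiny excerpt of canonical decompositions, hardcoded for illustration only.
fn pair_table() -> HashMap<char, (char, char)> {
    let mut t = HashMap::new();
    t.insert('\u{01D5}', ('\u{00DC}', '\u{0304}')); // Ǖ → Ü + combining macron
    t.insert('\u{00DC}', ('\u{0055}', '\u{0308}')); // Ü → U + combining diaeresis
    t
}

// HB-style: a single decomposition step always yields at most 2 units.
fn decompose_step(c: char) -> Option<(char, char)> {
    pair_table().get(&c).copied()
}

// Full canonical decomposition: recurse on the first unit until nothing
// decomposes further, flattening the result.
fn decompose_full(c: char, out: &mut Vec<char>) {
    match decompose_step(c) {
        Some((a, b)) => {
            decompose_full(a, out);
            out.push(b);
        }
        None => out.push(c),
    }
}
```

So for Ǖ (U+01D5), a single HB-style step gives the pair (U+00DC, U+0304), while the fully flattened form is the three units U+0055 U+0308 U+0304.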
@behdad Is it expected that `HB_NO_OT_RULESETS_FAST_PATH` changes the shaping result? With the following font, when running
```
hb-shape NotoSerifGujarati-VariableFont_wght.ttf --no-glyph-names --unicodes U+0ABE,U+0AA8,U+0ACD,U+200D,U+0AA4,U+0ABF
```
I get
```
[414=0+596|60=0+251|61=1+251|186=1+293|3=1+0|38=1+543]
```
while if I enable `HB_NO_OT_RULESETS_FAST_PATH` I get
```
[414=0+596|60=0+251|102=1+251|186=1+293|3=1+0|38=1+543]
```
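For readers unfamiliar with the format: `hb-shape` serializes each glyph as `glyph_id=cluster+x_advance` (offsets are omitted here because they are zero). A small decoder, just a sketch, makes the diff easier to spot; the two runs above differ only in the third glyph (61 vs. 102):

```rust
// Parse hb-shape's serialized output (without offsets) into
// (glyph_id, cluster, x_advance) triples. Returns None on malformed input.
fn parse(s: &str) -> Option<Vec<(u32, u32, i32)>> {
    s.trim()
        .strip_prefix('[')?
        .strip_suffix(']')?
        .split('|')
        .map(|entry| {
            let (gid, rest) = entry.split_once('=')?;
            let (cluster, advance) = rest.split_once('+')?;
            Some((gid.parse().ok()?, cluster.parse().ok()?, advance.parse().ok()?))
        })
        .collect()
}
```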
> it seems like harfbuzz always decomposes a character into 2 units
Yes, this rings a bell.
@RazrFalcon See the description at the top for a more in-depth explanation. I think this first PR should be ready now (also, if possible, merge it unsquashed, as I tried my best to make each fix a separate commit).
Blocked by https://github.com/RazrFalcon/ttf-parser/pull/164.
Once again I cannot thank you enough for your work.
I completely agree with your methodology. I tried fuzzing RB a long time ago via AFL, but it was mostly useless. Simply throwing random data at a shaper doesn't work that well, and guided fuzzing is beyond my level.
If only we had something like resvg-test-suite, but for shaping. The HB test suite is close, but as you saw, it barely scratches the surface.
Also, some of the bugs you have fixed are very strange. No idea how I was able to mess up feature flags like `F_MANUAL_ZWJ`. This was mostly copy-pasted code with a Rust flavor; either it was changed later or I messed up badly.
And no, even a single fixed bug is more than enough. 8 is beyond good. After all, the goal of RB is to be 1:1 with HB.
> The disadvantage is that we are not including CFF fonts this way
Google Fonts do not use CFF? That's news to me.
On the other hand, `glyf`/`CFF` should not affect shaping in 99% of the cases.
> Overall, this still led to 1000+ (more or less) unique fonts to choose from.
macOS alone has like 800 fonts pre-installed and most of them are insane and worth testing against. You will not be able to include them into tests, aka subset, but it's still worth testing.
> You will not be able to include them into tests, aka subset, but it's still worth testing.
We can't include them in the repo, but since we have a macOS CI now, we can test them there. :) But one step at a time.
Okay, so what I've been doing is the following: the basic idea is to fuzz against as many fonts as possible and check whether the output from harfbuzz matches the output from rustybuzz. For that, we need two things: a collection of fonts, and text inputs to shape with them.
Regarding 1., the most obvious choice was to download the Google Fonts collection, since it's freely available. The disadvantage is that we are not including CFF fonts this way, but it's still a very solid starting point. So I basically downloaded the fonts and excluded any fonts that contain keywords such as "bold", "italic", etc., so as to not test the same variant of a font multiple times. Overall, this still led to 1000+ (more or less) unique fonts to choose from.
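The variant filtering could look something like the sketch below. Note that the exact keyword list here is my guess; only "bold" and "italic" are named above, and the others are common variant names I added for illustration:

```rust
// Keep only one variant per font family by skipping filenames that
// contain common style keywords (hypothetical list, case-insensitive).
fn is_base_variant(file_name: &str) -> bool {
    let lower = file_name.to_lowercase();
    let skip = ["bold", "italic", "light", "thin", "black", "medium", "semi"];
    lower.ends_with(".ttf") && !skip.iter().any(|k| lower.contains(k))
}
```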
Regarding 2., in the beginning my idea was to generate random sequences of Unicode inputs based on the `cmap` table in the font. However, after a while I realized that this probably wouldn't be very efficient, because the odds of a random sequence triggering specific lookups in GPOS/GSUB are rather low. This made me realize that we should probably base our inputs on real words, since this is what will be shaped in practice after all. Luckily for us, I chanced upon a test corpus by harfbuzz which contains extensive word lists in many different languages and scripts, scraped from Wikipedia. A perfect input for what we want to achieve.

I removed some Latin-based scripts because they are more or less the same, and I also truncated some Latin-based input files because they were just so long (the English one had 22 million lines). In the end, I ended up using the following languages with the following corresponding number of lines (=words):
The ones with 500,000 are the ones I truncated because they otherwise would just be too long. You might ask why I then kept some other languages with more than a million, and the simple answer is that
After I had this, the most challenging part was to figure out which font to use for which languages. Trying all combinations is not feasible time-wise and also a waste (e.g. if we tried to shape some Arabic text with a font that only covers Latin characters, we would only get `.notdef`s anyway).

My basic approach for this was: for each text file, I take a sample of 100 lines and collect all of the characters that appear in them. For each font, I check whether its `cmap` table covers more than 80% of those characters, and if so, I use this combination as a test case. Overall, that seemed to work pretty well, but a problem was that nearly all fonts contain Latin characters, so any Latin-based language would get a lot of fonts; that's another reason I excluded many Latin-based scripts. I also ensured that fonts are in general only matched with one language, excluding a number of languages that didn't have many fonts assigned to them. By doing this, I still had a lot of "garbage assignments" (e.g. NotoSansTaiTham being used for English), but at least I could ensure that every font that does support one of the smaller languages is also used for it.
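The coverage heuristic can be sketched as follows. Here a plain `HashSet<char>` stands in for the font's actual `cmap` lookup (which would go through ttf-parser in practice), and the sample size and threshold are the ones described above:

```rust
use std::collections::HashSet;

// Decide whether a font is worth pairing with a language: collect the
// characters from a sample of lines and require that the font's cmap
// (modeled here as a HashSet<char>) covers at least 80% of them.
fn covers_enough(sample_lines: &[&str], cmap: &HashSet<char>) -> bool {
    let chars: HashSet<char> = sample_lines
        .iter()
        .flat_map(|line| line.chars())
        .filter(|c| !c.is_whitespace())
        .collect();
    if chars.is_empty() {
        return false; // nothing to judge coverage on
    }
    let covered = chars.iter().filter(|c| cmap.contains(c)).count();
    covered as f64 / chars.len() as f64 >= 0.8
}
```

Counting distinct characters (rather than character occurrences) keeps very frequent letters from dominating the coverage score.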
You can find the resulting pairings here: https://gist.github.com/LaurenzV/1d528deabfe4e7d00d248e2f7281482a
And now the last step is to just go through all those combinations and compare the outputs! So far, I've already been able to find around 8 bugs in rustybuzz and (potentially?) 2 bugs in harfbuzz, which is not too impressive but not too bad either. Some of those bugs were really niche though (for example, one was caused by a single wrong letter in the Indic table!), so I do feel like this is a pretty effective method for testing the crate and should give us much more confidence about its correctness. And I'm still far from done, although the remaining languages are mostly "simple" ones where I don't expect too many bugs to be present, but we will see. I will probably split this up over multiple PRs, depending on how many bugs I can still find.

For each bug I find, I'm also adding a new test case. I try my best to always subset the fonts, but unfortunately, so far subsetting nearly always "destroyed" the bug, so I had to include the full one. But they are pretty small anyway, so I hope that's okay.
Future work (sorted by priority, although no promises when or even if I will work on it) includes: