Arabic text - Githubissues

paulmelnikow commented 5 years ago

This library does not correctly measure the width of Arabic text. Arabic is a cursive script whose character widths depend highly on context.

badges/shields#4140 discusses several related issues, though the one to solve here is the width computation.

Here is a simple example (the words for left and right) from the test case @Aissaoui-Ahmed posted:

يسار يمين

Ideally this library would return the correct widths for Arabic text. I'm not sure how much work it would be, or what the tradeoffs would be in terms of file size and memory usage, though it is worth seeing if we can do this.

In the past, the way we've gone about computing width adjustments is through kerning pairs. We identify pairs of letters which cause the combined length of a pair of letters to be shorter than the individual letters combined. Then those adjustments can be included in the width computation. The current code isn't doing that, because the unadjusted widths have been good enough for our purposes, though it would be easy enough to add that back in.

Could that approach work here?

If not (and if it seems too intractable to solve the problem for all spellings) could we start with a solution that handles many but not all words? Could we generate a list of words which should definitely be computed correctly… and maybe a few that we're willing to let go?

/cc @chris48s who is also participating in the thread

chris48s commented 5 years ago

I think the only additional insight I can give on this is that this isn't like an edge case that just applies to Arabic. There are other languages like Urdu and Farsi that have this property where the appearance (and hence width) of a character depends on where it is in the word/sentence.

I've not really spent any significant time looking at the code for this module, but based on the snippet you linked to, it seems like the kerningPair code essentially uses a DOM to "render" two characters next to each other to work out an adjustment/offset to apply to the values from the character lookup table. Is that right?

If we've got the ability to do that, it seems like a more accurate/less painful way might be to just "render" the entire string and measure its width rather than trying to calculate a series of offsets? Perhaps one optimisation that could be applied is to divide the character set into fixed/variable width characters and use the more efficient lookup table-based approach if our string contains only constant width chars but the less efficient approach if it doesn't.

Splitting the whole Unicode space in this way is probably really really hard, so a starting point is maybe to produce an "allow list" containing something like [latin alphabet + numbers + some common punctuation] which we know are intelligible when treated as fixed-width, have that use the more efficient method and then assume everything else could be variable and fall back to the less efficient but more accurate method? The exact contents of the allow list can be extended in time..

Any of that seem vaguely sensible?

paulmelnikow commented 5 years ago

This approach is a total hack, and there are indeed a lot of languages where it doesn’t work well.

A lookup table with fallback to computing individual strings is how we used to do it. The drawback is that the server needs to have the (non-free) font installed, and needs to run the heavy dependency like Puppeteer or PDFKit. It makes it more complicated to host Shields.

An option could be to stuff fallback text-width computation into its own service with its own caching.

I think it may also be possible to work out widths using kerning pairs or similar heuristics for one language at a time. Though that’s not going to be a perfect, and for time and space reasons it needs to be tackled one language at a time.

We could also omit textWidth when we suspect an incorrect width, which would produce less wacky results than what we’re seeing now.

chris48s commented 5 years ago

The drawback is that the server needs to have the (non-free) font installed, and needs to run the heavy dependency like Puppeteer or PDFKit. It makes it more complicated to host Shields.

Right, got it.. so the kerningPair code that currently exists can be used while building the static lookup table, but not on-the-fly, right?

We could also omit textWidth when we suspect an incorrect width

Can you give an example of what the impact of this would be?

paulmelnikow commented 5 years ago

Right, got it.. so the kerningPair code that currently exists can be used while building the static lookup table, but not on-the-fly, right?

Yes, I think? The kerning pair code creates a second static lookup table, which in turn could be used on the fly. However the underlying measuring code that handles arbitrary strings uses Puppeteer and Verdana, which we don't want to use on the fly.

We could also omit textWidth when we suspect an incorrect width

Can you give an example of what the impact of this would be?

I think it's that the badges would be vulnerable to the issues described in badges/shields#848 and badges/shields#746. (See longer description at https://github.com/badges/shields/issues/4140#issuecomment-538642240)

chris48s commented 5 years ago

Here's another thought..

Obviously the ideal outcome is we assign a string the correct width, but the thing we really don't want to do is assign insufficient width. i.e: if the string needs 100px to display and we say the string is only 90px long, that's a problem. If we say its 110px long, we're going to render too much spacing but we'll at least give the string enough room to render and not overflow.

I wonder how good/bad it would be if we assume a worst-case scenario and assign cursive script characters the width of the widest variant? That would at least be safe, if inaccurate, or is that essentially what we're trying to improve on?

paulmelnikow commented 5 years ago

Yes, I agree with that.

However, for that to work, we'd need to stop specifying the exact width of the text, which is what fixes badges/shields#848 and badges/shields#746.

We could try to detect cursive script (maybe in anafanafo) and emit a flag that says "this length may be suspect." Then in Shields when that happens we decide we may be wrong about the width, and render without the text width so the cursive renders correctly.

Badges with cursive would look a little bit wrong when the minimum font size is very high, though that would be better than being completely unreadable regardless of font size.

metabolize / anafanafo

Arabic text #114