Hebrew combining diacritics aren't positioned correctly

py-pdf / fpdf2

Simple PDF generation for Python

https://py-pdf.github.io/fpdf2/

GNU Lesser General Public License v3.0

Hebrew combining diacritics aren't positioned correctly #549

Closed marcstober closed 1 year ago

marcstober commented 2 years ago

Thank you for keeping this open source project going!

I can't get Hebrew combining diacritics ("vowels") to appear correctly, even after looking into some of the solutions proposed for similar issues.

For example, here is a Hebrew letter BET with a DAGESH (dot in the middle): בּ

And here is a screen shot from Word:

I've seen some proposed workarounds to similar issues in #490 and experimented with them, as seen in the following code. Here are the results and here's why I think they don't work and this should be tracked as a separate bug:

One part of the solution in #490 is to use arabic_reshaper. I don't think this hurts, but I also don't think it affects Hebrew.
Another part of the solution in #490 is to use bidi.algorithm.get_display. This reverses the order of the characters. I don't think it's actually correct to reverse the order of combining diacritics; they should still come after their base character in the string, even in RTL languages. (This might be something to fix in get_display.) This appears to be what causes the DAGESH to move from being misplaced on one side to being misplaced on the other side of the BET.
There's also a proposed solution in #490 of using Unicode normalization. However, this doesn't work for Hebrew. Hebrew is excluded from the Unicode composition algorithm (see here). Moreover, while the example of BET WITH DAGESH happens to have a composed character, there are very limited basic composed characters (my guess is only what's needed for Yiddish). Most of the combinations of Hebrew with diacritics needed for Biblical and other historic/literary/educational Hebrew purposes do not have composed characters. So, there's still a need to render combining diacritics correctly, and not rely on normalization to solve this.
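To make the ordering point above concrete, here is a minimal sketch (not taken from python-bidi or fpdf2) that contrasts a naive character-by-character reversal with one that keeps each combining mark attached to its base letter. The helper names `clusters` and `reverse_keeping_marks` are hypothetical, and grouping by `unicodedata.combining` is only a rough stand-in for real grapheme-cluster segmentation.

```python
import unicodedata


def clusters(text):
    """Group each base character with the combining marks that follow it."""
    out = []
    for ch in text:
        # unicodedata.combining() returns a non-zero class for combining marks.
        if out and unicodedata.combining(ch):
            out[-1] += ch
        else:
            out.append(ch)
    return out


def reverse_keeping_marks(text):
    """Reverse the order of clusters, not of individual code points."""
    return "".join(reversed(clusters(text)))


logical = "\u05d1\u05bc\u05e8"  # BET, DAGESH, RESH in logical order

naive = logical[::-1]                      # DAGESH now follows RESH
clustered = reverse_keeping_marks(logical)  # DAGESH still follows BET
```

Whether a renderer wants the second form depends on how it positions marks, so treat this only as an illustration of the difference between the two orderings.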

In theory I'd love to contribute a fix to this but I'm not sure I have the time or knowledge; maybe someone can point me in the right direction? In particular, I wonder if this is an issue in FPDF2 itself, or with the font subsetting from fonttools? From what I can tell, the PDF doesn't contain the X and Y position of each diacritic explicitly; rather, it contains the string and the font, and logic in the embedded font provides the exact position within the string. Is that correct?

Here's my sample code. Thanks in advance for your help!

import os
import unicodedata

from fpdf import FPDF

from arabic_reshaper import reshape
from bidi.algorithm import get_display

def debug_string(s, desc):
    print(f"*** {desc} ***")
    for c in s:
        print(c, ord(c), unicodedata.name(c))

def fix_text(some_text):
    debug_string(some_text, "original")

    # Try fixes from discussion on https://github.com/PyFPDF/fpdf2/pull/490
    some_text = unicodedata.normalize('NFC', some_text)

    debug_string(some_text, "normalized (NFC)")

    some_text = get_display(reshape(some_text))

    debug_string(some_text, "reshaper and bidi algorithm fixed")

    return some_text

pdf = FPDF(unit="in", format="Letter")
pdf.add_font("SBL_Hbrw", fname="SBL_Hbrw.ttf")
pdf.set_font("SBL_Hbrw", "", 30)

pdf.add_page()

some_text = "בּ"

pdf.set_xy(1, 1)
pdf.cell(1, 4, some_text)

some_text = fix_text(some_text)

pdf.set_xy(1, 2)
pdf.cell(1, 4, some_text)

filename = "hebrew.pdf"
pdf.output(filename)
os.startfile(filename)  # windows only

Environment

Windows
Python version 3.10.5
fpdf2 version 2.5.7

Lucas-C commented 2 years ago

Hi @marcstober! Thank you for reporting this. Quick answer as I don't have much time for fpdf2 this evening: this may have been reported before, have you checked #540 and the issues mentioned in it?

gmischler commented 2 years ago

have you checked #540 and the issues mentioned in it?

Yes, I suspect that the Hebrew fonts handle this by means of the ligature mechanism, and this detailed presentation of the situation there makes me think so even more. So if we're all lucky, we'll eventually get it solved that way.

The bidi question is also interesting as it relates to the diacritics. Right now, we're simply expecting the user to supply the text pre-bidified. Once the ligature handling is in place, we'll have to figure out which of the two needs to happen first to give the right result. @marcstober, I hope you'll stick around a while, since it is going to take some time. But once we're there, we'll need various language experts to check that the result is actually correct...

marcstober commented 2 years ago

While #540 does seem related, I think this belongs as a separate issue. Both of them seem to be about using more advanced font features, but #540 is about better supporting ligatures and this issue is about better supporting combining diacritics. So, I don't think resolving one of these things will necessarily resolve the other, although it's possible they'll require similar changes to the code.

semaeostomea commented 2 years ago

I hope it's alright for me to chime in, since I wrote the temporary RTL script fix

I don't think this hurts, but I also don't think it affects Hebrew.

You're right. This fix is supposed to be used as a whole (both libs) when using any kind of RTL script. The reshaper doesn't do anything to non-Arabic scripts.

I don't think it's actually correct to reverse the order of combining diacritics

This is an issue with displaced diacritics in general (it affects not only Hebrew but also Diné Bizaad and Czech, for example), which, as far as I understand it, has the same underlying problem as the ligatures. In a way these characters are ligatures too, since the diacritics have separate code points and are combined with the base character into one. I think it could still get fixed by #540.

I actually mentioned the problem with the displaced diacritics in #490 and in the documentation too; you can see an example for both Hebrew and Diné Bizaad if you scroll down in #490. There are fonts with which the diacritics are less displaced or even placed correctly. I eventually found a font with which I could display Diné Bizaad without any issues, so maybe this could be a temporary fix for you until ligatures are supported. I didn't look for Hebrew fonts because I'm waiting on the ligature support, but you might find one on the Google Fonts website.

gmischler commented 2 years ago

but #540 is about better supporting ligatures and this issue is about better supporting combining diacritics.

When I talk about "ligatures" as a technical term, what I usually mean would be more accurately termed "glyph substitution". What in typography is considered a ligature is only one of the possible applications of that concept. There's no (technical) reason this couldn't be applied to diacritics as well, especially if they result in a typographical change of the base character.

But looking more closely, we may indeed be dealing with something else here. For the Hebrew dagesh, Wikipedia lists the following example combinations:

Combining characters:

bet + dagesh: בּ בּ = U+05D1 U+05BC
kaf + dagesh: כּ כּ = U+05DB U+05BC
pe + dagesh: פּ פּ = U+05E4 U+05BC

Precomposed characters:

bet with dagesh: בּ בּ = U+FB31
kaf with dagesh: כּ כּ = U+FB3B
pe with dagesh: פּ פּ = U+FB44

Strangely, the unicodedata module will split the latter into the former, no matter which normalization form I select:

>>> [ord(c) for c in unicodedata.normalize("NFC", "\ufb31")]
[1489, 1468]
>>> [ord(c) for c in unicodedata.normalize("NFKC", "\ufb31")]
[1489, 1468]
>>> [ord(c) for c in unicodedata.normalize("NFD", "\ufb31")]
[1489, 1468]
>>> [ord(c) for c in unicodedata.normalize("NFKD", "\ufb31")]
[1489, 1468]

But there doesn't seem to be a built-in way to convert back to the combined form.

This is very weird, and actually looks like a bug in the Python Unicode database. Or can anyone think of a valid reason for this behaviour? Is it documented somewhere?

On the positive side, a workaround could be very simple.

def test_hebrew():
    pdf = FPDF()
    pdf.add_page()
    pdf.add_font(family="Narkisim", style="",
                 fname="c:/windows/fonts/nrkis.ttf")
    pdf.set_font("Narkisim", "", 24)
    pdf.cell(txt="decomposed: \u05bc\u05d1")
    pdf.ln()
    pdf.cell(txt="composed: \ufb31")
    pdf.output("test_hebrew.pdf")

Resulting in this PDF output:

Following the pattern of arabic_reshaper, we could simply scan the text and replace all relevant sequences with the combined forms. Actually, I seem to vaguely remember that Hebrew also has positional shapes, is this correct? In any case, we could probably extend arabic_reshaper into a more generic reshaper module that handles other languages as well.

In the very short term, documenting this would be a first step. If one of the language experts could collect a list of all the sequences that need to be combined, that would be very helpful. In the longer term, we'll probably add this as another step in the text preprocessing outlined in #540, essentially adding our own substitution lookup to the one provided by the font data.
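The "scan and replace" idea could be sketched roughly as follows. This is not fpdf2 code: the names `build_recompose_table` and `recompose_hebrew` are hypothetical, and the sketch assumes that the relevant precomposed letters all sit in the Hebrew part of the Alphabetic Presentation Forms block (U+FB1D to U+FB4F). The table is derived from `unicodedata` by decomposing each presentation form, so no hand-written list is needed for the forms that exist.

```python
import unicodedata

# Assumption: Hebrew presentation forms occupy U+FB1D..U+FB4F.
HEBREW_PRESENTATION_FORMS = range(0xFB1D, 0xFB50)


def build_recompose_table():
    """Map each canonical decomposition back to its precomposed character."""
    table = {}
    for cp in HEBREW_PRESENTATION_FORMS:
        composed = chr(cp)
        decomposed = unicodedata.normalize("NFD", composed)
        # Characters without a canonical decomposition come back unchanged.
        if decomposed != composed:
            table[decomposed] = composed
    return table


RECOMPOSE_TABLE = build_recompose_table()


def recompose_hebrew(text):
    """Replace decomposed sequences with precomposed forms where one exists."""
    # Caveat: NFD also decomposes non-Hebrew precomposed letters in the input.
    text = unicodedata.normalize("NFD", text)
    # Longest sequences first, so letter + two marks wins over letter + one mark.
    for seq in sorted(RECOMPOSE_TABLE, key=len, reverse=True):
        text = text.replace(seq, RECOMPOSE_TABLE[seq])
    return text
```

A call such as `recompose_hebrew("\u05d1\u05bc")` yields the precomposed BET WITH DAGESH. The approach only covers combinations that have a precomposed code point: a BET carrying both SHEVA and DAGESH is left decomposed, because canonical ordering places the SHEVA between the letter and the DAGESH.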

gmischler commented 2 years ago

Ok, found https://www.unicode.org/reports/tr15/#Primary_Exclusion_List_Table, which explains that certain characters are excluded from recombining for "stability" reasons. Apparently it is more important that the standard never changes than getting the correct output.

That makes it official that every software package needs to recombine those diacritics on its own. Now we just need a list of which those are...
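The effect of the exclusion list can be observed from Python with a small check. The helper name `survives_nfc` is made up for this illustration; it only asks whether a character is still the same after NFC normalization.

```python
import unicodedata


def survives_nfc(ch):
    """Return True if NFC normalization leaves the character unchanged."""
    return unicodedata.normalize("NFC", ch) == ch


latin_e_acute = "\u00e9"      # LATIN SMALL LETTER E WITH ACUTE
hebrew_bet_dagesh = "\ufb31"  # HEBREW LETTER BET WITH DAGESH

latin_ok = survives_nfc(latin_e_acute)       # composed form is kept
hebrew_ok = survives_nfc(hebrew_bet_dagesh)  # split into BET + DAGESH, not recomposed
```

The Latin letter keeps its composed form, while the Hebrew presentation form comes back as two code points, which matches the interpreter session shown earlier in this thread.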

marcstober commented 2 years ago

Hi @gmischler, thanks for looking into this. What I gave is not a great example because it's too trivial. It happens to have a composed form, which is part of that exclusion list. Hebrew letters can have multiple diacritics and there are only composed forms for a small portion of the many possible combinations. I don't believe that well-designed Hebrew fonts use glyph substitution, but rather that there is positioning logic in the font that is getting lost when the font is subsetted.

Looking at the code some more, I wonder if it's because the GPOS table is getting dropped here - I may try commenting that out in a fork of the code and see what happens: https://github.com/PyFPDF/fpdf2/blob/master/fpdf/fpdf.py#L4160

The font I used in my example, SBL Hebrew, was designed for academic purposes where diacritics are very important. There is a whole PDF manual on that site that explains how to use it, with examples of advanced use of diacritics and discussion of normalization and other technical font issues, so the correct logic should already be in the font, but somehow FPDF isn't using it.

@semaeostomea, I did see the Hebrew issue mentioned in #490 and that actually did help with RTL text in general, but the misplaced diacritics seemed to need a separate open issue.

Here's a more complex example, you can see that the fix from #490 does put the base characters (consonants) in the correct right-to-left order:

But here's what it should look like (and what it looks like in Word and PowerPoint):

By positional shapes, I think you might mean the 5 "final letters." My understanding is that people generally don't expect the software to handle that - they have their own keys on the standard Hebrew keyboard layouts and people just type them.

import os
import unicodedata

from fpdf import FPDF

from arabic_reshaper import reshape
from bidi.algorithm import get_display

def debug_string(s, desc):
    print(f"*** {desc} ***")
    for c in s:
        print(c, ord(c), unicodedata.name(c))

def fix_text(some_text):
    debug_string(some_text, "original")

    # Try fixes from discussion on https://github.com/PyFPDF/fpdf2/pull/490
    some_text = unicodedata.normalize('NFC', some_text)

    debug_string(some_text, "normalized (NFC)")

    some_text = get_display(reshape(some_text))

    debug_string(some_text, "reshaper and bidi algorithm fixed")

    return some_text

pdf = FPDF(unit="in", format="Letter")
pdf.add_font("SBL_Hbrw", fname="SBL_Hbrw.ttf")
pdf.set_font("SBL_Hbrw", "", 30)

pdf.add_page()

some_text = "בְּרֵאשִׁ֖ית"

pdf.set_xy(1, 1)
pdf.cell(1, 4, "No fix: " + some_text)

some_text = fix_text(some_text)

pdf.set_xy(1, 1.75)
pdf.cell(1, 4, "Fix from #490: " + some_text)

filename = "hebrew.pdf"
pdf.output(filename)
os.startfile(filename)  # windows only

gmischler commented 2 years ago

What I gave is not a great example because it's too trivial. It happens to have a composed form, which is part of that exclusion list.

It is still a helpful example for solving one part of the puzzle. In fact, those combined glyphs that do exist will serve as a good testing ground once we are getting a general substitution mechanism in place. That will serve as an intermediate step before we actually look at the "gsub" table. I hope @Redshy (who did the fontTools transition) will find enough time to participate in this soon.

Looking at the code some more, I wonder if it's because the GPOS table is getting dropped here - I may try commenting that out in a fork of the code and see what happens:

I'd be surprised if that works, but definitely not unhappy... My (unverified) expectation is that we need to take this information into account when actually placing the glyphs. We may have to take it into account in either case though, in order to correctly determine the width of a string.

Looking at the GPOS specs, it will take quite some thought and a lot of experimenting to get this right. But it will also help with Thai (#459) and other languages.

The font I used in my example, SBL Hebrew, was designed for academic purposes where diacritics are very important. There is a whole PDF manual on that site that explains how to use it, with examples of advanced use of diacritics and discussion of normalization and other technical font issues, so the correct logic should already be in the font, but somehow FPDF isn't using it.

That looks very helpful!

By positional shapes, I think you might mean the 5 "final letters." My understanding is that people generally don't expect the software to handle that - they have their own keys on the standard Hebrew keyboard layouts and people just type them.

If it can be done with Arabic, then we can do it the same way with other languages. If the input text already has the final form, then nothing will happen to it. Maybe an option to turn the feature on and off for individual languages would be useful if someone actually wants to print the unchanged form. We'll probably have something like a .set_fontopts() method at some point for doing things like that.

Lucas-C commented 1 year ago

@andersonhc PR https://github.com/PyFPDF/fpdf2/pull/820 has been merged today.

Could you test if that solved your issue @marcstober?

You can install fpdf2 directly from the master branch of this repo with this command:

pip install git+https://github.com/PyFPDF/fpdf2.git@master

The documentation is there: https://pyfpdf.github.io/fpdf2/TextShaping.html

marcstober commented 1 year ago

Thanks for keeping the work on this issue going. Unfortunately, it still doesn't seem to be fixed for Hebrew diacritics. I'll try to take a closer look and see if I can be of any help. I'm still getting the same results as in https://github.com/py-pdf/fpdf2/issues/549#issuecomment-1253132013

andersonhc commented 1 year ago

Thanks for keeping the work on this issue going. Unfortunately, it still doesn't seem to be fixed for Hebrew diacritics. I'll try to take a closer look and see if I can be of any help. I'm still getting the same results as in #549 (comment)

Can you test with the new fpdf2 version:

import os
import unicodedata

from fpdf import FPDF

pdf = FPDF(unit="in", format="Letter")
pdf.add_font("SBL_Hbrw", fname="SBL_Hbrw.ttf")
pdf.set_font("SBL_Hbrw", "", 30)

pdf.add_page()
pdf.set_text_shaping(True)
some_text = "בְּרֵאשִׁ֖ית"

pdf.set_xy(1, 1)
pdf.cell(1, 4, some_text)

filename = "hebrew.pdf"
pdf.output(filename)
os.startfile(filename)  # windows only

marcstober commented 1 year ago

It works! With a caveat. It doesn't seem to work if a paragraph (i.e., a cell) has a combination of English and Hebrew (or probably any mix of LTR and RTL scripts). This is fairly common with Hebrew. The HarfBuzz documentation even says that this is something HarfBuzz itself doesn't do. Should this go into a separate issue?

I tried using the get_display method discussed in #490 but that didn't fix the mixed Hebrew/English paragraphs, and it broke the Hebrew-only paragraphs.

gmischler commented 1 year ago

I tried using the get_display method discussed in #490 but that didn't fix the mixed Hebrew/English paragraphs, and it broke the Hebrew-only paragraphs.

I think #820 does the get_display() internally now, and it probably doesn't work correctly when called twice. It also seems designed to only process one language at a time, so your results are not really surprising.

There are two general ways to solve this: a) Feed in the text one language at a time. b) automatically do an analysis of the Unicode code points received and split the string into language/script specific chunks. Both require a way to format paragraphs containing different fonts, which is not currently available.

But this does give me another opportunity to plug #339. :wink: Yes, I know that one has been "coming soon" for a long time now, but I'm actively working at it again at the moment, so it's not just vaporware... If I don't botch it up, it should help to resolve many of the higher-level issues we currently have in the pipeline (multi-language paragraphs, format changes within table cells, HTML formatting, etc.).

Btw. @andersonhc, have you done any tests on how the text shaping functionality interacts with special formatting options like set_stretching(), set_char_spacing(), etc.? I haven't checked myself, but there's a theoretical chance that those might rip apart composed ligatures and stacked accents. I really hope that PDF readers (and the specs) are smart enough to avoid that!

(Thinking about it, I'm really not sure what set_char_spacing() would do to e.g. Devanagari in general, or if it makes any kind of sense to use the two together.)

andersonhc commented 1 year ago

Btw. @andersonhc, have you done any tests on how the text shaping functionality interacts with special formatting options like set_stretching(), set_char_spacing(), etc.? I haven't checked myself, but there's a theoretical chance that those might rip apart composed ligatures and stacked accents. I really hope that PDF readers (and the specs) are smart enough to avoid that!

I did some tests: stretching tends to work pretty well, but char spacing creates minor glitches in some cases.

andersonhc commented 1 year ago

It works! With a caveat. It doesn't seem to work if a paragraph (i.e., a cell) has a combination of English and Hebrew (or probably any mix of LTR and RTL scripts). This is fairly common with Hebrew. The HarfBuzz documentation even says that this is something HarfBuzz itself doesn't do. Should this go into a separate issue?

You can open an issue. I believe we can fix it by implementing part of the BIDI algorithm in fpdf2 and breaking the text at each direction change into different "Fragments", so that we pass them separately to HarfBuzz.
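As a rough illustration of breaking text at direction changes, the sketch below splits a string into left-to-right and right-to-left runs using the bidirectional class of each character. It is a deliberate simplification and not the Unicode Bidirectional Algorithm, nor what fpdf2 does: the function name `split_by_direction` is hypothetical, and neutral characters (spaces, punctuation, digits, combining marks) are simply attached to the run that precedes them.

```python
import unicodedata


def split_by_direction(text):
    """Split text into (direction, chunk) runs, with direction "L" or "R"."""
    runs = []  # each entry is [direction, chunk]
    for ch in text:
        bidi_class = unicodedata.bidirectional(ch)
        if bidi_class in ("R", "AL"):
            direction = "R"
        elif bidi_class == "L":
            direction = "L"
        else:
            # Simplification: neutrals and marks inherit the previous run's
            # direction, or default to LTR at the start of the text.
            direction = runs[-1][0] if runs else "L"
        if runs and runs[-1][0] == direction:
            runs[-1][1] += ch
        else:
            runs.append([direction, ch])
    return [(direction, chunk) for direction, chunk in runs]


mixed = "abc \u05d1\u05bc\u05e8 def"  # English, then BET+DAGESH and RESH, then English
fragments = split_by_direction(mixed)
```

Each resulting chunk could then be shaped on its own. A real implementation would also need to resolve neutrals and embedding levels as the full algorithm prescribes.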

marcstober commented 1 year ago

Yes. Thanks for your work on this and for opening #882. I'm OK with closing this issue now and leaving #882 open. Breaking text into fragments makes sense. (Would that same fragments logic help support shaped text with multiple fonts and styles, too?)

get_display() actually still works to put all the "consonants" in the correct order, but the vowels/diacritics still don't show up correctly. In any case, #882 seems like the place to address that.

I also discovered some cases when using text shaping was resulting in no text output or question-mark-in-box characters. My guess is that this is an edge case bug involving font subsetting. I'll open a separate issue.

Lucas-C commented 1 year ago

Thank you for the feedback @marcstober!

Closing this now