Open putzwasser opened 6 months ago
the text that gets copied into the annotation (annotation.comment) gets crippled.
What does "crippled" mean in this context?
By crippled I mean that text is swallowed/removed. See the first screenshot:
annotatedText is correct:
This article identifies the demand
comment is incorrect:
This article identiesthe demand
➡️ the fi in identifies is swallowed.
My best guess: comment ignores or removes ligatures, that is, letters that are joined to form a single glyph.
Instead of the two characters f and i, the text contains ﬁ, which is Unicode char U+FB01 and only a single char (try to select only the f or the i in ﬁ; it won't work).
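To illustrate the ligature guess, here is a minimal Python sketch. It only shows that U+FB01 is one code point and that Unicode compatibility normalization (NFKC) expands it into the two letters f and i. The helper name expand_ligatures is made up for this example, and nothing here is taken from the plugin's code; whether the import path normalizes text at all is an assumption.

```python
import unicodedata

LIGATURE_FI = "\ufb01"  # LATIN SMALL LIGATURE FI, a single code point


def expand_ligatures(text: str) -> str:
    """Hypothetical helper: expand compatibility characters such as ligatures."""
    return unicodedata.normalize("NFKC", text)


word = "identi" + LIGATURE_FI + "es"

print(len(LIGATURE_FI))               # one character, not two
print(unicodedata.name(LIGATURE_FI))  # official Unicode name of the glyph
print(expand_ligatures(word))         # ligature replaced by plain "f" + "i"
```

If something like this ran before any ASCII-only step, the fi would survive as two plain letters instead of disappearing.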
Upon further testing, it seems that comment ignores any non-ASCII characters. Chars like äöü or é, è, ê or ë get stripped from the text, too.
Add this text as a comment/annotation/note to a PDF:
charÄÜÖüäöactéèêëers
Import the annotations. It will read characters, because the non-ASCII chars got stripped.
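The symptom in the steps above looks like what an ASCII-only filter would produce. The following Python sketch reproduces the same output under that assumption. The function strip_non_ascii is hypothetical and only mimics the observed behavior; it is not the plugin's actual code, and the real cause may be different.

```python
import unicodedata


def strip_non_ascii(text: str) -> str:
    """Hypothetical filter: silently drop every character outside ASCII."""
    return text.encode("ascii", errors="ignore").decode("ascii")


# NFC makes sure the umlauts and accents are single precomposed code points,
# so they are dropped whole instead of leaving a bare base letter behind.
sample = unicodedata.normalize("NFC", "charÄÜÖüäöactéèêëers")
ligature_sample = "identi\ufb01es"  # contains the U+FB01 ligature

print(strip_non_ascii(sample))           # characters
print(strip_non_ascii(ligature_sample))  # identies
```

Both the umlaut case and the ligature case lose exactly the characters that are missing in comment, which is why a single ASCII-only step somewhere in the import path seems like a plausible explanation for both.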
Use this PDF with the described cases (ligature and non-ASCII chars) for testing:
annotation.comment swallows text if it contains line breaks:
The PDF annotation
The text in the PDF is fine. So is the text that gets copied into the annotation. So, this is not an OCR problem. (The PDF isn't OCR'ed anyway.)
Data Explorer Output
The data explorer output shows that the text that was highlighted gets picked up correctly (annotation.annotatedText). For some reason the text that gets copied into the annotation (annotation.comment) gets crippled. This happens around the newline.
Another problem is word breaks: annotation.comment doesn't replace them properly, while annotation.annotatedText does.
Expected Behavior
Non-crippled text.