mgmeyers / obsidian-zotero-integration

Insert and import citations, bibliographies, notes, and PDF annotations from Zotero into Obsidian.
GNU General Public License v3.0

`annotation.comment` swallows text if it contains line breaks #367

Open putzwasser opened 1 month ago

putzwasser commented 1 month ago

annotation.comment swallows text if it contains line breaks:

The PDF annotation

image

The text in the PDF is fine. So is the text that gets copied into the annotation. So, this is not an OCR problem. (The PDF isn't OCR'ed anyway.)

Data Explorer Output

image

The data explorer output shows that the text that was highlighted gets picked up correctly (annotation.annotatedText). For some reason the text that gets copied into the annotation (annotation.comment) gets crippled.

This happens around the newline.

Another problem is word breaks:

image

Expected Behavior

Non-crippled text.

FeralFlora commented 4 weeks ago

> the text that gets copied into the annotation (annotation.comment) gets crippled.

What does "crippled" mean in this context?

putzwasser commented 4 weeks ago

By crippled I mean that text is swallowed/removed. See the first screenshot:

annotatedText is correct:

This article identifies the demand

comment is incorrect:

This article identiesthe demand

➡️ the fi in identifies is swallowed.

My best guess: comment ignores or removes ligatures, that is, letters that are joined to form a single glyph.

Instead of the two characters f and i, the text contains ﬁ, which is Unicode character U+FB01 and only a single character (try to select only the f or the i in it; it won't work).
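
To illustrate the ligature guess, here is a minimal Python sketch. It is not code from the plugin or from Zotero; the helper name `expand_ligatures` is hypothetical. It shows that U+FB01 is a single code point and that Unicode NFKC normalization expands it to the two letters f and i.

```python
import unicodedata

LIGATURE_FI = "\ufb01"  # U+FB01, LATIN SMALL LIGATURE FI (one code point)


def expand_ligatures(text: str) -> str:
    """Hypothetical helper: NFKC normalization maps U+FB01 to 'f' + 'i'."""
    return unicodedata.normalize("NFKC", text)


word = "identi" + LIGATURE_FI + "es"

print(len(LIGATURE_FI))               # 1
print(unicodedata.name(LIGATURE_FI))  # LATIN SMALL LIGATURE FI
print(expand_ligatures(word))         # identifies
```

Note that NFKC also rewrites other compatibility characters (for example superscript digits), so it may change more than ligatures; whether that is acceptable here is an open question.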

putzwasser commented 4 weeks ago

Upon further testing, it seems that comment ignores any non-ASCII characters. Characters like ä, ö, ü or é, è, ê, ë get stripped from the text, too.

Add this text as comment/annotation/note to a PDF

charÄÜÖüäöactéèêëers

Import the annotations. The imported comment will read "characters", because the non-ASCII characters got stripped.
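
The symptom can be simulated with a short Python sketch. This is only an assumption about what happens somewhere in the pipeline, not the plugin's actual code; `strip_non_ascii` is a hypothetical name. Dropping every non-ASCII character from the test string yields exactly the observed result.

```python
def strip_non_ascii(text: str) -> str:
    """Hypothetical stand-in for the suspected behavior: drop non-ASCII chars."""
    return text.encode("ascii", errors="ignore").decode("ascii")


sample = "charÄÜÖüäöactéèêëers"

print(strip_non_ascii(sample))            # characters
print(strip_non_ascii("identi\ufb01es"))  # identies
```

The same stripping would also explain the lost ligature in the first report. It does not account for the missing space before "the" in that example.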

Use this PDF with the described cases (ligature and non-ASCII characters) for testing:

test.pdf