`annotation.comment` swallows text if it contains line breaks

mgmeyers / obsidian-zotero-integration

Insert and import citations, bibliographies, notes, and PDF annotations from Zotero into Obsidian.

GNU General Public License v3.0


`annotation.comment` swallows text if it contains line breaks #367

Open putzwasser opened 6 months ago

putzwasser commented 6 months ago

annotation.comment swallows text if it contains line breaks:

The PDF annotation

The text in the PDF is fine, and so is the text that gets copied into the annotation, so this is not an OCR problem. (The PDF isn't OCR'ed anyway.)

Data Explorer Output

The Data Explorer output shows that the highlighted text is picked up correctly (`annotation.annotatedText`). For some reason, the text that gets copied into the annotation (`annotation.comment`) gets crippled.

This happens around the newline.

Another problem is word breaks:

annotation.comment doesn't replace them properly
annotation.annotatedText does:

Expected Behavior

Non-crippled text.

FeralFlora commented 5 months ago

the text that gets copied into the annotation (annotation.comment) gets crippled.

What does "crippled" mean in this context?

putzwasser commented 5 months ago

By "crippled" I mean that text is swallowed/removed. See the first screenshot:

annotatedText is correct:

This article identifies the demand

comment is incorrect:

This article identiesthe demand

➡️ the fi in identifies is swallowed.

My best guess: comment ignores or removes ligatures, that is, letters that are joined to form a single glyph.

Instead of the two characters f and i, the text contains ﬁ, which is the single Unicode character U+FB01 (try to select only the f or the i in ﬁ; it won't work).
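To illustrate the ligature point, here is a minimal Python sketch. It is independent of the plugin's code and only demonstrates the Unicode behavior: U+FB01 is one code point, and NFKC compatibility normalization expands it to the two letters f and i. The helper name `expand_ligatures` is made up for this example.

```python
import unicodedata

LIGATURE_FI = "\ufb01"  # the single character "ﬁ"


def expand_ligatures(text: str) -> str:
    """Expand compatibility characters such as ligatures via NFKC.

    Hypothetical helper for illustration only; NFKC also rewrites other
    compatibility characters (for example superscript digits).
    """
    return unicodedata.normalize("NFKC", text)


print(len(LIGATURE_FI))               # 1
print(unicodedata.name(LIGATURE_FI))  # LATIN SMALL LIGATURE FI
print(expand_ligatures("This article identi\ufb01es the demand"))
# This article identifies the demand
```

Whether `annotatedText` goes through a normalization step like this is only a guess; the sketch just shows that the ligature can be expanded instead of dropped.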

putzwasser commented 5 months ago

Upon further testing, it seems that comment ignores any non-ASCII characters. Characters like ä, ö, ü or é, è, ê, ë get stripped from the text, too.

Add this text as comment/annotation/note to a PDF

charÄÜÖüäöactéèêëers

Import the annotations. The imported comment will read "characters", because all the non-ASCII characters got stripped.
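The symptom can be modeled with a short Python sketch that drops every non-ASCII character. This is an assumption about what the observed behavior looks like from the outside, not the plugin's actual code; the function name `strip_non_ascii` is invented for the example.

```python
def strip_non_ascii(text: str) -> str:
    """Drop every character outside the ASCII range.

    Hypothetical model of the observed symptom, not the plugin's code.
    """
    return text.encode("ascii", errors="ignore").decode("ascii")


print(strip_non_ascii("charÄÜÖüäöactéèêëers"))
# characters

print(strip_non_ascii("This article identi\ufb01es the demand"))
# This article identies the demand
```

Both results match the stripped characters reported above. The reported comment text is also missing the space before "the", which this sketch does not reproduce.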

Use this PDF, which contains the described cases (ligature and non-ASCII characters), for testing:

test.pdf