mgmeyers / pdfannots2json

GNU Affero General Public License v3.0

UTF Encoding Fix for Annotation Comments #21

Open theotheo opened 11 months ago

Hello,

I encountered an issue when using Cyrillic characters in annotations: during export, the text was transformed into an unreadable set of characters. My knowledge of PDF is not sufficient to confidently pinpoint the exact cause of the problem. However, some code experiments helped me find a simple solution that appears to resolve the issue.

I will illustrate the problem with a specially created PDF file with annotations: Example.pdf. For clarity, I am also providing a screenshot. It shows a PDF with 2 lines of text and 2 annotations in which Latin characters are combined with Cyrillic.

The export of this document looks as follows:

[
    {
        "annotatedText": "text",
        "color": "#ffff00",
        "colorCategory": "Yellow",
        "comment": "This is :\u003e\u003c\u003c5=B0@89",
        "date": "2023-10-23T20:33:08+03:00",
        "id": "highlight-p1x90y719",
        "page": 1,
        "pageLabel": "1",
        "type": "highlight",
        "x": 90.67,
        "y": 719.27
    },
    {
        "annotatedText": "текст",
        "color": "#00ff00",
        "colorCategory": "Green",
        "comment": "-B\u003e a comment",
        "date": "2023-10-23T20:33:35+03:00",
        "id": "highlight-p1x90y696",
        "page": 1,
        "pageLabel": "1",
        "type": "highlight",
        "x": 90.67,
        "y": 696.66
    }
]

As you can see, the Cyrillic parts of the "comment" fields come out as unreadable characters (this looks like a Unicode encoding problem).
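The garbled strings follow a recognizable pattern. Once the JSON escapes are resolved (`\u003e` is `>` and `\u003c` is `<`), each Cyrillic letter has been replaced by the ASCII character whose code equals the low byte of that letter's UTF-16 code unit. The sketch below reproduces both broken comments under that assumption. It is an observation about the output, not a confirmed diagnosis of the exporter's code, and the helper name `low_bytes_only` is made up for illustration.

```python
import json


def low_bytes_only(text: str) -> str:
    """Hypothetical helper: encode as UTF-16BE, keep only the low byte of each unit."""
    raw = text.encode("utf-16-be")
    # In big-endian order the low byte of every 16-bit unit sits at an odd index.
    return "".join(chr(b) for b in raw[1::2])


# The "comment" values exactly as they appear in the broken export.
broken_first = json.loads('"This is :\\u003e\\u003c\\u003c5=B0@89"')
broken_second = json.loads('"-B\\u003e a comment"')

print(low_bytes_only("This is комментарий") == broken_first)
print(low_bytes_only("Это a comment") == broken_second)
```

The Latin parts survive because their high byte is zero, so dropping it loses nothing. The Cyrillic letters all share the high byte 0x04, which is why they collapse into ordinary punctuation, digits and letters.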

With my change applied, the export contains correctly encoded characters:

[
    {
        "annotatedText": "text",
        "color": "#ffff00",
        "colorCategory": "Yellow",
        "comment": "This is комментарий",
        "date": "2023-10-23T20:33:08+03:00",
        "id": "highlight-p1x90y719",
        "page": 1,
        "pageLabel": "1",
        "type": "highlight",
        "x": 90.67,
        "y": 719.27
    },
    {
        "annotatedText": "текст",
        "color": "#00ff00",
        "colorCategory": "Green",
        "comment": "Это a comment",
        "date": "2023-10-23T20:33:35+03:00",
        "id": "highlight-p1x90y696",
        "page": 1,
        "pageLabel": "1",
        "type": "highlight",
        "x": 90.67,
        "y": 696.66
    }
]

This resolves the issue with unreadable characters in the comments.
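For readers unfamiliar with how PDF stores such strings: a PDF text string is either in PDFDocEncoding or in UTF-16BE with a leading byte order mark (the bytes FE FF). The following is a minimal sketch of decoding on that basis. It is not the change proposed here and not this project's code. The function name `decode_pdf_text_string` is hypothetical, and Latin-1 is used only as a rough stand-in for PDFDocEncoding, which differs from it for some byte values.

```python
import codecs


def decode_pdf_text_string(raw: bytes) -> str:
    """Hypothetical sketch: decode the raw bytes of a PDF text string."""
    if raw.startswith(codecs.BOM_UTF16_BE):
        # A leading FE FF marks the rest of the string as UTF-16BE.
        return raw[len(codecs.BOM_UTF16_BE):].decode("utf-16-be")
    # Assumption: Latin-1 approximates PDFDocEncoding for plain Western text.
    return raw.decode("latin-1")


# Bytes as an annotation tool might store a mixed Latin/Cyrillic comment.
stored = codecs.BOM_UTF16_BE + "This is комментарий".encode("utf-16-be")
print(decode_pdf_text_string(stored))
print(decode_pdf_text_string(b"plain ASCII comment"))
```

Checking for the byte order mark before decoding keeps ASCII-only comments working unchanged while restoring comments that contain non-Latin characters.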

P.S. This may not be crucial, but I'd like to mention that I'm using this project through your Obsidian-Zotero-Integrator. I should also note that I use Okular for annotation.