mgmeyers / pdfannots2json

GNU Affero General Public License v3.0

Bug: extraction in certain PDFs all jumbled up #11

Open chrisgrieser opened 2 years ago

chrisgrieser commented 2 years ago

I noticed that with certain PDFs, the extracted content is all jumbled up, with all kinds of issues.

The resulting JSON naturally causes problems that prevent my Alfred workflow from extracting the annotations properly.

I checked: the extraction works fine with other PDFs, and the affected PDFs work fine with a different method of PDF annotation extraction, for example the built-in features of Highlights or PDF Expert.


Here is a sample of one of the PDFs in question, together with the output I am getting via Highlights and the JSON generated by pdfannots2json.

mgmeyers commented 2 years ago

@chrisgrieser I'll take a look. Text extraction seems to work better with some PDFs than others and I haven't been able to figure out why.

chrisgrieser commented 2 years ago

I have another case where the extraction is all wrong (e.g. all spaces missing), even though it works with other PDF annotation extractors. Do you want me to add samples of PDFs with issues?

mgmeyers commented 2 years ago

@chrisgrieser I was able to improve the accuracy of extraction quite a bit, and fix issues with spaces getting condensed. Can you update and test it out?

chrisgrieser commented 2 years ago

Seems to be a bit better, but not for all PDFs. I'll report back once I've tried it on more PDFs!

mgmeyers commented 2 years ago

@chrisgrieser I've improved accuracy even more. Be sure to update to 1.0.9 when testing things.

tim-hilde commented 1 year ago

I've just come across the same issue where the extracted text is missing all spaces:

sample.pdf

tim-hilde commented 1 year ago

@mgmeyers this one's especially weird, as it extracts completely different text: gallieEssentiallyContestedConcepts1956.pdf

tim-hilde commented 1 year ago

To add to this: mingers.walshamEthicalInformationSystems2010.pdf. This one is also completely weird. I would love some form of fix, as this very much hinders me from using the Zotero Integration plugin.

GregoryBridgett commented 1 year ago

I've also noticed quite a few examples where I get missing spaces or duplicated fragments of words. Could it be due to the watermark that's placed on the PDFs by the closed-source library?

I'd be happy to provide examples and to test code to help debug.

tim-hilde commented 1 year ago

Had a new case of this just now. In this case, even the rectangle extraction was cropped incorrectly.

jhs1965net commented 1 year ago

In the latest version of OS X, the hotkey doesn't work.