modesty / pdf2json

converts binary PDF to JSON and text, for server-side PDF processing and command-line use.
https://github.com/modesty/pdf2json

Problem parsing PDF document, text not recognized. #152

Open theoklpd opened 6 years ago

theoklpd commented 6 years ago

url: http://www.delindeschemolen.nl/PDF%20bestanden/DOC082_2014.pdf

While trying out the pdf2json library, I found that the PDF document at the URL above was not parsed correctly. No text was found in the document.

wanghaisheng commented 6 years ago

Please install poppler-utils and use the pdfinfo tool to inspect the original PDF file.
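
For anyone who wants to script this check, here is a minimal Python sketch. It assumes `pdfinfo` is on the PATH and that its output is the usual list of `Key: value` lines. The helper names `parse_pdfinfo` and `run_pdfinfo` are made up for illustration, and the sample text is invented, not the output for the file in this issue.

```python
import subprocess


def parse_pdfinfo(output: str) -> dict[str, str]:
    """Turn pdfinfo's 'Key: value' lines into a dictionary."""
    info = {}
    for line in output.splitlines():
        # Split on the first colon only, so values such as timestamps stay intact.
        key, sep, value = line.partition(":")
        if sep:
            info[key.strip()] = value.strip()
    return info


def run_pdfinfo(path: str) -> dict[str, str]:
    """Run pdfinfo on a file (assumes poppler-utils is installed)."""
    result = subprocess.run(
        ["pdfinfo", path], capture_output=True, text=True, check=True
    )
    return parse_pdfinfo(result.stdout)


# Invented sample in the general shape of pdfinfo output, for illustration only.
SAMPLE_PDFINFO = """\
Title:          Example
Producer:       Example Producer
CreationDate:   Mon Jan  6 10:15:00 2014
Pages:          2
Encrypted:      no
PDF version:    1.4
"""

sample_info = parse_pdfinfo(SAMPLE_PDFINFO)
```

Fields such as `Encrypted` and `PDF version` are usually the first things worth looking at when a parser returns no text.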

theoklpd commented 6 years ago

Thanks for your reply. I ran Poppler's pdfinfo.exe (on Windows 10) on the PDF document DOC082_2014.pdf and got the following output: DOC082_2014_Poppler_Info_output.txt

So apparently the PDF is valid. Also Poppler's pdftotext.exe extracts the text elements from the PDF.

Therefore my question is: why does the pdf2json library not recognize the text elements in this PDF?

theoklpd commented 6 years ago

Sorry, closed by mistake.

wanghaisheng commented 6 years ago

Try pdffonts to see more info.
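
A rough Python sketch of how the pdffonts table could be read programmatically. It assumes the usual column layout (name, type, encoding, emb, sub, uni, object ID) and parses each row from the right, because font type names can contain spaces. `FontRow` and `parse_pdffonts` are hypothetical names, and the sample rows are invented, not taken from the file in this issue. A font with `uni` set to `no` has no ToUnicode map, which is one common reason text extraction comes back empty or garbled; whether that is the cause here is not established.

```python
import subprocess
from typing import NamedTuple


class FontRow(NamedTuple):
    description: str   # name, type and encoding columns, left unsplit
    embedded: bool     # "emb" column
    subset: bool       # "sub" column
    to_unicode: bool   # "uni" column


def parse_pdffonts(output: str) -> list[FontRow]:
    """Parse pdffonts output, reading the fixed columns from the right."""
    rows = []
    for line in output.splitlines():
        parts = line.rsplit(None, 5)
        if len(parts) != 6:
            continue
        description, emb, sub, uni, _obj, _gen = parts
        # The header and dashed separator lines do not carry yes/no flags.
        if emb not in {"yes", "no"}:
            continue
        rows.append(
            FontRow(description.strip(), emb == "yes", sub == "yes", uni == "yes")
        )
    return rows


def run_pdffonts(path: str) -> list[FontRow]:
    """Run pdffonts on a file (assumes poppler-utils is installed)."""
    result = subprocess.run(
        ["pdffonts", path], capture_output=True, text=True, check=True
    )
    return parse_pdffonts(result.stdout)


# Invented sample rows, for illustration only.
SAMPLE_PDFFONTS = """\
name                                 type              encoding         emb sub uni object ID
------------------------------------ ----------------- ---------------- --- --- --- ---------
ABCDEF+Arial                         CID TrueType      Identity-H       yes yes yes     12  0
Helvetica                            Type 1            Custom           no  no  no      15  0
"""

sample_fonts = parse_pdffonts(SAMPLE_PDFFONTS)
fonts_without_unicode_map = [f for f in sample_fonts if not f.to_unicode]
```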

theoklpd commented 6 years ago

Okay, more info from pdffonts coming up: DOC082_2014_Poppler_Font_output.txt

wanghaisheng commented 6 years ago

I meant to try your PDF against the latest pdf.js. Instead I chose https://github.com/SyslogicNL/pdf-extractor.git, a wrapper around pdf.js, and it seems your input file is fine: all the text can be extracted.

theoklpd commented 6 years ago

Please note that pdf-extractor, as far as I can see, uses a different version of pdf.js than pdf2json! But maybe you are already aware of that.

wanghaisheng commented 6 years ago

Yes, pdf2json uses a very old version of pdf.js.