metachris / pdfx

Extract text, metadata and references (pdf, url, doi, arxiv) from PDF. Optionally download all referenced PDFs.
http://www.metachris.com/pdfx
Apache License 2.0

URLs truncated at line endings #21

Open bitsgalore opened 7 years ago

bitsgalore commented 7 years ago

First of all: great tool! I did however come across a problem with URLs that span more than one line. I've attached a PDF that reproduces the problem here:

testpdfx.pdf

Command:

pdfx -v testpdfx.pdf -o testpdfx.txt

The URL in the footnote is extracted as:

http://jpylyzer.openpreservation.org//2016/01/06/Release-of-

Whereas this should be:

http://jpylyzer.openpreservation.org//2016/01/06/Release-of-jpylyzer-1-17-0

I used pdfx version 1.3.1 on Linux Mint.

aberja commented 7 years ago

Hi, I'm not sure if you are still working on this code, but in case you are, I wanted to let you know that I see the same issue in pdfx v1.3.1 that bitsgalore reported above.

Doubledimas commented 7 years ago

I would love to see a solution to this issue. It is one of two problems stopping me from using pdfx for my academic research.

markratledge commented 6 years ago

I see the same issue: URLs reported as good or 404 are truncated at 20 characters when using the command: pdfx testpdfx.pdf -c

sscirrus commented 5 years ago

Same issue here. Lots of URLs are ignored or treated as invalid because they cover multiple lines in a PDF (especially when the lines are narrow). Please fix - this is a critical issue preventing me from using pdfx!
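
Until this is fixed, one possible workaround is to rejoin wrapped URLs in already-extracted plain text before searching it for links. The sketch below is not how pdfx works internally and does not call pdfx at all. It assumes you already have the PDF text as a string, and that a URL ending a line in a character such as "-" or "/" continues on the next line. The function name rejoin_wrapped_urls and the choice of wrap characters are hypothetical, picked for illustration only.

```python
import re

# A URL that runs to the end of a line and ends in a character that
# commonly precedes a line wrap. The set of characters is an assumption.
WRAPPED_URL = re.compile(r"(https?://\S*[-/_=&?])[ \t]*\n[ \t]*(?=\S)")

# Plain URL matcher used after the text has been repaired.
URL = re.compile(r"https?://\S+")


def rejoin_wrapped_urls(text: str) -> str:
    """Remove the line break after a URL that appears to be wrapped.

    Repeats until nothing changes, so a URL wrapped over several
    lines is joined one break at a time.
    """
    while True:
        joined = WRAPPED_URL.sub(r"\1", text)
        if joined == text:
            return joined
        text = joined


if __name__ == "__main__":
    sample = (
        "See http://jpylyzer.openpreservation.org//2016/01/06/Release-of-\n"
        "jpylyzer-1-17-0 for details.\n"
    )
    print(URL.findall(rejoin_wrapped_urls(sample)))
```

This is a heuristic with a known weakness: a URL that genuinely ends in "/" at the end of a line will be glued to the first word of the following line, so results should be checked before relying on them.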