metachris / pdfx

Extract text, metadata and references (pdf, url, doi, arxiv) from PDF. Optionally download all referenced PDFs.
http://www.metachris.com/pdfx
Apache License 2.0

PDFx won't see links in some PDFs #37

Open ghost opened 4 years ago

ghost commented 4 years ago

PDFx won't see most of the links in the PDF below. Is this a known issue? Is there a fix for it? Many thanks! https://webarchive.nationalarchives.gov.uk/20160613090753/https://www.litvinenkoinquiry.org/files/Litvinenko-Inquiry-Report-web-version.pdf

htInEdin commented 4 years ago

There are actually several problems with link processing in pdfx and pdfminer. I'm working on a pull request to address as many of them as I can, but in the interim the attached simple patch (against pdfx version 1.3.0) will fix the most serious one.

By harvesting Link annotations (in addition to the links scraped from the text), this patch increases the number of links recovered from the file linked above from 3 to 1067.

backends_patch.txt
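
For readers without the attachment, the difference between the two approaches mentioned above can be sketched roughly as follows. This is not the attached patch, and it does not use the pdfx or pdfminer APIs: the plain dicts stand in for already-resolved PDF annotation dictionaries (a `Subtype` of `Link` with a `URI` action under `A`), and the names `urls_from_text`, `urls_from_annotations`, and the example.com URLs are hypothetical, chosen only for illustration.

```python
import re

# Deliberately simple URL pattern; real text scraping is usually more elaborate.
URL_RE = re.compile(r"https?://[^\s<>\"]+")


def urls_from_text(text):
    """Text scraping: finds only URLs that are literally printed on the page."""
    return set(URL_RE.findall(text))


def urls_from_annotations(annots):
    """Annotation harvesting: read the URI action of each Link annotation.

    `annots` is assumed to be a list of dicts mimicking resolved PDF
    annotation dictionaries; a real backend would first resolve these
    from the page's annotation array.
    """
    found = set()
    for annot in annots:
        if annot.get("Subtype") != "Link":
            continue  # e.g. sticky notes, highlights
        action = annot.get("A") or {}
        if action.get("S") == "URI" and "URI" in action:
            uri = action["URI"]
            if isinstance(uri, bytes):  # PDF strings often arrive as bytes
                uri = uri.decode("utf-8", errors="replace")
            found.add(uri)
    return found


# Hypothetical page: the visible text shows link titles, not the URLs behind them.
page_text = "See the full report and the hearing transcript. Also http://example.com/a"
page_annots = [
    {"Subtype": "Link", "A": {"S": "URI", "URI": b"http://example.com/report"}},
    {"Subtype": "Link", "A": {"S": "URI", "URI": "http://example.com/transcript"}},
    {"Subtype": "Link", "A": {"S": "GoTo", "D": "page5"}},  # internal jump, no URL
    {"Subtype": "Text", "Contents": "a sticky note"},
]

text_urls = urls_from_text(page_text)
annot_urls = urls_from_annotations(page_annots)
print(sorted(text_urls))
print(sorted(annot_urls))
```

When the clickable text is a title rather than the URL itself, only the annotation pass can see the target, which would be consistent with the large jump in recovered links reported above.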