metachris / pdfx

Extract text, metadata and references (pdf, url, doi, arxiv) from PDF. Optionally download all referenced PDFs.

http://www.metachris.com/pdfx

Apache License 2.0


Embedded URLs not being picked up? #19

Open gcoladon opened 7 years ago

gcoladon commented 7 years ago

Take a look at this PDF: http://mountainview.gov/civicax/filebank/blobdload.aspx?BlobID=20591

There are 11 documents linked to on the second page, but pdfx doesn't seem to notice them:

pdfx -v http://mountainview.gov/civicax/filebank/blobdload.aspx?BlobID=20591

Document infos:

Creator = Crystal Reports
Pages = 4
Producer = Powered By Crystal
Title = Agenda and Notice

References: 1

URL: 1

URL References:

- www.mountainview.gov

Am I doing something wrong, or would pdfx need to be changed to detect links like these?
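
One way to narrow this down is to check whether the links are stored as link annotations with URI actions rather than as visible text on the page. In the PDF format, a clickable link typically appears as an annotation dictionary containing an entry of the form `/URI (http://...)`. The following is only a rough diagnostic sketch using the Python standard library, not how pdfx works internally. It assumes the annotation dictionaries are stored uncompressed and that the URI contains no escaped or nested parentheses; the function name `find_uri_actions` is made up for this illustration.

```python
from __future__ import annotations

import re

# Matches the literal-string form of a URI action entry: /URI (http://...)
# Assumptions: the annotation dictionaries are not inside compressed object
# streams, and the URI has no escaped or nested parentheses.
URI_ENTRY = re.compile(rb"/URI\s*\(([^()\\]*)\)")


def find_uri_actions(pdf_bytes: bytes) -> list[str]:
    """Return the targets of URI actions found in the raw bytes of a PDF."""
    return [m.group(1).decode("latin-1") for m in URI_ENTRY.finditer(pdf_bytes)]
```

To try it, read the downloaded file in binary mode and pass the bytes to `find_uri_actions`. If it returns the document URLs while the text-based scan only reports `www.mountainview.gov`, the links most likely live in annotations. An empty result is inconclusive, since the annotations may sit in compressed object streams that this scan cannot see.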