metachris / pdfx

Extract text, metadata and references (pdf, url, doi, arxiv) from PDF. Optionally download all referenced PDFs.
http://www.metachris.com/pdfx
Apache License 2.0

PDF references should not be treated as such based on extension #30

Open theiostream opened 5 years ago

theiostream commented 5 years ago

A PDF file referenced from another PDF does not need a .pdf extension to be a PDF, so references should not be classified by extension alone. To download PDFs recursively I had to apply the following patch (in my case, the referenced files had no extension):

diff --git a/pdfx/__init__.py b/pdfx/__init__.py
index 6042e26..8411235 100644
--- a/pdfx/__init__.py
+++ b/pdfx/__init__.py
@@ -194,7 +194,7 @@ class PDFx(object):
         logger.debug("- Saved metadata to '%s'" % fn_json)

         # Download references
-        urls = [ref.ref for ref in self.get_references("pdf")]
+        urls = [ref.ref for ref in self.get_references()]
         if not urls:
             return

Of course, this quick fix brings its own problems: pdfx will try (and fail) to download mailto: links, and it will download any web page that happens to be linked. The point is that pdfx should offer some way, such as a custom regex, to pick out the desired files among the references. It could also support a posteriori checking: download a file, check whether it is a PDF, and delete it if it is not.
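
To make the suggestion concrete, here is a rough sketch of the two checks, assuming nothing about pdfx internals. The names `is_candidate`, `filter_references`, `looks_like_pdf` and `DOWNLOADABLE_SCHEMES` are hypothetical and not part of pdfx; the list of schemes and the 1024-byte window for the `%PDF-` marker are my own assumptions (some readers tolerate a few junk bytes before the header).

```python
import re
from urllib.parse import urlparse

# Hypothetical helpers for illustration only; none of these exist in pdfx.
# Assumption: only these schemes are worth a download attempt.
DOWNLOADABLE_SCHEMES = {"http", "https", "ftp"}


def is_candidate(url, pattern=None):
    """Return True if `url` is worth trying to download.

    `pattern` is an optional user-supplied regex; when given, the URL
    must match it somewhere (re.search semantics).
    """
    scheme = urlparse(url).scheme.lower()
    if scheme not in DOWNLOADABLE_SCHEMES:
        return False  # drops mailto: links and anything without a scheme
    if pattern is not None and not re.search(pattern, url):
        return False
    return True


def filter_references(urls, pattern=None):
    """Keep only the URLs that pass `is_candidate`, preserving order."""
    return [url for url in urls if is_candidate(url, pattern)]


def looks_like_pdf(data):
    """A posteriori check on downloaded bytes.

    Assumption: a file counts as a PDF if the '%PDF-' marker appears
    within its first 1024 bytes.
    """
    return b"%PDF-" in data[:1024]
```

The intended use would be to run `filter_references` on the output of `get_references()` before downloading, then call `looks_like_pdf` on each downloaded file and delete the ones that fail. Filtering first avoids the mailto: failures; the content check handles web pages that slip through the regex.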