metachris / pdfx

Extract text, metadata and references (pdf, url, doi, arxiv) from PDF. Optionally download all referenced PDFs.
http://www.metachris.com/pdfx
Apache License 2.0

"URI" in PDF attributes may be a string itself #31

Open theiostream opened 5 years ago

theiostream commented 5 years ago

The URI value in an attribute object may itself be a string instead of a PDFObjRef. Because this case is not handled, many URIs are ignored. The following patch fixed the issue for me, but a better solution may be desirable:

@@ -282,16 +279,22 @@ class PDFMinerBackend(ReaderBackend):
         if isinstance(obj_resolved, list):
             return [self.resolve_PDFObjRef(o) for o in obj_resolved]

         if "URI" in obj_resolved:
             if isinstance(obj_resolved["URI"], PDFObjRef):
                 return self.resolve_PDFObjRef(obj_resolved["URI"])
+            elif isinstance(obj_resolved["URI"], (str, unicode)):
+                # The URI is stored directly as a string, not as a reference.
+                if IS_PY2:
+                    ref = obj_resolved["URI"].decode("utf-8")
+                else:
+                    ref = obj_resolved["URI"]
+                return Reference(ref, self.curpage)
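
For illustration, the cases the patch distinguishes can be sketched outside pdfx as a small standalone function. This is only a rough sketch assuming Python 3: `FakeObjRef` and `uri_from_annotation` are hypothetical names that are not part of pdfx or pdfminer, and the bytes branch assumes the URI is UTF-8 compatible.

```python
class FakeObjRef:
    """Hypothetical stand-in for an indirect PDF object reference."""

    def __init__(self, target):
        self.target = target

    def resolve(self):
        return self.target


def uri_from_annotation(annot):
    """Return the annotation's URI as text, or None if there is none."""
    if not isinstance(annot, dict) or "URI" not in annot:
        return None
    value = annot["URI"]
    # Case 1: the URI is an indirect reference; follow it to the real value.
    while isinstance(value, FakeObjRef):
        value = value.resolve()
    # Case 2: the URI is stored directly as a byte string.
    if isinstance(value, bytes):
        # Assumption: UTF-8 compatible; undecodable bytes are replaced.
        return value.decode("utf-8", errors="replace")
    # Case 3: the URI is already a text string.
    if isinstance(value, str):
        return value
    return None
```

Called with `{"URI": b"http://example.com"}`, `{"URI": "http://example.com"}` or `{"URI": FakeObjRef("http://example.com")}`, the function returns the same text in each case, which is the behaviour the patch is after.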

morriscode commented 5 years ago

Thanks!