metachris / pdfx

Extract text, metadata and references (pdf, url, doi, arxiv) from PDF. Optionally download all referenced PDFs.
http://www.metachris.com/pdfx
Apache License 2.0

PDFx is storing prior parsed PDFs causing incorrect references / annotations to be found #14

Closed (scottwernervt closed this issue 8 years ago)

scottwernervt commented 8 years ago

Attachments: Doc1.pdf, Doc2.pdf

Calling get_references() on several PDFs in the same Python process causes the annotations from every previously parsed PDF to show up in the results for the current one.

PDF 1: Correct

from pdfx import PDFx
pdf_1 = PDFx('Doc1.pdf')
print([url.ref for url in pdf_1.get_references()])
# >> ['http://www.google.com/', 'google.com']

PDF 2: Correct

from pdfx import PDFx
pdf_2 = PDFx('Doc2.pdf')
print([url.ref for url in pdf_2.get_references()])
# >> ['bing.com', 'http://www.bing.com/']

PDF1 and PDF2 Together: Bug - PDF2 has annotations from PDF1

# -*- coding: utf-8 -*-
from pdfx import PDFx
pdf_1 = PDFx('Doc1.pdf')
print([url.ref for url in pdf_1.get_references()])
# >> ['google.com', 'http://www.google.com/']
pdf_2 = PDFx('Doc2.pdf')
print([url.ref for url in pdf_2.get_references()])
# >> ['http://www.google.com/', 'bing.com', 'google.com', 'http://www.bing.com/']
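
The thread does not say what caused the carry-over. One common way this symptom arises in Python is a mutable container defined at class level and shared by every instance. The sketch below illustrates that pattern only; LeakyParser and IsolatedParser are hypothetical stand-ins, not pdfx code, and nothing here is confirmed to be the actual cause in pdfx.

```python
class LeakyParser:
    # Hypothetical stand-in, not pdfx code. The set is created once, when the
    # class body runs, so every instance shares the same object.
    references = set()

    def __init__(self, refs):
        # Mutates the shared class-level set instead of creating a new one.
        self.references.update(refs)

    def get_references(self):
        return self.references


class IsolatedParser:
    # Hypothetical stand-in, not pdfx code.
    def __init__(self, refs):
        # A fresh set per instance keeps each document's results separate.
        self.references = set(refs)

    def get_references(self):
        return self.references


leaky_1 = LeakyParser(["google.com"])
leaky_2 = LeakyParser(["bing.com"])
print(sorted(leaky_2.get_references()))  # ['bing.com', 'google.com']

isolated_1 = IsolatedParser(["google.com"])
isolated_2 = IsolatedParser(["bing.com"])
print(sorted(isolated_2.get_references()))  # ['bing.com']
```

The leaky variant mirrors the report: the second object returns its own entries plus those of the first, while each object alone looks correct.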

metachris commented 8 years ago

Thanks for reporting! Sorry it took a little while. Fixed now! 🚀
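
A rough way to check for this kind of carry-over after upgrading is to parse several files in one process and flag any reference that already appeared in an earlier file. The helper name find_carried_over and the fake parser below are assumptions made for illustration; the fake parser simply replays the outputs reported in this issue so the sketch runs without pdfx or the attached PDFs.

```python
def find_carried_over(parse_refs, paths):
    """Parse paths in order and map each path to the references that
    were already returned for an earlier path."""
    seen = set()
    carried = {}
    for path in paths:
        refs = set(parse_refs(path))
        overlap = refs & seen
        if overlap:
            carried[path] = overlap
        seen |= refs
    return carried


# Fake parser (an assumption): replays the buggy outputs quoted in the report.
reported = {
    "Doc1.pdf": ["google.com", "http://www.google.com/"],
    "Doc2.pdf": ["http://www.google.com/", "bing.com",
                 "google.com", "http://www.bing.com/"],
}

# Flags the two google.com entries under 'Doc2.pdf'.
print(find_carried_over(reported.get, ["Doc1.pdf", "Doc2.pdf"]))
```

With pdfx installed, parse_refs could be a function returning [url.ref for url in PDFx(path).get_references()], as in the report above. An empty result suggests no carry-over, but two PDFs that genuinely cite the same URL would also be flagged, so treat any hit as a prompt to look closer.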