jorisschellekens / borb

borb is a library for reading, creating and manipulating PDF files in Python.
https://borbpdf.com/

Extracting meta-information from existing PDFs #70

Closed lpozo closed 2 years ago

lpozo commented 2 years ago
from pathlib import Path
from typing import List, Union

from borb.pdf.pdf import PDF

def get_pdf_list(src_dir: Union[Path, str]) -> List[Path]:
    if isinstance(src_dir, str):
        src_dir = Path(src_dir)
    # Collect only PDF files; rglob("*") would also return non-PDF files.
    return [path for path in src_dir.rglob("*.pdf") if path.is_file()]

def main():
    for pdf in get_pdf_list("books/"):
        with pdf.open("rb") as in_file_handle:
            doc = PDF.loads(in_file_handle)
        print(doc.get_document_info().get_author())

if __name__ == "__main__":
    main()

This code produces a very long traceback ending in a RecursionError that points to Dictionary.add_base_methods(). When the target PDF is read successfully, I get None as the author's info.

Doing a similar operation with PyPDF4 takes milliseconds. However, PyPDF4 apparently isn't actively maintained.

I've noticed that reading PDF files created with borb works correctly and is faster. But I assume that most of the time we work with PDFs created by other tools.
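
To narrow down which files in a directory trigger the RecursionError, the loop from the snippet above could record failures per file instead of stopping at the first traceback. This is only a sketch: `scan_authors` and `read_author_with_borb` are hypothetical helper names, not part of borb, and the borb calls are the same ones used in the original snippet.

```python
from pathlib import Path
from typing import Callable, Dict, Iterable, Optional, Tuple


def scan_authors(
    paths: Iterable[Path],
    read_author: Callable[[Path], Optional[str]],
) -> Tuple[Dict[Path, Optional[str]], Dict[Path, str]]:
    """Call read_author on every path and record failures instead of aborting."""
    authors: Dict[Path, Optional[str]] = {}
    failures: Dict[Path, str] = {}
    for path in paths:
        try:
            authors[path] = read_author(path)
        except Exception as exc:  # RecursionError is a subclass of Exception
            failures[path] = f"{type(exc).__name__}: {exc}"
    return authors, failures


def read_author_with_borb(path: Path) -> Optional[str]:
    """Hypothetical wrapper around the calls from the original snippet."""
    # Imported lazily so scan_authors can be used without borb installed.
    from borb.pdf.pdf import PDF

    with path.open("rb") as in_file_handle:
        doc = PDF.loads(in_file_handle)
    return doc.get_document_info().get_author()
```

With the `get_pdf_list` function above, `authors, failures = scan_authors(get_pdf_list("books/"), read_author_with_borb)` would list the failing files in `failures`, while files that load but have no author set show up in `authors` with the value None.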

jorisschellekens commented 2 years ago

Hi there,

Can you please attach an example pdf that triggers the RecursionError?

I have a feeling this may be the real underlying problem.

Kind regards, Joris Schellekens

lpozo commented 2 years ago

Ohh, unfortunately, I don't think I can legally share the PDF because of its copyright. It's a book by Manning Publications called The Quick Python Book, Second Edition. Here are some of its properties:

[image: screenshot of the PDF's document properties]

jorisschellekens commented 2 years ago

The properties themselves don't really tell me much about why the problem is occurring.

I already have a test in my test-suite that attempts to read the meta-information of more than 1000 pdf documents.

My test-repository can be found here: https://github.com/jorisschellekens/pdf-corpus

So I'm pretty confident borb can actually do this. That's why I'd like your exact document: to see how it differs from the documents I'm already testing against.

Perhaps you can find a similar, non-copyrighted work?

Kind regards, Joris Schellekens
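
When the document itself cannot be shared, one detail that could still be reported without sharing its content is the PDF version declared in the file header (PDF files start with a `%PDF-x.y` marker). The sketch below uses only the standard library. `pdf_header_version` is a hypothetical helper, not part of borb, and the version alone may well not explain the RecursionError.

```python
import re
from pathlib import Path
from typing import Optional

# The header marker, e.g. b"%PDF-1.6", is expected near the start of the file.
_HEADER = re.compile(rb"%PDF-(\d+\.\d+)")


def pdf_header_version(path: Path, probe_bytes: int = 1024) -> Optional[str]:
    """Return the version from the %PDF-x.y header, or None if no header is found."""
    with path.open("rb") as handle:
        head = handle.read(probe_bytes)
    match = _HEADER.search(head)
    return match.group(1).decode("ascii") if match else None
```

Running this over both a failing file and a few files that load correctly might help when looking for a similar, shareable document that reproduces the problem.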

jorisschellekens commented 2 years ago

Hi there,

This issue has been open for a week now. If you cannot provide me with an input document that reproduces the problem, then I can't help you.

I am going to close this ticket as "can not reproduce". If at some point you do find a document that you can share, you are welcome to re-open the ticket.

Kind regards, Joris Schellekens