Extracting meta-information from existing PDFs

lpozo commented 2 years ago

Overview of the Issue: Extracting meta-information of existing PDFs takes too long and eventually raises a RecursionError. The worst issue is that most of the time the only metadata we get is None, even when a regular PDF reader correctly shows the meta-info.
Use case: My use case consists of processing a list of PDF files in a directory and getting their meta information for further processing.
borb Version(s): 2.0.18
Python version: 3.10
Operating System: Ubuntu Linux 20.04
Reproduce the Error: To reproduce the issue, I have a folder with a bunch of random PDF files, mostly books with several pages. I'm running this code snippet over them:

from pathlib import Path
from typing import List, Union

from borb.pdf.pdf import PDF

def get_pdf_list(src_dir: Union[Path, str]) -> List[Path]:
    if isinstance(src_dir, str):
        src_dir = Path(src_dir)
    return [path for path in src_dir.rglob("*") if path.is_file()]

def main():
    for pdf in get_pdf_list("books/"):
        with pdf.open("rb") as in_file_handle:
            doc = PDF.loads(in_file_handle)
        print(doc.get_document_info().get_author())

if __name__ == "__main__":
    main()

This code generates a really long traceback with a RecursionError pointing to Dictionary.add_base_methods(). When the target PDF is read successfully, I get None as the author's info.
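To narrow down which files trigger the RecursionError, the loop can be made to record failures and carry on instead of stopping at the first traceback. The following is a minimal sketch, not a fix for the underlying problem: `collect_authors` and `read_author` are hypothetical names introduced here, and the borb wiring in the trailing comment reuses only the calls from the snippet above. It also restricts the scan to `*.pdf`, on the assumption that the directory may contain non-PDF files.

```python
from pathlib import Path
from typing import Callable, Dict, List, Optional, Tuple, Union


def collect_authors(
    src_dir: Union[Path, str],
    read_author: Callable[[Path], Optional[str]],
) -> Tuple[Dict[Path, Optional[str]], List[Path]]:
    """Return (author by path, paths whose parsing raised RecursionError)."""
    authors: Dict[Path, Optional[str]] = {}
    failed: List[Path] = []
    # Only look at regular files ending in .pdf; rglob("*") would also
    # return any non-PDF files that live in the directory tree.
    for path in sorted(Path(src_dir).rglob("*.pdf")):
        if not path.is_file():
            continue
        try:
            authors[path] = read_author(path)
        except RecursionError:
            # Remember the offending file and continue with the rest.
            failed.append(path)
    return authors, failed


# Hypothetical wiring with borb, using the same calls as the snippet above:
#
# from borb.pdf.pdf import PDF
#
# def borb_author(path: Path) -> Optional[str]:
#     with path.open("rb") as in_file_handle:
#         doc = PDF.loads(in_file_handle)
#     return doc.get_document_info().get_author()
#
# authors, failed = collect_authors("books/", borb_author)
```

The `failed` list then points at the specific documents worth inspecting or replacing with a shareable equivalent. Note that the `*.pdf` pattern is case-sensitive on Linux, so files named `.PDF` would be skipped.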

Doing a similar operation with PyPDF4 takes milliseconds. However, apparently, this library isn't actively maintained.

I've noticed that reading PDF files created with borb works correctly and is faster. However, I assume that most of the time we work with PDFs created by other tools.

jorisschellekens commented 2 years ago

Hi there,

Can you please attach an example pdf that triggers the RecursionError?

I have a feeling this may be the real underlying problem.

Kind regards, Joris Schellekens

lpozo commented 2 years ago

Ohh, unfortunately, I don't think I can legally share the PDF because of its copyright. It's a book by Manning Publications called The Quick Python Book, Second Edition. Here are some of its properties:

jorisschellekens commented 2 years ago

The properties themselves don't really tell me much about why the problem is occurring.

I already have a test in my test-suite that attempts to read the meta-information of more than 1000 pdf documents.

My test-repository can be found here: https://github.com/jorisschellekens/pdf-corpus

So I'm pretty confident borb can actually do this. That's why I'd like your exact document: to see how it differs from the documents I'm already testing against.

Perhaps you can find a similar, non-copyrighted work?

Kind regards, Joris Schellekens

jorisschellekens commented 2 years ago

Hi there,

This issue has been open for a week now. If you cannot provide me with an input document that reproduces the problem, then I can't help you.

I am going to close this ticket as "can not reproduce". If at some point you do find a document that you can share, you are welcome to re-open the ticket.

Kind regards, Joris Schellekens

jorisschellekens / borb

Extracting meta-information from existing PDFs #70