jorisschellekens / borb

borb is a library for reading, creating and manipulating PDF files in python.
https://borbpdf.com/

Why is `borb` so (incredibly) slow? #39

Closed sgraaf closed 2 years ago

sgraaf commented 2 years ago

Hi!

I am working on a PDF text mining project, for which I decided to benchmark & compare various Python PDF libraries for reading PDF files. For a random sample of 10 PDF files (270 pages, 17.5 MiB in total), I get the following results:

Summary statistics for the sample of 10 PDFs
File 1/10        30 pages       2.53 MiB
File 2/10        36 pages       2.55 MiB
File 3/10        19 pages       0.85 MiB
File 4/10        30 pages       1.89 MiB
File 5/10        20 pages       1.15 MiB
File 6/10        29 pages       1.89 MiB
File 7/10        32 pages       2.14 MiB
File 8/10        19 pages       0.85 MiB
File 9/10        19 pages       0.95 MiB
File 10/10       36 pages       2.75 MiB
Total size of all 10 PDF-files: 17.54 MiB

---------- Benchmarking pdfrw ----------
Reading PDF-file 1/10 took 0.006 seconds
Reading PDF-file 2/10 took 0.006 seconds
Reading PDF-file 3/10 took 0.004 seconds
Reading PDF-file 4/10 took 0.005 seconds
Reading PDF-file 5/10 took 0.004 seconds
Reading PDF-file 6/10 took 0.005 seconds
Reading PDF-file 7/10 took 0.006 seconds
Reading PDF-file 8/10 took 0.004 seconds
Reading PDF-file 9/10 took 0.003 seconds
Reading PDF-file 10/10 took 0.006 seconds

Reading all 10 PDF-files w/ `pdfrw` took 0.051 seconds

---------- Benchmarking PyPDF2 ----------
Reading PDF-file 1/10 took 0.005 seconds
Reading PDF-file 2/10 took 0.007 seconds
Reading PDF-file 3/10 took 0.003 seconds
Reading PDF-file 4/10 took 0.005 seconds
Reading PDF-file 5/10 took 0.004 seconds
Reading PDF-file 6/10 took 0.005 seconds
Reading PDF-file 7/10 took 0.006 seconds
Reading PDF-file 8/10 took 0.004 seconds
Reading PDF-file 9/10 took 0.003 seconds
Reading PDF-file 10/10 took 0.007 seconds

Reading all 10 PDF-files w/ `PyPDF2` took 0.050 seconds

--------- Benchmarking PyMuPDF ---------
Reading PDF-file 1/10 took 0.002 seconds
Reading PDF-file 2/10 took 0.001 seconds
Reading PDF-file 3/10 took 0.000 seconds
Reading PDF-file 4/10 took 0.002 seconds
Reading PDF-file 5/10 took 0.001 seconds
Reading PDF-file 6/10 took 0.001 seconds
Reading PDF-file 7/10 took 0.001 seconds
Reading PDF-file 8/10 took 0.001 seconds
Reading PDF-file 9/10 took 0.001 seconds
Reading PDF-file 10/10 took 0.001 seconds

Reading all 10 PDF-files w/ `PyMuPDF` took 0.011 seconds

------- Benchmarking pdfminer.six -------
Reading PDF-file 1/10 took 0.139 seconds
Reading PDF-file 2/10 took 0.151 seconds
Reading PDF-file 3/10 took 0.070 seconds
Reading PDF-file 4/10 took 0.127 seconds
Reading PDF-file 5/10 took 0.081 seconds
Reading PDF-file 6/10 took 0.123 seconds
Reading PDF-file 7/10 took 0.137 seconds
Reading PDF-file 8/10 took 0.069 seconds
Reading PDF-file 9/10 took 0.070 seconds
Reading PDF-file 10/10 took 0.152 seconds

Reading all 10 PDF-files w/ `pdfminer.six` took 1.118 seconds

-------- Benchmarking pdfplumber --------
Reading PDF-file 1/10 took 0.152 seconds
Reading PDF-file 2/10 took 0.169 seconds
Reading PDF-file 3/10 took 0.078 seconds
Reading PDF-file 4/10 took 0.144 seconds
Reading PDF-file 5/10 took 0.088 seconds
Reading PDF-file 6/10 took 0.138 seconds
Reading PDF-file 7/10 took 0.148 seconds
Reading PDF-file 8/10 took 0.078 seconds
Reading PDF-file 9/10 took 0.081 seconds
Reading PDF-file 10/10 took 0.170 seconds

Reading all 10 PDF-files w/ `pdfplumber` took 1.247 seconds

--------- Benchmarking pikepdf ---------
Reading PDF-file 1/10 took 0.022 seconds
Reading PDF-file 2/10 took 0.025 seconds
Reading PDF-file 3/10 took 0.014 seconds
Reading PDF-file 4/10 took 0.023 seconds
Reading PDF-file 5/10 took 0.026 seconds
Reading PDF-file 6/10 took 0.021 seconds
Reading PDF-file 7/10 took 0.023 seconds
Reading PDF-file 8/10 took 0.015 seconds
Reading PDF-file 9/10 took 0.013 seconds
Reading PDF-file 10/10 took 0.025 seconds

Reading all 10 PDF-files w/ `pikepdf` took 0.207 seconds

----------- Benchmarking tika -----------
Reading PDF-file 1/10 took 1.263 seconds
Reading PDF-file 2/10 took 1.467 seconds
Reading PDF-file 3/10 took 1.286 seconds
Reading PDF-file 4/10 took 1.242 seconds
Reading PDF-file 5/10 took 1.887 seconds
Reading PDF-file 6/10 took 1.117 seconds
Reading PDF-file 7/10 took 1.274 seconds
Reading PDF-file 8/10 took 1.289 seconds
Reading PDF-file 9/10 took 1.418 seconds
Reading PDF-file 10/10 took 1.402 seconds

Reading all 10 PDF-files w/ `tika` took 13.645 seconds

----------- Benchmarking borb -----------
Reading PDF-file 1/10 took 273.480 seconds
Reading PDF-file 2/10 took 322.414 seconds
Reading PDF-file 3/10 took 298.064 seconds
Reading PDF-file 4/10 took 275.535 seconds
Reading PDF-file 5/10 took 411.225 seconds
Reading PDF-file 6/10 took 246.551 seconds
Reading PDF-file 7/10 took 269.851 seconds
Reading PDF-file 8/10 took 292.939 seconds
Reading PDF-file 9/10 took 318.867 seconds
Reading PDF-file 10/10 took 318.921 seconds

Reading all 10 PDF-files w/ `borb` took 3027.847 seconds

Even compared to tika, which makes calls to a RESTful API, borb is 200+ times slower (3027.847 s vs. 13.645 s). Compared to the fastest "Pure Python" library in this little benchmarking test (PyPDF2, 0.050 s), borb is 60,000+ times slower.
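For anyone who wants to check these ratios, they can be recomputed from the totals printed in the benchmark output above. This is only a small sketch: the `totals` dictionary is copied by hand from that output, and `slowdown` is an illustrative helper name, not part of any library.

```python
# Total wall-clock times in seconds, copied from the benchmark output above.
totals = {
    "pdfrw": 0.051,
    "PyPDF2": 0.050,
    "PyMuPDF": 0.011,
    "pdfminer.six": 1.118,
    "pdfplumber": 1.247,
    "pikepdf": 0.207,
    "tika": 13.645,
    "borb": 3027.847,
}


def slowdown(library: str, baseline: str) -> float:
    """Return how many times slower `library` is than `baseline`."""
    return totals[library] / totals[baseline]


# Print borb's slowdown factor relative to every other library.
for name in totals:
    if name != "borb":
        print(f"borb vs {name}: {slowdown('borb', name):,.0f}x slower")
```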

I really like borb's API: I find it to be very intuitive and Pythonic. As such, I would love to use it in this project and similar ones. So I guess my question is: what gives?

jorisschellekens commented 2 years ago

Thanks for your feedback. I'd love to know what takes up most of the time. If you can dig deeper and perhaps perform a trace to see which functions are the most labor-intensive, that'd be great.
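One way to produce such a trace is Python's built-in `cProfile` module. The sketch below is an assumption about how this could be done, not a procedure prescribed in this thread; `profile_call` and `workload` are hypothetical names, and the stand-in workload would be replaced by the call that loads a single PDF.

```python
import cProfile
import io
import pstats
from typing import Any, Callable


def profile_call(func: Callable[..., Any], *args: Any, top: int = 15) -> str:
    """Run func(*args) under cProfile and return a text report of the
    `top` entries, sorted by cumulative time."""
    profiler = cProfile.Profile()
    profiler.runcall(func, *args)
    buffer = io.StringIO()
    stats = pstats.Stats(profiler, stream=buffer)
    stats.sort_stats(pstats.SortKey.CUMULATIVE).print_stats(top)
    return buffer.getvalue()


if __name__ == "__main__":
    # Stand-in workload; swap in the real function that reads one PDF.
    def workload() -> int:
        return sum(i * i for i in range(200_000))

    print(profile_call(workload))
```

Sorting by cumulative time puts the functions whose whole call tree is most expensive at the top, which is usually the quickest way to see where the time goes.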

Kind regards, Joris Schellekens

sgraaf commented 2 years ago

Thanks for your response.

For this little benchmark, for every PDF library, the goal was to see how long it would take to get a simple page count. For borb, I benchmarked the following code:

from pathlib import Path
from borb.pdf.pdf import PDF  # import path assumed (borb 2.x)

def get_page_count_borb(file: Path) -> int:
    with open(file, "rb") as f:
        doc = PDF.loads(f)
        return doc.get_document_info().get_number_of_pages()

I'll see if I can narrow it down further soon.
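The timing loop itself is not shown in this thread. A minimal sketch of a harness that would print output in the same shape as the results above could look like this; the `benchmark` function and its parameters are assumptions for illustration, and any page-count function such as the one above can be passed in.

```python
import time
from pathlib import Path
from typing import Callable, Iterable


def benchmark(
    name: str,
    get_page_count: Callable[[Path], int],
    files: Iterable[Path],
) -> float:
    """Time `get_page_count` on each file and return the total in seconds."""
    file_list = list(files)
    total = 0.0
    print(f"Benchmarking {name}")
    for i, file in enumerate(file_list, start=1):
        start = time.perf_counter()
        get_page_count(file)
        elapsed = time.perf_counter() - start
        total += elapsed
        print(f"Reading PDF-file {i}/{len(file_list)} took {elapsed:.3f} seconds")
    print(f"Reading all {len(file_list)} PDF-files w/ `{name}` took {total:.3f} seconds")
    return total
```

`time.perf_counter` is used because it is a monotonic, high-resolution clock, which matters when individual reads take only a few milliseconds.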

jorisschellekens commented 2 years ago

I think the first thing to note is that, for borb, there is no difference between getting a page count and getting the text from all pages.

The entire binary stream is converted to the internal representation when you're loading a Document.

I imagine some libraries might optimize that, and only parse the needed things to get the page-count.

jorisschellekens commented 2 years ago

I changed the code around a bit. Rather than always parsing the content of the page, a page is now only parsed if there is actually a registered EventListener. In other words, if the content of the page is not needed by anyone, it isn't parsed.

This still allows you to open / copy / modify documents. And of course to read their metadata (such as number of pages). This should provide a significant speed-up.
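The idea can be illustrated with a toy model. This is not borb's actual code: `LazyPage`, `LazyDocument` and `load` are hypothetical names, and splitting bytes on whitespace stands in for the expensive content-stream parsing. The point is only that the costly step runs when at least one listener is registered, and is skipped otherwise.

```python
from typing import Callable, List, Sequence


class LazyPage:
    """Hypothetical page that keeps its raw content and parses it on demand."""

    def __init__(self, raw_content: bytes) -> None:
        self._raw_content = raw_content
        self.parse_count = 0  # how often the expensive step actually ran

    def parse(self) -> List[str]:
        # Stand-in for interpreting the page's content stream.
        self.parse_count += 1
        return self._raw_content.decode("latin-1").split()


class LazyDocument:
    """Hypothetical document whose metadata is available without parsing pages."""

    def __init__(self, raw_pages: Sequence[bytes]) -> None:
        self.pages = [LazyPage(raw) for raw in raw_pages]

    def number_of_pages(self) -> int:
        return len(self.pages)


def load(
    raw_pages: Sequence[bytes],
    listeners: Sequence[Callable[[List[str]], None]] = (),
) -> LazyDocument:
    """Build the document; parse page content only if a listener wants it."""
    doc = LazyDocument(raw_pages)
    if listeners:
        for page in doc.pages:
            tokens = page.parse()
            for listener in listeners:
                listener(tokens)
    return doc
```

With no listeners, `load(...)` returns a document whose page count can be read while every `parse_count` stays at zero; passing a listener triggers exactly one parse per page.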

These are my findings. The corpus I used can be found here.

output.pdf

Kind regards, Joris Schellekens