Closed sgraaf closed 2 years ago
Thanks for your feedback. I'd love to know what takes up most of the time. If you can dig deeper and perhaps perform a trace to see which functions are the most labor-intensive, that'd be great.
Kind regards, Joris Schellekens
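A trace like the one requested above could be collected with the standard-library profiler. The following is only a sketch: `profile_call` and `slow_stand_in` are hypothetical names introduced here, and the stand-in function replaces the real page-count call so the example runs without any PDF library installed.

```python
import cProfile
import io
import pstats


def profile_call(func, *args, top=15, **kwargs):
    """Run func under cProfile and return (result, report_text)."""
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args, **kwargs)

    buffer = io.StringIO()
    stats = pstats.Stats(profiler, stream=buffer)
    # Sort by cumulative time so the most expensive call chains come first.
    stats.sort_stats("cumulative").print_stats(top)
    return result, buffer.getvalue()


def slow_stand_in(n):
    # Hypothetical placeholder for the real workload (e.g. a page-count function).
    return sum(i * i for i in range(n))


if __name__ == "__main__":
    _, report = profile_call(slow_stand_in, 200_000)
    print(report)
```

Swapping `slow_stand_in` for the actual function under test should list the functions that dominate the run time at the top of the report.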
Thanks for your response.
For this little benchmark, the goal was to see how long each PDF library takes to get a simple page count. For borb, I benchmarked the following code:
from pathlib import Path
from borb.pdf import PDF

def get_page_count_borb(file: Path) -> int:
    with open(file, "rb") as f:
        doc = PDF.loads(f)
    return doc.get_document_info().get_number_of_pages()
I'll see if I can narrow it down further soon.
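For reference, a timing harness for this kind of comparison might look roughly like the sketch below. It is not the harness used for the numbers in this thread: `time_page_count` and its `repeats` parameter are assumptions made for illustration.

```python
import time
from pathlib import Path
from typing import Callable, Iterable


def time_page_count(
    get_page_count: Callable[[Path], int],
    files: Iterable[Path],
    repeats: int = 3,
) -> float:
    """Return the best-of-N wall-clock time (seconds) to count pages in all files."""
    files = list(files)
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        for file in files:
            get_page_count(file)
        # Keep the fastest run to reduce noise from other processes.
        best = min(best, time.perf_counter() - start)
    return best
```

Any function with the same signature as the page-count function for borb can be passed in, together with the list of PDF paths in the sample.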
I think one of the first things to note is that, for borb, there is no difference between getting a page count and getting the text from all pages.
The entire binary stream is converted to the internal representation when you load a Document.
I imagine some libraries might optimize that, and only parse the needed things to get the page-count.
I changed the code around a bit.
Rather than always parsing the content of the page, a page is now only parsed if there is actually a registered EventListener. In other words, if the content of the page is not needed by anyone, it isn't parsed.
This still allows you to open / copy / modify documents. And of course to read their metadata (such as number of pages). This should provide a significant speed-up.
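The idea can be illustrated with a small sketch. This is not borb's actual implementation: the `Page` and `Document` classes below are hypothetical stand-ins, and "parsing" is reduced to splitting the raw bytes into tokens.

```python
class Page:
    """Hypothetical page that keeps its raw content stream until needed."""

    def __init__(self, raw_content: bytes):
        self.raw_content = raw_content
        self.parsed = False

    def parse(self, listeners):
        # Stand-in for the expensive step: tokenize and notify every listener.
        self.parsed = True
        for token in self.raw_content.split():
            for listener in listeners:
                listener(token)


class Document:
    """Hypothetical document that parses page content only on demand."""

    def __init__(self, raw_pages, listeners=()):
        self.pages = [Page(raw) for raw in raw_pages]
        # Only pay for content parsing when somebody is listening.
        if listeners:
            for page in self.pages:
                page.parse(listeners)

    def number_of_pages(self):
        return len(self.pages)
```

With no listeners, `Document([...]).number_of_pages()` returns without touching the page content, which is the case the page-count benchmark exercises.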
These are my findings. The corpus I used can be found here.
Kind regards, Joris Schellekens
Hi!
I am working on a PDF text mining project, for which I decided to benchmark & compare various Python PDF libraries for reading PDF files. For a random sample of 10 PDF files (270 pages, 17.5 MiB in total), I get the following results:
Even compared to tika, which makes calls to a RESTful API, borb is 200+ times slower. Compared to the fastest "Pure Python" library in this little benchmarking test (PyPDF2), borb is 600k+ times slower.
I really like borb's API: I find it to be very intuitive and Pythonic. As such, I would love to use it in this project and similar ones. So I guess my question is: what gives?