[documents] Benchmark PDF document reading + numpy conversion options

fg-mindee commented 3 years ago

Currently, the core reading of PDF documents is done with PyMuPDF. This needs to be benchmarked against alternatives to ensure we are using the best backend here.

charlesmindee commented 3 years ago

PDF reader benchmark: Some libraries are old and not well maintained, such as PyPDF2. Here are the main maintained alternatives:

PyMuPDF (based on MuPDF, a lightweight software lib in C)
pikepdf (based on QPDF, C++ powerful lib)
pdfminer.six (pure python package)
PyPDF3/4 (improvement of PyPDF2, pure python package)
python-poppler (based on xpdf-3.0, software tool)

python-poppler has many options: loading pages/documents, extracting text with bounding boxes, and getting font information. This package requires installing poppler, a Linux software package based on xpdf-3.0 that has 29 dependencies. On Ubuntu, libpoppler-cpp-dev is also required to compile the C++ files. This benchmark shows that Poppler is far less optimized than MuPDF. It concludes: "Poppler uses Cairo to save result to image files. I don’t know which is the bottleneck: Poppler itself or Cairo. Also notice that MuPDF has its own graphics library Fitz. On the other hand, MuPDF is not able to render to other formats other than image. Poppler, with cairo as backend, supports more output formats. Therefore, if you just want a lightweight and fast PDF viewer, I think you should consider MuPDF. If you want more features, it’s better to choose Poppler."
This benchmark clearly shows that for PDF parsing, PyMuPDF far outperforms PyPDF2 (and consequently PyPDF3/4, which are based on PyPDF2), pdftk and pdfrw (two old, deprecated tools). For text extraction, the same benchmark shows that PyMuPDF is much faster than pdfminer and poppler. For image rendering, PyMuPDF is also much faster than Xpdf (poppler).
pikepdf seems to be pretty fast, but doesn't deal with text extraction at all. Its documentation recommends using pdfminer.six, which is far slower than PyMuPDF for this task.

As a conclusion:

pikepdf cannot do text extraction
pdfminer.six is very slow (pure python)
PyPDF3/4 are pure-Python tools, slow, and cannot extract text
python-poppler seems to have many features but is also slower than PyMuPDF and requires installing external software (poppler)
PyMuPDF seems by far the fastest way to read and extract text from PDFs, and it offers many options: read/convert, extract text and images, render, and access meta information for all supported document types, not just PDF (“.pdf”, “.xps”, “.oxps”, “.cbz”, “.fb2” or “.epub” + 10 popular image formats). Specifically for PDF, there are many more features (reposition images, extract fonts, merge, decrypt protected PDFs, ...)
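
For anyone who wants to re-run this kind of speed comparison locally, here is a minimal timing sketch. It is not the harness used by the benchmarks cited above: the function name benchmark_extractors and its parameters are hypothetical, and each backend is assumed to be wrapped in a callable that takes a file path and returns the extracted text.

```python
import statistics
import time
from typing import Callable, Dict, Iterable


def benchmark_extractors(
    extractors: Dict[str, Callable[[str], str]],
    pdf_paths: Iterable[str],
    repeats: int = 3,
) -> Dict[str, float]:
    """Return the median wall-clock time (in seconds) per backend.

    `extractors` maps a backend label to a hypothetical wrapper that
    takes a PDF path and returns its text. One timing covers a full
    pass over all files; the median over `repeats` passes is reported.
    """
    paths = list(pdf_paths)
    results: Dict[str, float] = {}
    for name, extract in extractors.items():
        timings = []
        for _ in range(repeats):
            start = time.perf_counter()
            for path in paths:
                extract(path)
            timings.append(time.perf_counter() - start)
        results[name] = statistics.median(timings)
    return results
```

The median is used instead of the mean so that a single slow pass (cold disk cache, for example) does not dominate the result.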

@fg-mindee

fg-mindee commented 3 years ago

Let's stick with PyMuPDF for now then and hope that we somehow manage to get around #113 later on!

MartinThoma commented 2 years ago

Just a short heads-up (I see that this issue got closed a while ago; people might still find it via Google):

PyPDF2 is maintained again since April 2022. I'm the new maintainer and there are at least two other people who are pretty active. We are currently preparing PyPDF2 2.0.0 and will work on text extraction improvements afterwards.
I've set up a PDF text extraction benchmark that might be interesting to you

frgfm commented 2 years ago

Hi @MartinThoma 👋

Thanks for letting us know! Could you specify which license will be used for the upcoming refactor?

For the sake of documentation, in #486, we considered another recent option: pypdfium. We should do a full performance benchmark, but the license is compatible with all OSS projects and the support is great so far :)

MartinThoma commented 2 years ago

PyPDF2 stays with BSD (3-Clause).

Nice, I didn't know pypdfium. If you let me know how it extracts text from a PDF, I'll add it to the benchmark :-)

frgfm commented 2 years ago

PyPDF2 stays with BSD (3-Clause).

Nice, I didn't know pypdfium. If you let me know how it extracts text from a PDF, I'll add it to the benchmark :-)

Thanks to the amazing @mara004, you can find it here :)

mara004 commented 2 years ago

Thanks for the compliment ;). FYI, I'm currently working on a full-scale API rewrite to fix some annoyances, so it would probably make sense to wait for this to be merged before you implement something new with pypdfium2. That said, while testing, I found out pypdfium2 is currently unable to extract text in special writing systems like Hindi. I'm not sure if this is caused by PDFium itself or the way I'm decoding the data. If doctr needs to support this and you don't mind an additional dependency, then you might want to consider one of the alternatives shown in the benchmark project.

MartinThoma commented 2 years ago

Thanks to the amazing @mara004, you can find it https://github.com/mindee/doctr/pull/829#issuecomment-1133983339 :)

Something like this?

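import pypdfium2 as pdfium


def get_pdfium_text(filepath: str) -> str:
    text = ""
    doc = pdfium.PdfDocument(filepath)
    # Iterate over page indices; len(doc) alone is an int and not iterable.
    for page_num in range(len(doc)):
        textpage = doc.get_textpage(page_num)
        text += textpage.get_text()
    return text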

pypdfium2 is currently unable to extract text in special writing systems like Hindi.

Do you have a sample PDF? I'm always interested in extending PyPDF2 test cases / the benchmarks :smile:

mara004 commented 2 years ago

Something like this?

For the old API, yes. You may also want to insert a newline character after each page, and call doc.close() at the end. With the new API, getting the text page would work a little differently, though.
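
A rough sketch of those two suggestions, written against a duck-typed document object so it does not pin any particular pypdfium2 version. The helper name extract_text_by_page is hypothetical, and doc is assumed to expose only the old-API calls used in the snippet above (len(doc), doc.get_textpage(index), textpage.get_text(), doc.close()).

```python
from typing import Any


def extract_text_by_page(doc: Any, page_separator: str = "\n") -> str:
    """Concatenate page texts, adding a separator after each page.

    `doc` is assumed to offer the old-API calls from the thread:
    len(doc), doc.get_textpage(index), textpage.get_text(), doc.close().
    The document is closed even if extraction fails part-way.
    """
    try:
        pages = []
        for page_num in range(len(doc)):
            textpage = doc.get_textpage(page_num)
            pages.append(textpage.get_text())
        # Append the separator after every page, including the last one.
        return "".join(page + page_separator for page in pages)
    finally:
        doc.close()
```

Collecting the pages in a list and joining once avoids repeated string concatenation on long documents.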

Do you have a sample PDF? I'm always interested in extending PyPDF2 test cases / the benchmarks :smile:

Sure, a sample document is attached here (generated by pypdfium2's test suite).

MartinThoma commented 2 years ago

I've added PDFium to the text extraction benchmarks: https://github.com/py-pdf/benchmarks

The gist of it:

Speed is CRAZY! Really well done! PDFium shares the first place with PyMuPDF
Quality is top-notch. Similar to Tika / PyMuPDF.

The quality is so good that I'm now going over the differences to see whether I need to adjust the ground truth. The scores might change a bit today (in favor of PDFium). What I notice so far:

Spacing (newlines) seems to be worse than Tika
PDFium uses weird dashes (hyphens, \ufffe to be exact) when line-breaking dashes are used
PDFium sometimes misses complete blocks of text, e.g. for links (footnotes?)
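
If the \ufffe markers get in the way when comparing against a ground truth, one possible post-processing step is to map them to a regular hyphen. This is only a sketch based on the observation above; the function name is hypothetical, and whether "-" (or an empty string) is the right replacement depends on how the ground truth treats line-breaking hyphens.

```python
# Marker reported above for line-breaking hyphens in PDFium's output.
PDFIUM_HYPHEN_MARKER = "\ufffe"


def normalize_line_break_hyphens(text: str, replacement: str = "-") -> str:
    """Replace every U+FFFE marker in `text` with `replacement`."""
    return text.replace(PDFIUM_HYPHEN_MARKER, replacement)
```

Passing replacement="" drops the marker entirely, which may be closer to what a reader expects when a word was only split for layout reasons.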

mara004 commented 2 years ago

Thanks for the benchmark! I'm happy that someone is looking into the text extraction feature more thoroughly, because I don't personally use it in my projects yet. For the problems you mentioned, can you please point me at the files in question? Perhaps we can ask upstream about it.

mara004 commented 2 years ago

Looking at the quality results, I see pypdfium2 is almost equal to pymupdf for most documents, except for sample 13, where it only has 64% coverage. After downloading the document and running pypdfium2 extract-text, I can see why. There's a decoding error on page 3:

Traceback (most recent call last):
  File "/home/mara/.local/bin/pypdfium2", line 8, in <module>
    sys.exit(main())
  File "/home/mara/.local/lib/python3.8/site-packages/pypdfium2/_cli/main.py", line 71, in main
    Subcommands[args.subcommand].main(args)
  File "/home/mara/.local/lib/python3.8/site-packages/pypdfium2/_cli/extract_text.py", line 41, in main
    text = textpage.get_text()
  File "/home/mara/.local/lib/python3.8/site-packages/pypdfium2/_helpers/textpage.py", line 55, in get_text
    text = bytes(c_array).decode("utf-16-le")[:-1]
  File "/usr/lib/python3.8/encodings/utf_16_le.py", line 16, in decode
    return codecs.utf_16_le_decode(input, errors, True)
UnicodeDecodeError: 'utf-16-le' codec can't decode bytes in position 3446-3447: illegal UTF-16 surrogate

mara004 commented 2 years ago

I was able to fix the issue by adding errors="ignore" to the decode() call and will push a commit to the rewrite branch. It may be that the document in question is not perfectly valid?
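
To illustrate what the errors argument changes, here is a small standalone sketch using only Python's built-in bytes.decode(). The helper name decode_utf16le and the sample bytes are made up for the example; this is not the actual pypdfium2 code.

```python
def decode_utf16le(raw: bytes, errors: str = "ignore") -> str:
    """Decode UTF-16-LE bytes, skipping invalid code units by default."""
    return raw.decode("utf-16-le", errors=errors)


# "a", then a lone high surrogate (U+D800), then "b": not valid UTF-16.
broken = b"a\x00" + b"\x00\xd8" + b"b\x00"

try:
    decode_utf16le(broken, errors="strict")
except UnicodeDecodeError as exc:
    print("strict decoding failed:", exc.reason)

# With errors="ignore" the offending code unit is silently dropped.
print(repr(decode_utf16le(broken)))

# With errors="replace" it becomes U+FFFD, so the damage stays visible.
print(repr(decode_utf16le(broken, errors="replace")))
```

Note that "ignore" loses characters without any trace, so "replace" can be preferable when you want to spot damaged text later.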

MartinThoma commented 2 years ago

It may be that the document in question is not perfectly valid?

Yes, that is very likely! I think that was also the document PyPDF2 struggled with. I still want to have it in the benchmark as invalid PDF documents are sadly pretty common.

MartinThoma commented 2 years ago

PDFium sometimes misses complete blocks of text, e.g. for links (footnotes?)

About that part: I actually like the behavior of PDFium better than Tika. It's about hyperlinks in the document. Tika adds them to the bottom of the extract, PDFium skips them. I think they should be skipped. I'm adjusting the ground truth.

mara004 commented 2 years ago

FYI, I just made a bugfix release containing the errors="ignore" change, so you may re-run the benchmark if you like.

MartinThoma commented 2 years ago

Very nice! Well done!

The latest benchmark results show that PDFium is now a little bit better than PyMuPDF at extracting text from English/German documents (changes of the extracted text). It is still behind Tika, but not by much.

The main part that changed:

PyPDF2 has also improved, but is still noticeably behind Tika/PyMuPDF/PDFium. We will get closer again with the 2.1.0 release (expected end of June) :-)

MartinThoma commented 2 years ago

I'm sorry for the doctr folks that we hijacked this issue :sweat_smile:

I was actually wondering whether a meta-package would be useful. Similar to how matplotlib allows you to choose different backends, you could do the same for processing PDF documents. PyPDF2 could be a reasonable fallback if using Java / C++ or some of the licenses is not acceptable, but if it is acceptable, you could use a faster backend like PyMuPDF (I'm actually not sure about what PDFium uses under the hood ... I guess C++? I also don't know which licenses PDFium and its dependencies have)
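
Just to make the idea concrete, backend selection could look roughly like the sketch below. No such meta-package exists as far as this thread is concerned; the function name pick_backend and the preference order are assumptions. The candidate strings are the import names of pypdfium2, PyMuPDF (imported as fitz) and PyPDF2.

```python
import importlib.util
from typing import Optional, Sequence

# Assumed preference order: compiled backends first, pure Python last.
DEFAULT_BACKENDS = ("pypdfium2", "fitz", "PyPDF2")


def pick_backend(
    preferred: Optional[str] = None,
    candidates: Sequence[str] = DEFAULT_BACKENDS,
) -> str:
    """Return the import name of the first installed backend.

    If `preferred` is given, it must be installed; otherwise the
    candidates are tried in order and the first available one wins.
    """
    if preferred is not None:
        if importlib.util.find_spec(preferred) is None:
            raise ImportError(f"requested PDF backend {preferred!r} is not installed")
        return preferred
    for name in candidates:
        if importlib.util.find_spec(name) is not None:
            return name
    raise ImportError("no supported PDF backend is installed")
```

importlib.util.find_spec only checks whether a module can be found, so picking a backend does not import any of the heavier libraries.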

frgfm commented 2 years ago

Oh please don't @MartinThoma :) I have to say: I'm thrilled to see that within 48 hours, we got a new benchmark, people discovering a viable option for PDF parsing and having you two in touch 😁

That's exactly the reason why we document this type of comparison on issues like this one! About licensing, that was one of the reasons we checked pypdfium: it's compatible with Apache licenses :)

MartinThoma commented 2 years ago

Is the BSD 3-clause license not compatible with Apache?

Personally, I don't care too much about the license. For most stuff I set MIT because it's easiest to read. For my public stuff I have the attitude: "Do whatever you want with it, but (1) don't sue me if things break (2) don't claim that I endorse your project if I didn't (3) if my software solves a core problem of your software, I would appreciate if you give credit - not strictly required, but appreciated / seems fair".

However, as PyPDF2 is already a bit older and over 100 people contributed to it, I'm uncertain what it would mean to change the license / add a new license. I don't know whom I would need to ask for permission. In the worst case, all contributors... which would be infeasible, as I will not be able to reach all of them.

mara004 commented 2 years ago

(I'm actually not sure about what PDFium uses under the hood ... I guess C++? I also don't know which licenses PDFium and its dependencies have)

Under the hood it's C++17 indeed. The public headers, however, are C only (luckily, otherwise it wouldn't be possible to use ctypes for the bindings). Concerning licenses, PDFium itself is BSD-3-Clause or Apache-2.0, at the embedder's choice. For its dependencies, a variety of non-copyleft licenses apply, which should be listed here (I hope I didn't miss one).

Is the BSD 3-clause license not compatible with Apache?

I'm not a lawyer, but as far as I'm aware, they are perfectly compatible.

I'm uncertain what it would mean to change the license / add a new license. I don't know whom I would need to ask for permission. In the worst case, all contributors... which would be infeasible, as I will not be able to reach all of them.

I think changing the licensing of pypdf2 is neither necessary nor feasible, as you would indeed need the written agreement of all contributors. In any case, there's nothing wrong with BSD-3-Clause, is there?

MartinThoma commented 2 years ago

@mara004 Congratulations! PDFium is now in first place! https://github.com/py-pdf/benchmarks - I've decided that text extraction should NOT add the target of a link - only the text of the link. That changed the order a bit (but I need to check if Tika has ways to customize its extraction format). Either way, PDFium does a great job

frgfm commented 2 years ago

Is the BSD 3-clause license not compatible with Apache?

It is, actually. BSD is lesser known, I guess, because it's a bit outside the MIT/Apache/GPL trio, but they are compatible :) But you are both correct in the sense that, if one of those libraries selects the wrong license (e.g. because a dependency has an incompatible license), it can become a mess :/

Glad to read about the good perf on your benchmark :)

mindee / doctr

[documents] Benchmark PDF document reading + numpy conversion options #23