microsoft / OCR-Form-Tools

A set of tools to use in Microsoft Azure Form Recognizer and OCR services.
MIT License

pdf_renderer: use pypdfium2 rather than deprecated pypdfium #1010

Closed: mara004 closed this 6 months ago

mara004 commented 2 years ago

Hello,

I'm a former maintainer of pypdfium and now a co-author of pypdfium2. I noticed that this project uses pypdfium to rasterise PDFs, but that package is now deprecated and has been succeeded by pypdfium2. We have made several modernisations, such as platform-specific wheel builds, automatic pdfium init/deinit calls, and a small, Pythonic support model API to facilitate rendering PDFs. pypdfium2 will be updated on a regular basis, while no further releases are planned for pypdfium.

This patch modifies utils/pdf_renderer.py to use pypdfium2 with the new support model API. If you would rather keep using the raw PDFium API, that is still possible.

https://github.com/pypdfium2-team/pypdfium2 https://pypi.org/project/pypdfium2/

mara004 commented 1 year ago

I just updated this PR to include the newer preprocessor/pdf_renderer.py, but you should really change your API to de-duplicate the code and load the document only once. It doesn't make sense at all to re-load the document in a separate method just to get page count. You may also want to take a look at pypdfium2's documentation; it provides a multi-page renderer with concurrency that may be more suitable for your use case.
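To illustrate the "load the document only once" suggestion, here is a minimal sketch, not the code from this PR. The `LoadedPdf` class, its method names, and the `opener` parameter are hypothetical names introduced for illustration. The default opener assumes a pypdfium2 v4-style API (`PdfDocument(path)`, `len(pdf)`, `pdf[i].render(scale=...).to_pil()`); older releases expose rendering differently, so check it against the version you have pinned.

```python
from typing import Any, Callable, Iterator


def _open_with_pypdfium2(path: str) -> Any:
    # Imported lazily so the wrapper can be exercised without pypdfium2 installed.
    import pypdfium2 as pdfium

    return pdfium.PdfDocument(path)


class LoadedPdf:
    """Hypothetical wrapper that opens a PDF once and reuses the handle."""

    def __init__(self, path: str,
                 opener: Callable[[str], Any] = _open_with_pypdfium2) -> None:
        # The only place where the file is loaded.
        self._doc = opener(path)

    @property
    def page_count(self) -> int:
        # The page count comes from the already-open document, not a second load.
        return len(self._doc)

    def render_page(self, index: int, scale: float = 1.0) -> Any:
        # Assumption: v4-style rendering, page.render(...) followed by to_pil().
        page = self._doc[index]
        return page.render(scale=scale).to_pil()

    def render_all(self, scale: float = 1.0) -> Iterator[Any]:
        # Render every page in order, reusing the same document handle.
        for index in range(self.page_count):
            yield self.render_page(index, scale=scale)
```

With a wrapper along these lines, a caller that needs both the page count and the rendered pages constructs one object and queries it twice, instead of calling two functions that each open the file.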

mara004 commented 1 year ago

Pipfile and requirements.txt still need to be updated properly, but I'm not familiar with this form of dependency pinning. Maybe a project member can finalise this?

mara004 commented 1 year ago

@cschenio @buddhawang

buddhawang commented 1 year ago

@cschenio can you take a look? thanks!

cschenio commented 1 year ago

@mara004 thank you for revisiting this, let's see if I can de-dup the PDF loading logic.

mara004 commented 1 year ago

Thanks for the response! I'll need to update this PR again. It has been quite some time since I first submitted it, and a few things seem outdated now.

mara004 commented 1 year ago

I force-pushed a commit that, I hope, nicely restructures rendering. I ran the test suite, and it seems to pass. Note that I had to replace the expected result because pypdfium2 uses RGB rather than RGBA where possible.
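An alternative to replacing the expected result is to make the comparison independent of the pixel format. The following is only a sketch of that idea, not what this PR does: `images_match` is a hypothetical helper that expects Pillow-style image objects (anything with `size`, `convert()` and `tobytes()`). It relies on the fact that Pillow's `convert("RGB")` discards the alpha channel, so it only gives meaningful results when the RGBA reference is fully opaque.

```python
def images_match(actual, expected) -> bool:
    """Return True if two Pillow-style images have identical RGB pixels.

    Both images are normalised to RGB first, so an opaque RGBA reference
    can be compared against an RGB rendering.
    """
    if actual.size != expected.size:
        return False
    # convert("RGB") drops any alpha channel; tobytes() yields the raw pixel data.
    return actual.convert("RGB").tobytes() == expected.convert("RGB").tobytes()
```

In a test, `actual` would be the freshly rendered page and `expected` the stored reference image, for example one opened with `PIL.Image.open`.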

However, it looks like preprocess_multi_page_bundle() is currently not covered by tests, and I'm not sure how to invoke that function. Could you please check it still works as expected?

mara004 commented 1 year ago

I think this is ready for review again.

cschenio commented 1 year ago

> I think this is ready for review again.

Good to know. I will take a look at it later.

mara004 commented 1 year ago

FYI, I am still planning to release a new major version that will change the rendering API slightly. This will take some time. I plan to update the patch set once pypdfium2 v4 is released.

mara004 commented 1 year ago

Coming back to this: I think the rewrite will still take quite some time, so you could also review and merge this before v4 is released. We can then update your code in a follow-up PR, which will be much smaller than this one.