useblocks / libpdf

Extract structured data from PDFs
MIT License

Adding tests for rect extractions #36

Closed kreuzberger closed 7 months ago

kreuzberger commented 8 months ago

Added tests for rect extraction from a PDF generated by sphinx-simplepdf / weasyprint. The tests check textbox extraction from code blocks, admonitions and tables.

The table tests did not work as expected. Instead of each colored table cell being extracted as a rect, each of the 3 table rows shown with alternating colors is extracted as a whole. Attached is a picture from visual debug.

The tests now pass, assuming the "wrong" number of rects (I would expect 7). See the attached file. All other extractions work as expected. Attachment: libpdf_rect_from_table

kreuzberger commented 8 months ago

The failing test has nothing to do with the test implementation; the test is missing a required executable! This also explains why I had to patch the ImageMagick policies under /etc on my Ubuntu machine.

This seems to be a feature of the visual debug. A hint in the doc would help.
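
A minimal sketch of how a test could detect a missing executable up front instead of failing obscurely. The executable names below are assumptions for illustration only (the thread does not say which binary was missing), and `missing_executables` is a hypothetical helper, not part of libpdf.

```python
import shutil

# Assumption: these names are illustrative. The thread does not state
# which executable the visual debug feature actually requires.
ASSUMED_EXECUTABLES = ("convert", "gs")


def missing_executables(names=ASSUMED_EXECUTABLES):
    """Return the subset of `names` that cannot be found on PATH."""
    return [name for name in names if shutil.which(name) is None]


if __name__ == "__main__":
    missing = missing_executables()
    if missing:
        print("visual debug may fail, missing executables:", ", ".join(missing))
    else:
        print("all assumed executables were found on PATH")
```

A test that relies on visual debug could call such a helper and skip itself with a clear message when the list is not empty.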

ubmarco commented 8 months ago

Thanks a lot for your PR, I really appreciate new tests for the library.

I cannot push to your branch as the fork is created on your organization, not on your personal account. So I added a commit to your branch and created a new PR from it to see the changes in CI https://github.com/useblocks/libpdf/pull/37. The ruff linting is non-voting for now, but I want to enable it over time for more and more files. Once a file is touched I will add it to the files in the tox.ini lint environment.

ubmarco commented 8 months ago

PR https://github.com/useblocks/libpdf/pull/37 fails as expected. I propose to cherry-pick my commits into your branch and fix the mentioned issues from my review.

ubmarco commented 8 months ago

For the rect count in your PDF: this is how the PDF is made. The header row actually has 3 rectangles, while the row is just one. If you zoom in far enough, you can also see it, e.g. here in Firefox's pdf.js-based reader: [image] If you look closely, you see bright vertical lines in the header, but not in the row.
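
To illustrate the point about rect counts, here is a small sketch that groups rectangles by their vertical extent and counts them per row. The coordinates are made up for illustration and are not taken from the PDF in this PR; `rects_per_row` is a hypothetical helper, not a libpdf function.

```python
from collections import Counter


def rects_per_row(rects):
    """Count rects per row, where a row is identified by its (top, bottom) pair.

    Each rect is a tuple (x0, top, x1, bottom).
    """
    counts = Counter((top, bottom) for _x0, top, _x1, bottom in rects)
    return dict(counts)


# Hypothetical layout: a header drawn as 3 separate cells,
# and a body row drawn as a single full-width rectangle.
header = [(0, 0, 100, 20), (100, 0, 200, 20), (200, 0, 300, 20)]
body_row = [(0, 20, 300, 40)]

print(rects_per_row(header + body_row))
```

With this layout the header band contributes 3 rects and the body row contributes 1, so a count based on visible cells would overestimate the number of rects that are actually drawn.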

kreuzberger commented 7 months ago

Almost there. If I set visual_debug to True, the debugging is configured. This is my main intention.