useblocks / libpdf

Extract structured data from PDFs
MIT License

Color Information for Paragraphs #25

Closed kreuzberger closed 8 months ago

kreuzberger commented 9 months ago

For deeper paragraph analysis, I want to get information about the paragraph background color, e.g. to check for a rendered code example. At which point in textbox.py could I get color information from the PDFObj "behind the scenes", or is that already too late? Where would be a good place to get this information?

ubmarco commented 9 months ago

I guess this will be difficult given the PDF standard in general. There are single characters with a certain style (font, size, and also color), and there are graphics such as rectangles and lines. A colored background box is not coupled to the characters. The only solution I see is to find the paragraphs and their coordinates and match them against graphics/figures in the same area.
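The coordinate matching suggested here could be sketched roughly as follows. This is only an illustration, not libpdf code: the `Box` type, `overlap_ratio`, `background_for` and the 0.9 threshold are hypothetical names and values, and the coordinates are assumed to be measured from the top-left of the page.

```python
from typing import NamedTuple, Optional, Sequence


class Box(NamedTuple):
    """Axis-aligned box in page coordinates (x0 < x1, top < bottom). Hypothetical type."""
    x0: float
    top: float
    x1: float
    bottom: float


def overlap_ratio(inner: Box, outer: Box) -> float:
    """Fraction of `inner`'s area that lies inside `outer` (0.0 to 1.0)."""
    width = min(inner.x1, outer.x1) - max(inner.x0, outer.x0)
    height = min(inner.bottom, outer.bottom) - max(inner.top, outer.top)
    inner_area = (inner.x1 - inner.x0) * (inner.bottom - inner.top)
    if width <= 0 or height <= 0 or inner_area <= 0:
        return 0.0
    return (width * height) / inner_area


def background_for(paragraph: Box, graphics: Sequence[Box],
                   threshold: float = 0.9) -> Optional[Box]:
    """Return the first graphic that covers at least `threshold` of the paragraph."""
    for graphic in graphics:
        if overlap_ratio(paragraph, graphic) >= threshold:
            return graphic
    return None


# Example values are made up for illustration.
code_box = Box(56.25, 248.75, 539.03, 415.84)
paragraph_in_box = Box(60.0, 260.0, 500.0, 400.0)
paragraph_elsewhere = Box(60.0, 500.0, 500.0, 520.0)
```

A paragraph that lies fully inside the colored rectangle is matched to it, while a paragraph elsewhere on the page gets `None`. Whether "first match wins" is good enough depends on how the PDF generator stacks its graphics.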

kreuzberger commented 9 months ago

I can follow that, but...

Which tool on an Ubuntu/Linux platform would be best for analyzing the generated PDF? Currently my PDFs are generated from HTML with weasyprint (https://github.com/kozea/weasyprint), so maybe I will look for configuration possibilities there too.

Storing the PDFs uncompressed does not really help me with debugging either.

ubmarco commented 8 months ago

I also want to know how background rectangles/figures are represented. However time is a scarce resource currently... In the meantime: If libpdf does not provide the necessary information, you could try the latest version of the underlying libraries pdfminer and pdfplumber. I will update you once I find the time to look into this.

ubmarco commented 8 months ago

Can you provide a minimal example of a PDF? Are you using this?

kreuzberger commented 8 months ago

My main intention is to check PDFs generated with weasyprint and sphinx-simplepdf, see https://github.com/useblocks/sphinx-simplepdf/issues/83

Currently I am using the patched version I provided as a pull request, with the "current" libraries from pdfminer and pdfplumber. I am testing with my project documents to evaluate WHAT could be tested and WHAT makes sense to be tested.

I will provide a simple example containing the code example, but the question is where to provide it: in the sphinx-simplepdf project or here. Both make sense. As maintainer, it is up to you to decide.

kreuzberger commented 8 months ago

I think one reason could also be that the patched version I use has some errors in testing. I just ran the tests today.

platform linux -- Python 3.11.2, pytest-7.4.4, pluggy-1.3.0
rootdir: /src/github/libpdf
configfile: tox.ini
plugins: bdd-7.0.1
collected 22 items

tests/test_api.py ...                    [ 13%]
tests/test_catalog.py .F.                [ 27%]
tests/test_cli.py ..                     [ 36%]
tests/test_details.py .                  [ 40%]
tests/test_ds93_chapter.py .             [ 45%]
tests/test_figures.py FFF                [ 59%]
tests/test_full_features.py .......      [ 90%]
tests/test_import.py .                   [ 95%]
tests/test_tables.py .

The test errors caused by the ValueError in the catalog extraction seem to be identical to the errors with my PDF. I will investigate what the problem could be. Maybe this is an error in one of the libraries, maybe something else. I will check it on the forked branch.

The error in the figures could also be the reason why the box is not identified correctly. When processing my PDF with no_annotations I got no errors, but with annotations enabled I got the same ValueErrors as in test_catalog.py.

kreuzberger commented 8 months ago

Fixed the failing tests in test_catalog.py by checking for valid bboxes before continuing processing.

The figure tests are mysterious to me.

The first test wants to check that only figures with valid bboxes are extracted, but the (one) figure in the PDF has (judging by the test file name) an invalid bbox. Its height is 0, so the figure is (correctly) filtered out of the figure list.

So maybe check whether the tests themselves are valid.
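For illustration, a bbox validity check of the kind described in this comment might look like the sketch below. It is not the code from the pull request: `has_valid_bbox` is a hypothetical helper, and the dictionary keys are assumed to follow the pdfplumber object layout (`x0`, `x1`, `top`, `bottom`).

```python
def has_valid_bbox(obj: dict) -> bool:
    """True if the object spans a non-empty area (positive width and height)."""
    return obj["x1"] > obj["x0"] and obj["bottom"] > obj["top"]


# Made-up example objects: the second one has height 0, like the figure in the test PDF.
figures = [
    {"x0": 56.25, "x1": 539.03, "top": 248.75, "bottom": 415.84},
    {"x0": 100.0, "x1": 300.0, "top": 500.0, "bottom": 500.0},
]

# Keep only objects that can be processed further without raising on an empty box.
valid_figures = [figure for figure in figures if has_valid_bbox(figure)]
```

With such a filter, a zero-height figure never reaches the extraction step, which would explain why a test that expects it in the figure list fails.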

kreuzberger commented 8 months ago

After analyzing the pages with pdfplumber, I see rect objects like this:

"rects": [
        {
          "x0": 56.25,
          "y0": 426.04943199999997,
          "x1": 539.02559025,
          "y1": 593.1397637499999,
          "width": 482.77559025000005,
          "height": 167.09033174999996,
          "pts": [
            [
              56.25,
              248.75000025000008
            ],
            [
              539.02559025,
              248.75000025000008
            ],
            [
              539.02559025,
              415.84033200000005
            ],
            [
              56.25,
              415.84033200000005
            ]
          ],
          "linewidth": 0,
          "stroke": 0,
          "fill": 1,
          "evenodd": 0,
          "stroking_color": null,
          "non_stroking_color": [
            0.858824,
            0.980392,
            0.956863
          ],
          "mcid": null,
          "tag": null,
          "object_type": "rect",
          "page_number": 4,
          "stroking_pattern": null,
          "non_stroking_pattern": null,
          "top": 248.75000025000008,
          "bottom": 415.84033200000005,
          "doctop": 248.75000025000008
        },

The non-stroking color seems to match exactly the RGB value defined for this background. So I am trying to find out where to get those rect objects.
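To compare such a value with a CSS color, the 0..1 components can be converted to a hex string. The helper below is a minimal sketch, assuming a three-component RGB value as in the rect above; `to_hex` is a hypothetical name, and grayscale or CMYK values are deliberately not handled.

```python
def to_hex(color):
    """Convert an RGB color with components in 0..1 to a '#rrggbb' string."""
    if color is None or len(color) != 3:
        return None  # missing, grayscale or CMYK colors are out of scope here
    return "#" + "".join(f"{round(component * 255):02x}" for component in color)


# Values copied from the rect object shown above.
rect = {"fill": 1, "non_stroking_color": [0.858824, 0.980392, 0.956863]}
hex_color = to_hex(rect["non_stroking_color"])
print(hex_color)  # "#dbfaf4"
```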

kreuzberger commented 8 months ago

I integrated a solution that extracts the "rects" from pdfplumber as a separate type in libpdf, like figures, tables etc. That means the content of the "rects" is also removed from chapters/paragraphs, as is done for tables and figures. Could this be a solution, or should the "rects" from pdfplumber be mapped to the "figures" of libpdf? Both solutions make sense, but I think I would prefer mapping the pdfplumber type to a libpdf type "rect".

Text extraction also needs to be clarified. Figure text is removed from chapters/paragraphs. I think for rects we should

What do you think?

kreuzberger commented 8 months ago

Added the rects extraction as arguments to main, mapping rect objects as their "own" type.

no_rects: extraction of rects is disabled (default False)
crop_rects_text: rects text is removed from chapters/paragraphs if true, else duplicated in rects/paragraph (default False)
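The intended behavior of the two options can be illustrated with a small stand-alone sketch. This is not the libpdf implementation: `split_text` and its simplified data model (text lines as `(text, top)` pairs, rects as `(top, bottom)` pairs) are assumptions made only for this example.

```python
def split_text(lines, rects, no_rects=False, crop_rects_text=False):
    """Distribute text lines between paragraphs and rects.

    lines: list of (text, top) pairs; rects: list of (top, bottom) pairs.
    Returns (paragraph_texts, rect_texts).
    """
    if no_rects:
        # Rect extraction disabled: everything stays in the paragraphs.
        return [text for text, _ in lines], []
    paragraph_texts, rect_texts = [], []
    for text, top in lines:
        inside = any(r_top <= top <= r_bottom for r_top, r_bottom in rects)
        if inside:
            rect_texts.append(text)
        # Text inside a rect is kept in the paragraphs unless cropping is requested.
        if not inside or not crop_rects_text:
            paragraph_texts.append(text)
    return paragraph_texts, rect_texts


# Made-up example: one normal line and one line inside a colored code box.
lines = [("Intro", 100.0), ("print('hi')", 300.0)]
rects = [(248.75, 415.84)]

duplicated = split_text(lines, rects)
cropped = split_text(lines, rects, crop_rects_text=True)
disabled = split_text(lines, rects, no_rects=True)
```

With the defaults the code line appears in both the paragraphs and the rects, with `crop_rects_text=True` it appears only in the rects, and with `no_rects=True` no rects are reported at all.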

All tests pass when the test_figures tests are excluded. The assumption here is that the old pdfplumber handled rects and figures differently, or that the "rects" were mapped differently to "figures".

All changes were pushed to PR #24 and got merged into #30.

juiwenchen commented 8 months ago

Rect is introduced in the following PR. Credit to @kreuzberger.

https://github.com/useblocks/libpdf/pull/30