useblocks / libpdf

Extract structured data from PDFs
MIT License
8 stars 2 forks source link

Rect model #30

Closed juiwenchen closed 8 months ago

juiwenchen commented 8 months ago

@kreuzberger I cherry picked your branch upgrade to this PR in order to make the change atomic. The credit for this rect feature is for you.

https://github.com/useblocks/libpdf/issues/25

kreuzberger commented 8 months ago

Hi! Thanks for integration!. If found a small issue in core.py

LOG.info('Text rects crop: %s', 'no' if crop_rects_text else 'no') The first one should be a "yes"

But only affects logging, so very minor

ubmarco commented 8 months ago

Please rebase to check the new format-check tox env.

juiwenchen commented 8 months ago

Hi! Thanks for integration!. If found a small issue in core.py

LOG.info('Text rects crop: %s', 'no' if crop_rects_text else 'no') The first one should be a "yes"

But only affects logging, so very minor

@kreuzberger Sorry, I didn't pay attention to the flag crop_rects_text. In our internal discussion, we would still like to consider the text in the coverage of rects as paragraphs or chapters. The reason is that chapters and paragraphs may also have the background color. I have adapted this concept in this RP, so rects are not excluded in chapter and paragraphs extraction. What do you think?

kreuzberger commented 8 months ago

Hi! Thanks for integration!. If found a small issue in core.py LOG.info('Text rects crop: %s', 'no' if crop_rects_text else 'no') The first one should be a "yes" But only affects logging, so very minor

@kreuzberger Sorry, I didn't pay attention to the flag crop_rects_text. In our internal discussion, we would still like to consider the text in the coverage of rects as paragraphs or chapters. The reason is that chapters and paragraphs may also have the background color. I have adapted this concept in this RP, so rects are not excluded in chapter and paragraphs extraction. What do you think?

@juiwenchen I think this is a good concept, rects should not be excluded in chapter and paragrapsh by default. This option was added cause the old figure handling behaved different (figure text was removed from paragraphs and chapters) and i wanted it therefore as an option. So this options could also be removed (crop_rects_text).

juiwenchen commented 8 months ago

Hi! Thanks for integration!. If found a small issue in core.py LOG.info('Text rects crop: %s', 'no' if crop_rects_text else 'no') The first one should be a "yes" But only affects logging, so very minor

@kreuzberger Sorry, I didn't pay attention to the flag crop_rects_text. In our internal discussion, we would still like to consider the text in the coverage of rects as paragraphs or chapters. The reason is that chapters and paragraphs may also have the background color. I have adapted this concept in this RP, so rects are not excluded in chapter and paragraphs extraction. What do you think?

@juiwenchen I think this is a good concept, rects should not be excluded in chapter and paragrapsh by default. This option was added cause the old figure handling behaved different (figure text was removed from paragraphs and chapters) and i wanted it therefore as an option. So this options could also be removed (crop_rects_text).

@kreuzberger I finalized the PR. Apart from the above-mentioned change, I adjusted the rect model that only one textbox at maximum can be in the rect. In this case, only the text covered in the rect is extracted to a newly instantiated textbox as I don't know what is the best way to address the lt_textboxes which are overflowed the rect, so it is the simplest solution from our side. What do you think?

The following is to summarize the changes in this PR based on your commit. If you are happy with this PR, do you mind running it against your test case and let us know if we should merge it.

kreuzberger commented 8 months ago

Hi! I tried to test the branch, and it failed. It seems that the sphinx-simplepdf files itself will fail in the initial parse. This was fixed in the original pr

pdf = <pdfplumber.pdf.PDF object at 0x7fb9e9f3a350>

    def get_named_destination(pdf):  # pylint: disable=too-many-branches
        """
        Extract Name destination catalog.

        Extracts Name destination catalog (link target) from pdf.doc.catalog['Name'] to obtain
        the coordinates (x,y) and page for the corresponding destination's name.

        PDFPlumber does not provide explict 'Named Destinations of Document Catalog' like py2pdf, so it needs to be obtained
        by resolving the hierarchical indirect objects.

        The first step in this function is to check if the name destination exist in the PDF. If it does not, no extraction
        is executed.

        :param pdf: pdf object of pdfplumber.pdf.PDF
        :return: named destination dictionary mapping reference of destination by name object
        """
        LOG.info("Catalog extraction: name destination ...")

        # check if name tree exist in catalog and extract name tree
        name_tree = {}
        named_destination = {}
        pdf_catalog = pdf.doc.catalog
        if "Names" in pdf_catalog:
            # PDF 1.2
            if (
                isinstance(pdf_catalog["Names"], PDFObjRef)
                and "Dests" in pdf_catalog["Names"].resolve()
            ):
                name_tree = pdf_catalog["Names"].resolve()["Dests"].resolve()
            elif isinstance(pdf_catalog["Names"], dict) and "Dests" in pdf_catalog["Names"]:
>               name_tree = pdf_catalog["Names"]["Dests"].resolve()
E               AttributeError: 'dict' object has no attribute 'resolve

The orignal code in the PR was:

   if 'Names' in pdf_catalog:
        # PDF 1.2
        if isinstance(pdf_catalog['Names'], PDFObjRef) and 'Dests' in pdf_catalog['Names'].resolve():
            name_tree = pdf_catalog['Names'].resolve()['Dests'].resolve()
        elif isinstance(pdf_catalog['Names'], dict) and 'Dests' in pdf_catalog['Names']:
            name_tree = resolve1(pdf_catalog['Names']['Dests'])
            # name_tree = pdf_catalog['Names']['Dests'].resolve()
            # LOG.debug(f"{name_tree}")

@juiwenchen Suggestion: we add a test for it in this repository on this branch? test_rects_extraction.pdf

by adding tests/test_rects.py and test this suggested pdf?

And how can i support this?

juiwenchen commented 8 months ago

PDF 1.2

@kreuzberger Interesting. I will have a quick fix from your original PR, and then we can merge this PR first. Afterwards, you can create a PR to add the test case for test_rects_extraction.pdf. What do you think

kreuzberger commented 8 months ago

i would suggest to add a testcase before merge to ensure the file could be opened Could be simple. Here are the file contents:

"""Test rects extraction."""
from click.testing import CliRunner

import libpdf
from tests.conftest import (
    PDF_RECTS_EXTRACTION,
)

def test_rects_extraction():
    objects = libpdf.load(PDF_RECTS_EXTRACTION)
    assert objects.flattened.rects is not None

I would then add tests on a new PR. But also ok if i do it all later.

While running the tests i got severe problems executing them:

wand.exceptions.PolicyError: attempt to perform an operation not allowed by the security policy

I have to patch as superuser my etc/configs to get it running! See https://bugs.archlinux.org/task/60580 Was this due to changes in the code / libraries used? I didnt run into those problems before...

kreuzberger commented 8 months ago

@juiwenchen ok, after latest update the test runs locally! You can merge, i add the tests later :smile: :+1: