Table Extraction Option to ignore invisible lines / rects

jsvine / pdfplumber

Plumb a PDF for detailed information about each char, rectangle, line, et cetera — and easily extract text and tables.

MIT License


Table Extraction Option to ignore invisible lines / rects #122

Open kensouchen opened 5 years ago

kensouchen commented 5 years ago

Hi,

I ran into this issue when using your package:

Sometimes a PDF contains invisible lines / rects, which interfere with the table extraction result.

I want to get a table based purely on the explicitly drawn lines, using the explicit line strategy, but I don't think it works.

An easy way to do this would be to add a filter for LTLine / LTRect objects in your parse_object stage that drops lines with stroking_color = 1 (when stroke is True) or non_stroking_color = 1 (when stroke is False).

Also, many "rects" are actually lines (well, every line is a rect). Is there a rect2line method in your implementation? My email is zchen344@bloomberg.net. I might be wrong, so let me spend some more time reading this package. Thanks!

kensouchen commented 5 years ago

Sorry, I just checked: the rect / line to edge method is great. So the filter should be added where the edges are filtered: https://github.com/jsvine/pdfplumber/blob/8fa335247f960b73f0cfd397eab649cda792e23a/pdfplumber/utils.py#L448
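For readers following the thread, the proposed color check might look roughly like the sketch below. It is not pdfplumber code: `is_white` and `is_visible` are hypothetical helper names, and the sketch assumes each object is a dict carrying `object_type`, `stroke`, `stroking_color`, and `non_stroking_color` keys, with colors given as a single number or a short sequence of components. It also treats "white" as "invisible", which only holds on a white page background.

```python
from numbers import Number


def is_white(color):
    """Return True if `color` looks like pure white.

    Assumption: a color is a single number (grayscale), a 1- or
    3-component sequence (gray / RGB, where 1 is white), or a
    4-component sequence (CMYK, where all zeros is white).
    """
    if color is None:
        return False
    if isinstance(color, Number):
        color = (color,)
    try:
        components = tuple(float(c) for c in color)
    except (TypeError, ValueError):
        return False  # e.g. a pattern name rather than numeric components
    if len(components) in (1, 3):
        return all(c == 1.0 for c in components)
    if len(components) == 4:
        return all(c == 0.0 for c in components)
    return False


def is_visible(obj):
    """Keep every object except lines / rects painted in white.

    Mirrors the rule proposed above: check the stroking color when the
    object is stroked, otherwise check the non-stroking (fill) color.
    """
    if obj.get("object_type") not in ("line", "rect"):
        return True
    if obj.get("stroke"):
        return not is_white(obj.get("stroking_color"))
    return not is_white(obj.get("non_stroking_color"))
```

A predicate of this shape could be applied to a list of object dicts with an ordinary list comprehension, for example `[o for o in objects if is_visible(o)]`.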

jsvine commented 3 years ago

A belated thanks for raising this issue. Your proposed solution (filtering out invisible edges) is an interesting one. I worry that it will cause problems for certain tables, where invisible lines are necessary for proper parsing (i.e., they delineate the table structure, but are made color-less so that they’re less distracting to human readers), so I like the idea of adding it as a potential extraction option, rather than the default. Will consider implementing.

jsvine commented 3 years ago

(Just a note that I had initially misread the suggestion and have updated the comment above. Thanks!)

jsvine commented 3 years ago

@kensouchen Do you happen to have a PDF that demonstrates this issue, that you can share, for use as a test?

JBBalling commented 1 year ago

Hello there,

First of all, thank you for this great implementation. I am experimenting with your package and think the proposed feature could help a lot with specific tables. In my experiments, the problem appears when the table header has a background color other than white, especially darker colors like gray.

Is there any news on this feature request?

Here is an example where the background color of the header interferes with the table structure: Bildschirmfoto von 2022-12-06 12-34-18

Here is the pdf file: simple_table.pdf

Thank you in advance!

samkit-jain commented 1 year ago

@JBBalling In your case, the table in the PDF you provided is extracted properly with the following extraction settings:

{
    "vertical_strategy": "lines",
    "horizontal_strategy": "lines",
    "snap_tolerance": 10,
}

Furthermore, if you want to remove certain objects from the page before the table extraction, you can use the page.filter() method. You can find usage details here.
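As a rough illustration of that suggestion, the sketch below filters a page and then extracts a table with the settings quoted above. It relies only on `page.filter(fn)`, which keeps the objects for which `fn` returns a truthy value, and on `page.extract_table(table_settings=...)`. The helper names `extract_table_without` and `is_shaded_block`, the `min_side` threshold, and the `fill` / `stroke` / `width` / `height` keys read from each object are assumptions made for this example, so check them against the objects in your own PDF.

```python
SETTINGS = {
    "vertical_strategy": "lines",
    "horizontal_strategy": "lines",
    "snap_tolerance": 10,
}


def is_shaded_block(obj, min_side=3):
    """True for rects that are filled, not stroked, and not line-thin.

    Hypothetical rule for shaded header backgrounds. The size check
    keeps thin filled rects, which many PDFs use to draw table rules.
    """
    return (
        obj.get("object_type") == "rect"
        and bool(obj.get("fill"))
        and not obj.get("stroke")
        and obj.get("width", 0) > min_side
        and obj.get("height", 0) > min_side
    )


def extract_table_without(page, drop, table_settings=None):
    """Remove objects matching `drop`, then extract a table.

    `page` is expected to behave like a pdfplumber Page.
    """
    filtered = page.filter(lambda obj: not drop(obj))
    return filtered.extract_table(table_settings=table_settings or {})


def demo(path="simple_table.pdf"):
    """Example call; needs pdfplumber installed and the PDF on disk."""
    import pdfplumber  # third-party, imported here so the helpers stay standalone

    with pdfplumber.open(path) as pdf:
        return extract_table_without(pdf.pages[0], is_shaded_block, SETTINGS)
```

Whether dropping such rects helps depends on the document: if the shaded block's borders are the only horizontal rules around the header, removing it may also remove edges the "lines" strategy needs.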