Open kensouchen opened 5 years ago
Sorry, I just checked: the rect/line-to-edge method is great. The filter should then be added here, at the point where the edges are filtered: https://github.com/jsvine/pdfplumber/blob/8fa335247f960b73f0cfd397eab649cda792e23a/pdfplumber/utils.py#L448
A belated thanks for raising this issue. Your proposed solution (filtering out invisible edges) is an interesting one. I worry that it will cause problems for certain tables, where invisible lines are necessary for proper parsing (i.e., they delineate the table structure, but are made color-less so that they’re less distracting to human readers), so I like the idea of adding it as a potential extraction option, rather than the default. Will consider implementing.
(Just a note that I had initially misread the suggestion and have updated the comment above. Thanks!)
@kensouchen Do you happen to have a PDF that demonstrates this issue, that you can share, for use as a test?
Hello there,
first of all, thank you for this great implementation. I am experimenting with your package and think the proposed feature can help a lot on specific tables. In my experiments, the problem appears when the table header has a different background color than white, especially darker colors like gray.
Is there any news on this feature request?
Here is an example where the background color of the header implies the table structure:
Here is the pdf file: simple_table.pdf
Thank you in advance!
@JBBalling In your case, the table in the PDF you provided is extracted properly with the following extraction settings:
{
"vertical_strategy": "lines",
"horizontal_strategy": "lines",
"snap_tolerance": 10,
}
Furthermore, if you want to remove certain objects from the page before table extraction, you can use the page.filter() method. You can find usage details here.
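As a rough sketch of how the two suggestions above might be combined: the settings dictionary is the one quoted above, while `keep_object`, its `min_size` threshold, and `extract_filtered_tables` are hypothetical names introduced only for illustration. The predicate shown (dropping very small rects) is just one example of a filter; it assumes rect objects carry `width` and `height` keys.

```python
TABLE_SETTINGS = {
    "vertical_strategy": "lines",
    "horizontal_strategy": "lines",
    "snap_tolerance": 10,
}


def keep_object(obj, min_size=1.0):
    """Hypothetical predicate: drop rects smaller than min_size in both dimensions."""
    if obj.get("object_type") != "rect":
        return True  # leave chars, lines, curves, etc. untouched
    return obj.get("width", 0) >= min_size or obj.get("height", 0) >= min_size


def extract_filtered_tables(pdf_path, page_number=0):
    """Filter one page with keep_object, then extract its tables."""
    import pdfplumber  # imported here so the predicate can be tried without a PDF

    with pdfplumber.open(pdf_path) as pdf:
        page = pdf.pages[page_number]
        filtered_page = page.filter(keep_object)
        return filtered_page.extract_tables(table_settings=TABLE_SETTINGS)
```

Calling `extract_filtered_tables("simple_table.pdf")` would then run the extraction on the first page; swapping in a different predicate changes which objects are removed beforehand.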
Hi,
I met this issue when using your package:
Sometimes a PDF has invisible lines/rects, which interfere with the table extraction result.
I want to get a table based purely on explicit lines, using the explicit line strategy, but I don't think it works.
An easy way to do this would be to add a filter for LTLine/LTRect objects in your parse_object stage, filtering out lines with stroking_color = 1 (when stroke is True) or non_stroking_color = 1 (when stroke is False).
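The color test described above could look roughly like the following. This is a minimal sketch, not pdfplumber's implementation: `is_white` and `is_invisible_line` are hypothetical helpers, and the code assumes that each object is a dict exposing `object_type`, `stroke`, `stroking_color`, and `non_stroking_color`, and that "white" means 1 in grayscale, (1, 1, 1) in RGB, or (0, 0, 0, 0) in CMYK. It also assumes a white page background.

```python
def is_white(color):
    """Return True if a color value looks like white (assumed encodings only)."""
    if color is None:
        return False
    if isinstance(color, (int, float)):
        return color == 1  # grayscale scalar
    values = tuple(color)
    if len(values) in (1, 3):
        return all(v == 1 for v in values)  # grayscale tuple or RGB
    if len(values) == 4:
        return all(v == 0 for v in values)  # CMYK with no ink
    return False


def is_invisible_line(obj):
    """Apply the proposed rule to line and rect objects only."""
    if obj.get("object_type") not in ("line", "rect"):
        return False
    if obj.get("stroke"):
        return is_white(obj.get("stroking_color"))
    return is_white(obj.get("non_stroking_color"))


def keep_visible(obj):
    """Predicate suitable for passing to page.filter()."""
    return not is_invisible_line(obj)
```

Because the rule is a plain predicate, it can be applied as an opt-in step (for example via `page.filter(keep_visible)`) rather than changing default behavior.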
Also, many "Rects" are actually lines... well, every line is a rect. Is there a rect2line method in your implementation? My email is zchen344@bloomberg.net. I might be wrong, but let me spend some more time reading this package. Thanks!
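To illustrate the rect-to-line idea raised here, a rectangle can be decomposed into its four boundary edges. The function below is a simplified illustration under assumed key names (`x0`, `x1`, `top`, `bottom`); `rect_to_edges_sketch` is a hypothetical name, and the library's own conversion may differ in detail.

```python
def rect_to_edges_sketch(rect):
    """Split a rect dict into four edge dicts (top, bottom, left, right)."""
    x0, x1 = rect["x0"], rect["x1"]
    top, bottom = rect["top"], rect["bottom"]

    def edge(ex0, ex1, etop, ebottom, orientation):
        # Copy the rect's other attributes (e.g. colors) so that later
        # filters can still inspect them on each edge.
        return {
            **rect,
            "object_type": "rect_edge",
            "x0": ex0,
            "x1": ex1,
            "top": etop,
            "bottom": ebottom,
            "width": ex1 - ex0,
            "height": ebottom - etop,
            "orientation": orientation,
        }

    return [
        edge(x0, x1, top, top, "h"),        # top edge
        edge(x0, x1, bottom, bottom, "h"),  # bottom edge
        edge(x0, x0, top, bottom, "v"),     # left edge
        edge(x1, x1, top, bottom, "v"),     # right edge
    ]
```

A thin rect that is visually a line yields two nearly coincident horizontal (or vertical) edges, which a snapping tolerance would typically merge.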