jsvine / pdfplumber

Plumb a PDF for detailed information about each char, rectangle, line, et cetera — and easily extract text and tables.
MIT License

Can we get type of line,rect? (Dotted, Non Dotted, Empty box rects) #382

Open sreeni5493 opened 3 years ago

sreeni5493 commented 3 years ago

I want to differentiate dotted lines from solid lines. I'm attaching a sample: Buprenorphine.pdf. Here I want to ignore the dotted lines but keep the solid ones. Is that possible?

Similarly, the sample below contains empty rectangles, which cause table extraction to go wrong. I don't know how this file was created, but I don't see any visible lines, yet there are rectangles around almost every line of normal text: ZYDALIS TABS - PI - Kenya, Uganda - 26 10 2017.pdf

jsvine commented 3 years ago

Have you examined the way in which the dotted lines are represented in either (a) the raw PDF file, or (b) page.objects? That's usually my first step in trying to solve these sorts of questions for myself. Once you do that, report back here what you've found and we can discuss further.
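One way to carry out the inspection suggested above is to tally which attribute combinations occur among a page's line objects. The following is a minimal sketch, not part of the thread: the helper names `summarize` and `summarize_pdf_lines` are hypothetical, and the choice of attributes to compare is an assumption about what might differ between dotted and solid lines.

```python
from collections import Counter


def summarize(objects, keys=("stroking_color", "non_stroking_color", "linewidth")):
    """Count the distinct value combinations of `keys` across object dicts."""
    counts = Counter()
    for obj in objects:
        # repr() makes unhashable values such as lists usable as Counter keys.
        counts[tuple(repr(obj.get(k)) for k in keys)] += 1
    return counts


def summarize_pdf_lines(path, page_index=0):
    """Open a PDF with pdfplumber and summarize the line objects on one page."""
    import pdfplumber  # imported here so summarize() can be used without it

    with pdfplumber.open(path) as pdf:
        page = pdf.pages[page_index]
        # page.objects is a dict keyed by object type ("line", "rect", "char", ...).
        return summarize(page.objects.get("line", []))
```

Calling `summarize_pdf_lines("Buprenorphine.pdf")` would show at a glance whether the dotted lines share some attribute value that the solid lines lack.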

sreeni5493 commented 3 years ago

I obtained both kinds of lines using page.lines. I will check page.objects and see whether anything there distinguishes them. Thanks.

sreeni5493 commented 3 years ago

Dotted lines and solid lines are all line objects, with no differentiation. Also, in the second case, there are empty rectangles with no visible boxes.

So everything appears under lines or rects, but nothing indicates whether a line is dotted or solid, and likewise nothing indicates the type of a rect.

jsvine commented 3 years ago

Can you paste some examples of line objects that represent the dotted lines vs. line objects that represent the solid lines? The more explicit and detailed issues are, the easier they are to resolve and the more useful they will be to other users. Thanks!

samkit-jain commented 3 years ago

@sreeni5493 I was able to distinguish between the dotted and non-dotted lines using the stroking_color property of an edge.

For dotted lines, the stroking_color is [1]

im.draw_lines([e for e in page.edges if e['stroking_color'] == [1]], stroke_width=10)

So, if you want to filter out all the dotted lines, you may use the following code:

import pdfplumber

pdf = pdfplumber.open("file.pdf")
p = pdf.pages[0]

def not_dotted_line(obj):
    """Return True if the object is not a dotted line."""
    # In this PDF the dotted lines have a stroking_color of [1].
    # It would be a good idea to add more checks, e.g. on "object_type".
    return obj.get("stroking_color") != [1]

p = p.filter(not_dotted_line)

im = p.to_image(resolution=200)
im.draw_lines(p.edges, stroke_width=10)
im.save("file.png", format="PNG")

Output image

I only tested this on the first PDF you uploaded but you may use a similar approach with the other.
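A slightly more defensive variant of the filter above might also check `object_type`, so that only line-like objects are ever dropped, and compare colors independently of whether they arrive as lists or tuples (which, as far as I know, can differ between pdfplumber versions). This is a sketch under those assumptions; `color_matches` and `keep_object` are hypothetical names, and the color `(1,)` is specific to the first sample PDF.

```python
def color_matches(value, target=(1,)):
    """True if a color value equals `target`, whether stored as a list or a tuple."""
    if isinstance(value, (list, tuple)):
        return tuple(value) == tuple(target)
    return False


def keep_object(obj, dotted_color=(1,)):
    """Return False only for line-like objects drawn in the dotted-line color."""
    if obj.get("object_type") not in ("line", "rect", "curve"):
        return True  # chars, images, etc. are never dropped
    return not color_matches(obj.get("stroking_color"), dotted_color)
```

It would be used the same way as before, for example `p = p.filter(keep_object)`.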

samkit-jain commented 3 years ago

This should give you an idea of how to proceed with distinguishing between different types of lines. You may also try saving all the objects as a CSV using https://github.com/jsvine/pdfplumber#basic-example and then try to find any unique patterns and draw those objects on an image and debug.

jsvine commented 3 years ago

Thanks, @samkit-jain! From reading the raw directives in the Buprenorphine.pdf file, it appears that this PDF is setting the graphics state's "line dash pattern" (see p. 217 of the PDF spec).

pdfminer.six does parse and store the full graphics state internally, but it is not accessible via the library's interface for line objects. (See here.) However, perhaps we can file a PR on pdfminer.six to expose that information, which would be useful for other things, such as here.
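For reference, the line dash pattern is set in a content stream by the `d` operator, which takes a dash array and a phase (for example `[3 2] 0 d`); an empty array (`[] 0 d`) means a solid line. The sketch below scans an already-decoded content stream for such operators. It is only an illustration: the function names are hypothetical, obtaining the decoded stream is out of scope, and a plain regex can be fooled by similar text inside string literals.

```python
import re

# Matches "[a b ...] phase d"; the trailing \b keeps it from matching "d0"/"d1".
DASH_OP = re.compile(r"\[([^\]]*)\]\s*(-?\d*\.?\d+)\s+d\b")


def find_dash_patterns(content):
    """Return (dash_array, phase) pairs found in a decoded content stream."""
    patterns = []
    for array_text, phase in DASH_OP.findall(content):
        dash_array = [float(n) for n in array_text.split()]
        patterns.append((dash_array, float(phase)))
    return patterns


def is_dashed(dash_array):
    """An empty dash array means a solid line."""
    return len(dash_array) > 0
```

For instance, `find_dash_patterns("[3 2] 0 d 0 0 m 100 0 l S")` finds one pattern whose array is non-empty, so the line drawn after it is dashed.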

sreeni5493 commented 3 years ago

@samkit-jain Yes, I did the same in this case. But if you look at the same PDF, there are black dotted lines on the second page. Color was my first idea for this case too, but as @jsvine mentioned, getting the line dash pattern, rather than just a single straight line, would help solve it in a generalized fashion. I deal with data like this (text from tablet pack inserts), where getting the reading order right is a very complex process. pdfplumber gives us almost all the raw data we want, which helps a lot. Thanks to you guys.

Please let me know if, in the future, we can get the types of lines too.

Thanks :)

sreeni5493 commented 3 years ago

With respect to the second use case, I am attaching the rect object output:

ZYDALIS TABS - PI - Kenya, Uganda - 26 10 2017_rects_output.pdf

I used a different library to draw the rects in the PDF itself rather than on an image. You can clearly see that rects are marked where there are no lines at all (starting from page 2). If we could know the type of a rectangle (a rectangle without vector lines, or some other type), that would also help. The issue is that these rects end up forming tables. Perhaps, as in camelot, parameters such as table accuracy, whitespace, or the number of lines in a table would also help.

Also, in this case, the table on page 11 is made up entirely of rects, one for every cell.

The tool is already quite helpful. I can figure out workarounds, such as using color. I just wanted to post all the issues here so that the tool can be improved :)
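One possible workaround for the invisible rects, in the same spirit as the color filter earlier in the thread, is to drop rects that are neither visibly stroked nor visibly filled before extracting tables. The thread does not establish why these rects are invisible, so this is only a sketch under assumptions: that rect dicts carry `stroke` and `fill` flags (treated as True when missing), and that the page background is white. The names `is_white` and `is_invisible_rect` are hypothetical.

```python
def is_white(color):
    """True for white in grayscale (1), RGB (1, 1, 1) or CMYK (0, 0, 0, 0)."""
    if color is None:
        return False
    if isinstance(color, (int, float)):
        color = (color,)
    color = tuple(color)
    if len(color) in (1, 3):
        return all(c == 1 for c in color)
    if len(color) == 4:
        return all(c == 0 for c in color)
    return False


def is_invisible_rect(obj):
    """True for rects with no visible stroke and no visible fill."""
    if obj.get("object_type") != "rect":
        return False
    # Assume the rect is painted when the flag is missing.
    stroked = obj.get("stroke", True)
    filled = obj.get("fill", True)
    visible_stroke = stroked and not is_white(obj.get("stroking_color"))
    visible_fill = filled and not is_white(obj.get("non_stroking_color"))
    return not (visible_stroke or visible_fill)
```

A page could then be filtered with `page.filter(lambda obj: not is_invisible_rect(obj))` before calling `extract_tables()`.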