jsvine / pdfplumber

Plumb a PDF for detailed information about each char, rectangle, line, et cetera — and easily extract text and tables.
MIT License

original_path extraction error regarding LTCurve #1057

Open KaboChow opened 7 months ago

KaboChow commented 7 months ago

To test shape extraction from a PDF, I created the text letter 'o' and converted it into a shape object.

[image]

Here is the curve data I obtained.

[image]

Normally, there should only be one set of curve data. However, it seems that there are two in this case. Here is the graphic created on the canvas using the obtained data:

[image]

The filling color obtained for the second set of curve data is incorrect.

This is the PDF I conducted the test on: LTCurve.pdf

Is there any way to resolve this? Thank you very much.

jsvine commented 6 months ago

Hi @KaboChow, and thanks for providing this interesting example. It appears to relate to pdfplumber's main dependency, pdfminer.six.

It seems that there's some discussion of this general issue here: https://github.com/pdfminer/pdfminer.six/issues/861#issuecomment-1493442408

As it happens, however, the piece of pdfminer.six code it likely relates to is code I've contributed. Just brainstorming here, I think the issue is that folks generally want to decompose paths with multiple subpaths, for the purpose of rectangle detection. (See this test for an example.) As the issue comments above correctly point out, this makes it difficult/impossible to correctly handle more complex paths, such as shapes with holes in them.

One solution would be to propose reverting the behavior so that it does not decompose complex paths, with the downside being that some clearly rectangle-like things do not get recognized as such.

Another would be to tweak the behavior so that it mostly does not decompose complex paths, except in the case of those composed entirely of rectangles. The downside would be that this may be a confusing rule, and also that some all-rectangle complex paths are still intended to be understood as shapes with holes in them.
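To make the second option concrete, here is a minimal sketch of such a rule. It is not pdfminer.six code: the path representation (tuples such as `("m", (x, y))`, `("l", (x, y))`, `("c", p1, p2, p3)`, `("h",)`) and the helper names `split_subpaths`, `is_rectangle`, and `should_decompose` are assumptions made for illustration only.

```python
# Hypothetical sketch: decompose a path only if every subpath is a rectangle.
# The segment format below is a simplified assumption, not pdfminer.six's API.

def split_subpaths(path):
    """Split a flat list of segments into subpaths, starting a new one at each 'm'."""
    subpaths = []
    for seg in path:
        if seg[0] == "m" or not subpaths:
            subpaths.append([])
        subpaths[-1].append(seg)
    return subpaths


def is_rectangle(subpath):
    """True if the subpath is an axis-aligned, closed four-sided shape."""
    ops = "".join(seg[0] for seg in subpath)
    if ops not in ("mlllh", "mllll", "mllllh"):
        return False
    pts = [seg[1] for seg in subpath if seg[0] != "h"]
    if len(pts) == 5:
        # A fifth point must return to the start to close the shape.
        if pts[4] != pts[0]:
            return False
        pts = pts[:4]
    (x0, y0), (x1, y1), (x2, y2), (x3, y3) = pts
    # Edges must alternate between vertical and horizontal, in either order.
    starts_vertical = x0 == x1 and y1 == y2 and x2 == x3 and y3 == y0
    starts_horizontal = y0 == y1 and x1 == x2 and y2 == y3 and x3 == x0
    return starts_vertical or starts_horizontal


def should_decompose(path):
    """Decompose only multi-subpath paths made up entirely of rectangles."""
    subpaths = split_subpaths(path)
    return len(subpaths) > 1 and all(is_rectangle(sp) for sp in subpaths)
```

Under this sketch, a path made of two rectangles would be split, while the letter 'o' (two curved subpaths) would stay whole. As noted above, it would still misread an all-rectangle path that is meant to be a shape with a hole.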

Thanks again. Will keep thinking on this, and welcome suggestions from others, too.

KaboChow commented 5 months ago

@jsvine Thank you for your answer. As a workaround, I post-process the extracted data: when the 'evenodd' value of two objects is false, I check whether the boundaries of the two objects coincide. If they do, I treat the smaller of the two as the subpath. This method works for me, and I hope it will be helpful for people who run into the same confusion.
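For anyone who wants to try something similar, here is a rough sketch of that kind of post-processing. It is only one reading of the workaround above: it assumes each curve is a plain dict with `x0`, `top`, `x1`, `bottom`, and `evenodd` keys (shaped like the objects pdfplumber returns), and it interprets "boundaries coincide" as one bounding box lying inside the other. The names `contains` and `pair_holes` and the tolerance `tol` are hypothetical.

```python
# Hypothetical sketch: pair each outer curve with smaller curves nested inside it.
# Assumes dicts with x0/top/x1/bottom/evenodd keys; not part of pdfplumber itself.

def bbox(curve):
    """Return the bounding box as (x0, top, x1, bottom)."""
    return (curve["x0"], curve["top"], curve["x1"], curve["bottom"])


def area(curve):
    x0, top, x1, bottom = bbox(curve)
    return (x1 - x0) * (bottom - top)


def contains(outer, inner, tol=0.01):
    """True if inner's bounding box lies within outer's, allowing a small tolerance."""
    ox0, otop, ox1, obottom = bbox(outer)
    ix0, itop, ix1, ibottom = bbox(inner)
    return (
        ox0 - tol <= ix0
        and otop - tol <= itop
        and ix1 <= ox1 + tol
        and ibottom <= obottom + tol
    )


def pair_holes(curves):
    """Return (outer_index, hole_index) pairs for non-evenodd curves."""
    pairs = []
    for i, outer in enumerate(curves):
        for j, inner in enumerate(curves):
            if i == j:
                continue
            # Only consider pairs where both objects have evenodd set to false.
            if outer.get("evenodd") or inner.get("evenodd"):
                continue
            # The smaller, enclosed object is treated as the subpath (the hole).
            if area(inner) < area(outer) and contains(outer, inner):
                pairs.append((i, j))
    return pairs
```

With the curves from a page collected into a list, `pair_holes(curves)` gives index pairs that can be used to skip filling the inner object or to cut it out of the outer one when drawing. Bounding boxes are a coarse test, so shapes that merely sit inside another shape's box would also be paired.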