pdfminer / pdfminer.six

Community maintained fork of pdfminer - we fathom PDF
https://pdfminersix.readthedocs.io
MIT License
5.95k stars, 930 forks

Error regarding ‘evenodd’ value of LTCurve object #1024

Open KaboChow opened 3 months ago

KaboChow commented 3 months ago

Hello everyone! I found a problem regarding the 'evenodd' value of the LTCurve object.

[image]

When I try to get the data of this porous shape, the 'evenodd' values obtained are all true.

[image]

This is the PDF I used for testing: Spin-City-Letters-6fae9bb1b9a6b3dd0f5811b066e9ed8e (1).pdf

When I convert letters or numbers to shapes, the recognized data is correct and the value of 'evenodd' is false. But when I use a custom shape, the recognized values of 'evenodd' are all true. Can anyone solve this problem? Thanks!

dhdaines commented 3 months ago

What is the expected behaviour here? If the shape was painted with the even-odd rule in the PDF, then evenodd will be set on all of its subpaths. This seems reasonable, no? If you want to know what regions are filled then you have to apply the rule.
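To illustrate what "applying the rule" means, here is a minimal sketch in plain Python. It is not pdfminer.six code: the function names and the sample squares are made up for illustration, and it assumes each subpath is a closed polygon given as a list of (x, y) vertices (curves would need to be flattened first). A ray is cast to the right of the point; the even-odd rule looks at the parity of the crossings, while the non-zero winding rule looks at the signed sum of crossing directions.

```python
from typing import Sequence, Tuple

Point = Tuple[float, float]


def crossings_and_winding(p: Point, subpaths: Sequence[Sequence[Point]]) -> Tuple[int, int]:
    """Return (number of ray crossings, signed winding number) for point p.

    The ray goes from p in the +x direction. All subpaths of the same
    path are considered together, as the fill rules require.
    """
    px, py = p
    crossings = 0
    winding = 0
    for path in subpaths:
        n = len(path)
        for i in range(n):
            (x0, y0), (x1, y1) = path[i], path[(i + 1) % n]
            # Only edges that straddle the horizontal line through p count.
            if (y0 <= py) != (y1 <= py):
                # x coordinate where the edge meets that horizontal line.
                x_at = x0 + (py - y0) * (x1 - x0) / (y1 - y0)
                if x_at > px:
                    crossings += 1
                    winding += 1 if y1 > y0 else -1
    return crossings, winding


def is_filled(p: Point, subpaths: Sequence[Sequence[Point]], evenodd: bool) -> bool:
    """Apply the even-odd rule or the non-zero winding rule at point p."""
    crossings, winding = crossings_and_winding(p, subpaths)
    return crossings % 2 == 1 if evenodd else winding != 0


# Hypothetical sample data: two nested squares drawn in the same direction.
outer = [(0, 0), (10, 0), (10, 10), (0, 10)]
inner = [(3, 3), (7, 3), (7, 7), (3, 7)]

print(is_filled((5, 5), [outer, inner], evenodd=True))   # False: a hole
print(is_filled((5, 5), [outer, inner], evenodd=False))  # True: painted
```

The same point is a hole under one rule and painted under the other, which is why the evenodd flag by itself says nothing about an individual subpath.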

(It looks like you are actually using pdfplumber, not pdfminer.six directly, but the evenodd attribute is coming directly from pdfminer.six)

dhdaines commented 3 months ago

On further investigation it appears that this is related to https://github.com/jsvine/pdfplumber/issues/1057, which is related to #861 and #963. I'm still not quite sure what the expected behaviour should be, though.

I think the issue is that you have one path (the porous shape above) with a lot of subpaths, which has been drawn with the f*, b* or B* operator, and that pdfminer.six has split this path into a bunch of separate LTCurve shapes, which makes it impossible for you to know which ones are filled and which ones are not?

The problem here wouldn't be evenodd as that attribute only refers to whether the even-odd rule is applied to fill the shape. I think you want to know which of the LTCurve shapes are filled and which ones aren't? In this case the expected behaviour would be for pdfminer.six to set the fill attribute on those shapes.

Is this correct?

KaboChow commented 3 months ago

Hello @dhdaines, your point is correct. I had misunderstood: the "evenodd" property can only be used to distinguish between the even-odd and non-zero winding rules, and in fact cannot tell whether an LTCurve shape is a hole or not. The porous shape in the example is actually a full path, but it's split into multiple LTCurve shapes for rectangular detection, which I guess is what caused the problem. As a workaround, I removed the rule that splits LTCurve shapes. While that doesn't seem like a good idea, the lack of rectangular detection doesn't affect me much, and the problem that bothered me is solved.

dhdaines commented 3 months ago

> The porous shape in the example is actually a full path, but it's split into multiple LTCurve shapes for rectangular detection, which I guess is what caused the problem.

Thanks! That's kind of what I thought - your misunderstanding of evenodd is perfectly understandable, in fact it isn't useful at all when the shapes are split since there's no way to apply the fill rule. So this should still be considered a bug. My thinking on this would be either:

  1. Don't split complex paths into multiple LTCurve objects, and keep the evenodd attribute, letting the user apply the rule (non-zero winding or even-odd) to determine the filled areas.
  2. Continue splitting complex paths, and apply the rule in pdfminer.six, setting the fill attribute on the filled subpaths. Possibly remove the evenodd attribute since it is meaningless without knowing all the subpaths.
  3. Extend the pdfminer.layout API to include the concept of complex paths, or somehow expose the fact that an LTCurve is part of a larger path. Again, the user will then have to apply the fill rule.

I think @jsvine might need to weigh in on this since I think he contributed the code in question?

Probably the simplest to implement would be (1) or (3).
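For reference, a rough sketch of what option (2) could look like, in plain Python rather than actual pdfminer.six code. Everything here is hypothetical: the function names are invented, subpaths are assumed to be closed polygons given as vertex lists, and they are assumed to be properly nested (no subpath crosses or touches another). Under those assumptions, a per-subpath fill flag can be derived from the subpaths that enclose it: nesting depth for the even-odd rule, and summed orientations for the non-zero winding rule.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float]


def orientation(path: Sequence[Point]) -> int:
    """Return +1 for a counterclockwise polygon, -1 for a clockwise one."""
    pts = list(path)
    twice_area = sum(
        x0 * y1 - x1 * y0 for (x0, y0), (x1, y1) in zip(pts, pts[1:] + pts[:1])
    )
    return 1 if twice_area > 0 else -1


def contains(path: Sequence[Point], p: Point) -> bool:
    """Ray-casting point-in-polygon test (boundary cases are not handled)."""
    px, py = p
    inside = False
    n = len(path)
    for i in range(n):
        (x0, y0), (x1, y1) = path[i], path[(i + 1) % n]
        if (y0 <= py) != (y1 <= py):
            if x0 + (py - y0) * (x1 - x0) / (y1 - y0) > px:
                inside = not inside
    return inside


def subpath_fill_flags(subpaths: Sequence[Sequence[Point]], evenodd: bool) -> List[bool]:
    """For each subpath, say whether the area just inside it is painted."""
    flags = []
    for i, path in enumerate(subpaths):
        # Because subpaths are assumed not to intersect, testing a single
        # vertex is enough to decide whether another subpath encloses this one.
        enclosing = [
            q for j, q in enumerate(subpaths) if j != i and contains(q, path[0])
        ]
        if evenodd:
            flags.append(len(enclosing) % 2 == 0)
        else:
            winding = orientation(path) + sum(orientation(q) for q in enclosing)
            flags.append(winding != 0)
    return flags


# Hypothetical sample data: one outer square and two inner squares,
# the first drawn in the same direction as the outer one, the second reversed.
outer = [(0, 0), (10, 0), (10, 10), (0, 10)]
same_dir = [(1, 1), (4, 1), (4, 4), (1, 4)]
reversed_dir = [(6, 6), (6, 9), (9, 9), (9, 6)]

print(subpath_fill_flags([outer, same_dir, reversed_dir], evenodd=True))
print(subpath_fill_flags([outer, same_dir, reversed_dir], evenodd=False))
```

With the even-odd rule both inner squares come out as holes; with the non-zero winding rule only the reversed one does. Real paths with self-intersections or overlapping subpaths would need a more careful treatment than this.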