pdfminer / pdfminer.six

Community maintained fork of pdfminer - we fathom PDF
https://pdfminersix.readthedocs.io
MIT License

LTCurve incorrect non_stroking_color for hollow shapes #963

Open AngusWR opened 3 months ago

AngusWR commented 3 months ago

LTCurve objects seem to report an incorrect non_stroking_color for the parts of a shape that are intended to be hollow. I've put together a simplified document and some code demonstrating the bug. The example PDF "donut2.pdf" contains a hollow shape that was inserted in Microsoft Word and then converted to PDF.

Original PDF: donut2.pdf

Output PDF: output.pdf

from pdfminer.high_level import extract_pages
from pdfminer.layout import LTCurve
from reportlab.pdfgen import canvas

def get_ltcurves(pdf_path):
    elements = []
    for page_layout in extract_pages(pdf_path):
        for element in page_layout:
            if isinstance(element, LTCurve):
                print(f"{element}\n\t{element.stroking_color}\n\t{element.non_stroking_color}")
                elements.append(element)
    return elements

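# The second parameter is named c so that the body draws on the canvas it is given,
# rather than on the module-level c (and so it no longer shadows the canvas module).
def draw_curve_from_ltcurve(ltcurve, c):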
    c.setLineWidth(ltcurve.linewidth)
    if ltcurve.stroking_color is not None:
        c.setStrokeColorRGB(*ltcurve.stroking_color)

    if ltcurve.non_stroking_color is not None:
        c.setFillColorRGB(*ltcurve.non_stroking_color)

    path = c.beginPath()
    for element in ltcurve.original_path:
        command = element[0]
        points = element[1:]
        if command == "m":
            x, y = points[0]
            path.moveTo(x, y)
        elif command == "l":
            x, y = points[0]
            path.lineTo(x, y)
        elif command == "c":
            (x1, y1), (x2, y2), (x3, y3) = points
            path.curveTo(x1, y1, x2, y2, x3, y3)
        elif command == "h":
            path.close()

    fill_mode = 0 if ltcurve.evenodd else 1
    c.drawPath(path, stroke=ltcurve.stroke, fill=ltcurve.fill, fillMode=fill_mode)

ltcurves = get_ltcurves(r"donut2.pdf")

# NOTE "fixes" this specific case
# ltcurves[1].non_stroking_color = (1, 1, 1)

output_pdf_path = "output.pdf"
c = canvas.Canvas(output_pdf_path)

for ltcurve in ltcurves:
    draw_curve_from_ltcurve(ltcurve, c)

c.showPage()
c.save()

Console output:

<LTCurve 93.600,631.920,220.200,758.520>
        None
        (0, 1, 0)
<LTCurve 125.250,663.570,188.550,726.870>
        None
        (0, 1, 0)
<LTCurve 93.600,631.920,220.200,758.520>
        (0, 0, 1)
        (0, 1, 0)
<LTCurve 125.250,663.570,188.550,726.870>
        (0, 0, 1)
        (0, 1, 0)

Possibly related to #861

dhdaines commented 2 months ago

Hi, the problem here isn't that non_stroking_color is incorrect. The problem is that fill is set on all subpaths of a complex path, even if some of them should not be filled (regardless of whether the even-odd or the nonzero winding rule is applied).
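To make the two fill rules concrete, here is a minimal, library-independent sketch. It is not pdfminer.six code: the helper names `crossings` and `is_filled` and the square "donut" coordinates are made up for illustration, and subpaths are simplified to straight-edged polygons. It suggests why a point in the hole stays unfilled when both subpaths are evaluated together, but gets filled when the inner subpath is treated as a shape of its own.

```python
def crossings(point, polygon):
    """Cast a ray to the right of `point`; return (crossing count, winding number)."""
    px, py = point
    count = 0
    winding = 0
    n = len(polygon)
    for i in range(n):
        (x0, y0), (x1, y1) = polygon[i], polygon[(i + 1) % n]
        # Only edges that straddle the horizontal line through the point matter
        # (this also guarantees y0 != y1, so the division below is safe).
        if (y0 <= py) != (y1 <= py):
            x_cross = x0 + (py - y0) * (x1 - x0) / (y1 - y0)
            if x_cross > px:
                count += 1
                winding += 1 if y1 > y0 else -1
    return count, winding


def is_filled(point, subpaths, evenodd):
    """Apply the even-odd or nonzero winding rule across ALL subpaths together."""
    total_count = 0
    total_winding = 0
    for subpath in subpaths:
        count, winding = crossings(point, subpath)
        total_count += count
        total_winding += winding
    return total_count % 2 == 1 if evenodd else total_winding != 0


# Hypothetical donut: outer square (counter-clockwise) with an inner square hole.
outer = [(0, 0), (10, 0), (10, 10), (0, 10)]
inner_ccw = [(3, 3), (7, 3), (7, 7), (3, 7)]  # same direction as outer
inner_cw = list(reversed(inner_ccw))          # opposite direction

ring_point = (1, 5)  # between the outer and inner squares
hole_point = (5, 5)  # inside the inner square

# Compound path: the hole is respected (nonzero needs opposite winding for that).
ring_evenodd = is_filled(ring_point, [outer, inner_ccw], evenodd=True)
hole_evenodd = is_filled(hole_point, [outer, inner_ccw], evenodd=True)
hole_nonzero_opposite = is_filled(hole_point, [outer, inner_cw], evenodd=False)
hole_nonzero_same = is_filled(hole_point, [outer, inner_ccw], evenodd=False)

# Inner subpath filled on its own, as happens when it is drawn as a separate shape.
hole_alone = is_filled(hole_point, [inner_ccw], evenodd=True)
```

In this sketch the hole only exists while the inner square is evaluated together with the outer one. Evaluated alone, its interior is simply "inside" under either rule.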

pdfminer.six doesn't actually apply either of those rules; it just returns all of the subpaths as separate LTCurve objects, and that is the real problem here.
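Until that changes, one possible workaround on the caller's side is to stitch the subpaths back together before drawing. The sketch below is a heuristic, not something pdfminer.six provides: `paint_key` and `merge_subpaths` are hypothetical helper names, and the assumption that consecutive LTCurve objects with identical paint attributes came from the same compound path may be wrong for documents that draw unrelated shapes back to back with the same style. It only relies on the attributes already used in the script above.

```python
from itertools import groupby


def paint_key(curve):
    """Attributes that should be identical for subpaths of one compound path."""
    return (
        curve.stroke,
        curve.fill,
        curve.evenodd,
        curve.linewidth,
        curve.stroking_color,
        curve.non_stroking_color,
    )


def merge_subpaths(curves):
    """Group consecutive curves with the same paint attributes (heuristic).

    Returns a list of (representative_curve, combined_path) tuples, where
    combined_path is the concatenation of each curve's original_path.
    """
    merged = []
    for _, group in groupby(curves, key=paint_key):
        group = list(group)
        combined = [
            segment
            for curve in group
            # original_path may be missing, so fall back to an empty list
            for segment in (curve.original_path or [])
        ]
        merged.append((group[0], combined))
    return merged
```

Each combined path could then go through the same moveTo/lineTo/curveTo/close loop as in the script above and be drawn with a single drawPath call, so that the fill rule sees the outer and inner subpaths together.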