Open luanmota opened 1 year ago
Hi @luanmota. Appreciate your interest in the library. The non_stroking_color in pdfplumber comes from pdfminer.six's PDFGraphicState.ncolor. My recommendation would be to also open an issue (or start a discussion) on the pdfminer.six repo.
Hi @luanmota, a couple of additional notes:
I've never heard of color components using string literals. (Interesting!) My guess is that this is against the PDF spec, although I can't find a direct source for that.
Try applying pdfplumber.utils.resolve_and_decode(...) on the non_stroking_color values. Does that work for you?
Hey @jsvine thanks for your time and help!
Sorry for the "noob" question, but what exactly does pdfplumber.utils.resolve_and_decode(...) do? I couldn't find this function in the documentation.
Thanks!
@luanmota, no apology necessary! That's a utility function that's mainly used internally (and thus not listed in the core documentation), but might be useful here for your edge case. It resolves any indirect object references (not an issue for you) and converts any PSLiterals into standard text (your issue). You can see its implementation here: https://github.com/jsvine/pdfplumber/blob/ee48b26099a614b9e97465963a5ff46aa2b04e46/pdfplumber/utils/pdfinternals.py#L19-L34
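To illustrate the literal-to-text conversion described above, here is a minimal sketch. It is not pdfplumber's implementation (see the link above for that) and it skips indirect-reference resolution entirely. FakeLiteral and decode_literals are hypothetical names, and the sketch assumes only that a literal exposes a .name attribute, as pdfminer.six's PSLiteral does.

```python
def decode_literals(obj):
    """Recursively replace literal-like objects with plain strings."""
    # Assumption: a "literal" is anything exposing a .name attribute,
    # which is how pdfminer.six's PSLiteral stores its value.
    if hasattr(obj, "name"):
        return decode_literals(obj.name)
    if isinstance(obj, bytes):
        return obj.decode("utf-8", errors="replace")
    if isinstance(obj, (list, tuple)):
        return [decode_literals(item) for item in obj]
    if isinstance(obj, dict):
        return {key: decode_literals(value) for key, value in obj.items()}
    # Numbers, strings, None, etc. pass through unchanged.
    return obj


class FakeLiteral:
    """Hypothetical stand-in for pdfminer.six's PSLiteral."""

    def __init__(self, name):
        self.name = name


color = [FakeLiteral("Pattern1")]
print(decode_literals(color))  # ['Pattern1']
```

With a real page, the equivalent call would be pdfplumber.utils.resolve_and_decode(...) applied to each object's non_stroking_color value.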
@jsvine I tested resolve_and_decode, and in some cases non_stroking_color is a list containing this: /'Pattern1'
Do you have any idea what it could be? I searched pdfminer.six but found nothing there.
I think we can close this issue if you don't have any idea how I can interpret this return value. I found another problem with this PDF, duplicate characters, and resolved it with the dedupe_chars function. pdfplumber is a really great tool!!! Thanks again for the help :)
Thanks for the kind words @luanmota, and thanks for the very interesting example. The PDF specification has a section ("4.6 Patterns") on patterns, and it seems like this is what the non_stroking_color value is trying to use. Per the example on pp. 296–297, it seems that this approach is valid. (My mistake on thinking it was invalid earlier.)
Accessing details about the pattern is possible, using page.page_obj.resources to access the raw resource information gathered by pdfminer.six. E.g., for your example:
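import pdfplumber

# Open the PDF attached to this issue (adjust the path as needed)
pdf = pdfplumber.open("Edital053_Assinado.pdf")
page = pdf.pages[33]
p1 = page.page_obj.resources["Pattern"]["Pattern1"]
print(pdfplumber.utils.resolve_and_decode(p1))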
... which gives you:
{'Matrix': [0.75, 0, 0, -0.75, 0, 841.92004],
 'PatternType': 2,
 'Shading': {'ColorSpace': 'DeviceRGB',
             'Coords': [0, 152.48, 0, 153.75999],
             'Extend': [True, True],
             'Function': {'Bounds': [0.5, 0.5],
                          'Domain': [0, 1],
                          'Encode': [0, 1, 0, 1, 0, 1],
                          'FunctionType': 3,
                          'Functions': [{'C0': [0, 0, 0.03922],
                                         'C1': [0, 0, 0.03922],
                                         'Domain': [0, 1],
                                         'FunctionType': 2,
                                         'N': 1},
                                        {'C0': [0, 0, 0.03922],
                                         'C1': [0, 0, 0],
                                         'Domain': [0, 1],
                                         'FunctionType': 2,
                                         'N': 1},
                                        {'C0': [0, 0, 0],
                                         'C1': [0, 0, 0],
                                         'Domain': [0, 1],
                                         'FunctionType': 2,
                                         'N': 1}]},
             'ShadingType': 2},
 'Type': 'Pattern'}
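If the goal is to find out which colors the pattern actually paints, one option is to walk the resolved dictionary and collect the C0/C1 pairs of each sub-function. The sketch below is only an illustration: gradient_segments is a hypothetical helper, the input is an abbreviated copy of the output above, and it assumes the shading uses a stitching function (FunctionType 3) made of interpolation functions (FunctionType 2), as this one does.

```python
def gradient_segments(pattern):
    """Return (C0, C1) color pairs for each segment of a shading pattern."""
    function = pattern.get("Shading", {}).get("Function", {})
    # A stitching function lists its pieces under "Functions";
    # otherwise treat the function itself as the only segment.
    pieces = function.get("Functions", [function])
    return [(f["C0"], f["C1"]) for f in pieces if "C0" in f and "C1" in f]


# Abbreviated copy of the resolved Pattern1 dictionary shown above.
pattern1 = {
    "PatternType": 2,
    "Shading": {
        "ColorSpace": "DeviceRGB",
        "ShadingType": 2,
        "Function": {
            "FunctionType": 3,
            "Functions": [
                {"C0": [0, 0, 0.03922], "C1": [0, 0, 0.03922], "FunctionType": 2},
                {"C0": [0, 0, 0.03922], "C1": [0, 0, 0], "FunctionType": 2},
                {"C0": [0, 0, 0], "C1": [0, 0, 0], "FunctionType": 2},
            ],
        },
    },
}

for start, end in gradient_segments(pattern1):
    print(start, "->", end)
```

For this pattern, every segment runs between black and a very dark blue, so the fill is effectively black.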
Describe the bug
In my code I check whether an object is a rect and apply some filters using the non_stroking_color property. But in one PDF, non_stroking_color is a PSLiteral object and not a float. And if I change my code to check whether non_stroking_color is a float, the text is extracted with triple letters in each word.
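A filter like the one described can be made tolerant of non-numeric colors by checking the value's shape before comparing it. This is a minimal sketch, not the reporter's code: is_numeric_color is a hypothetical helper, and the sample rects are hand-written dicts standing in for the objects pdfplumber returns. It only guards the color check and does not address the repeated letters.

```python
from numbers import Real


def is_numeric_color(color):
    """True if color is a number or a non-empty sequence of numbers."""
    def is_number(value):
        # bool is a subclass of int, so exclude it explicitly.
        return isinstance(value, Real) and not isinstance(value, bool)

    if is_number(color):
        return True
    if isinstance(color, (list, tuple)):
        return len(color) > 0 and all(is_number(c) for c in color)
    return False


# Hand-written stand-ins for rect objects (assumed shape, for illustration).
rects = [
    {"x0": 10, "non_stroking_color": 0.5},
    {"x0": 20, "non_stroking_color": (1, 0, 0)},
    {"x0": 30, "non_stroking_color": ["Pattern1"]},  # pattern name, not a number
    {"x0": 40, "non_stroking_color": None},
]

numeric = [r for r in rects if is_numeric_color(r.get("non_stroking_color"))]
print([r["x0"] for r in numeric])  # [10, 20]
```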
Code to reproduce the problem
PDF file
Edital053_Assinado.pdf
Environment