Closed dhdaines closed 1 year ago
Thanks for flagging this, @dhdaines. You're right: That part of the documentation needs updating. There's a limit to what pdfplumber
can do to here, since we're dependent on pdfminer.six
's parsing, but I agree that there's room for improvement. Here are my observations, based on what I see in pdfminer.six
's codebase:
pdfminer.six
declares Color
to have the following type signature:
Color = Union[
float, # Greyscale
Tuple[float, float, float], # R, G, B
Tuple[float, float, float, float],
] # C, M, Y, K
For the core stroking operations, pdfminer.six
at first blush appears to follow that definition:
Method | Color space | Code snippet | Link |
---|---|---|---|
do_G |
Grayscale | cast(float, G) |
🔗 |
do_RG |
RGB | (cast(float, r), cast(float, g), cast(float, b)) |
🔗 |
do_K |
CMYK | (cast(float, c), cast(float, m), cast(float, y), cast(float, k)) |
🔗 |
... except that cast(...)
is the typing
module's cast
, which explicitly does nothing. ("This returns the value unchanged. To the type checker this signals that the return value has the designated type, but at runtime we intentionally don’t check anything (we want this to be as fast as possible).")
Moreover, there's another color-setting operation, do_SCN
, that just pops tokens off the stack without converting that list
of tokens to a tuple
.
Together, I think this explains the variation you're seeing.
As for fixes:
list
s to tuple
sint
s to float
s, although that might lead to some information loss (not in value, but of the original type in the PDF)do_G
into 1-item tuples (i.e., 0.25
becomes (0.25,)
) for consistency's sake.Thoughts on this? Other suggestions for improvements?
Ah! Thanks for the information! Sorry it took a while to reply, long weekend...
So ultimately this is really a bug in pdfminer.six
- I think casting everything to (possibly single-element) tuples of floats is the right way to go. This means that nobody has to ever call isinstance
, which is generally something to be avoided.
To the extent that there are int
s in there, they're probably bool
values in disguise, since at least in theory, pdfminer.six
only deals in float
s: https://github.com/pdfminer/pdfminer.six/blob/5114acdda61205009221ce4ebf2c68c144fc4ee5/pdfminer/psparser.py#L539
(so, there's no possible information loss from forcing them to float
)
Thanks for following up. A few notes in response:
To the extent that there are
int
s in there, they're probablybool
values in disguise, since at least in theory,pdfminer.six
only deals infloat
s: https://github.com/pdfminer/pdfminer.six/blob/5114acdda61205009221ce4ebf2c68c144fc4ee5/pdfminer/psparser.py#L539
Hmm, that doesn't seem quite right to me, for a few reasons:
1.0 0.0 0.0 rg
and also for 0 0 1 rg
)pdfminer.six
does seem to intend to handle integers and floats differently. See, in particular, PSBaseParser._parse_number(...)
:
https://github.com/pdfminer/pdfminer.six/blob/5114acdda61205009221ce4ebf2c68c144fc4ee5/pdfminer/psparser.py#L389Or perhaps I misunderstand what you're suggesting?
(so, there's no possible information loss from forcing them to
float
)
I ought to have been clearer on this point. The information loss isn't so much that there's a quantitative difference between 1
and 1.0
, but rather that this difference can be helpful for distinguishing between text/graphics/etc. that were produced by different software, or different subroutines of the same software, etc.
Right, the PDF spec (like PostScript) obviously supports both ints and floats! (unlike ahem certain popular programming languages ahem). And despite the type annotations pdfminer.six
will clearly put an int
on the stack: https://github.com/pdfminer/pdfminer.six/blob/5114acdda61205009221ce4ebf2c68c144fc4ee5/pdfminer/psparser.py#L610
I think we're in agreement that the problem here is the misunderstanding of Python typing by the pdfminer.six
authors, and we should expect that both int
and float
could be there, and could be meaningful. I'm obviously thinking about things in a machine learning perspective where everything is going to end up as a bool
or a float32
;-)
It would definitely be easier from the user's point of view if everything were a tuple
and you could just call len
on it to determine what colour space you're dealing with, but I don't know if this strays too far from the PDF spec (which I don't claim to understand even remotely). But, minimally, it would make sense for the colour values to be either a tuple of ints, a tuple of floats, an int, or a float, and not to be a list!
It would definitely be easier from the user's point of view if everything were a
tuple
and you could just calllen
on it to determine what colour space you're dealing with
Oh no, it definitely isn't as simple as that! Because the various colour spaces have a variety of different ranges and components, actually it probably isn't a great idea to type them as tuple
since they might contain a mixture of ints (sometimes in a range of [-100,100], sometimes [-128,127], etc, etc) and floats. So really they would be something like Union[int, float, Sequence[Union[int, float]]]
, unless we provide specific types for all of the possible colour spaces.
In this case it's really just a matter of updating the documentation for the moment, but really, it would be helpful to have access to the colour space from pdfplumber
.
Hm, I'm not quite sure I understand your latest note re. "actually it probably isn't a great idea to type them as tuple, since they might contain a mixture [...]". Is the argument that tuple
s should only contain homogenous types? Or that tuple
s are insufficient to represent the full set of color possibilities? Or perhaps something else?
In this case it's really just a matter of updating the documentation for the moment, but really, it would be helpful to have access to the colour space from
pdfplumber
.
Agreed. I've been working on some tweaks based on this thread. Right now, the only information pdfplumber
can expose is the non-stroking color space for char
objects. The rest (stroking color space for char
objects and either color space for the other types of objects) will require a PR to pdfminer.six
— something I plan to do eventually.
Also, in case you're curious, there's a further wrinkle, which is that colors can also be specified as "patterns" (see section 4.6 here). So I'm also working on separating out stroking/non-stroking patterns from numerically-specified colors.
Hm, I'm not quite sure I understand your latest note re. "actually it probably isn't a great idea to type them as tuple, since they might contain a mixture [...]". Is the argument that
tuple
s should only contain homogenous types?
Ah, sorry, yes that wasn't clear. Yes, it's because a tuple
needs to define the number and type of each of its elements (well, it doesn't need to do this, at least in Python, but it seems to be the general best practice). So given that any element of a colour might be an int or a float, and there might be 1, 3, 4, or ??? of them, the type definitions would get pretty ugly...
Ah. Actually, much like JavaScript, PDF considers integers and floats to be the same thing (and doesn't even specify what the range of integers is...). See https://ghostscript.com/~robin/pdf_reference17.pdf#page=52:
An integer is written as one or more decimal digits optionally preceded by a sign: 123 43445 +17 −98 0 The value is interpreted as a signed decimal integer and is converted to an integer object. If it exceeds the implementation limit for integers, it is converted to a real object.
Wherever a real number is expected, an integer may be used instead and is automatically converted to an equivalent real value. For example, it is not necessary to write the number 1.0 in real format; the integer 1 is sufficient.
In light of this I think it's reasonable to convert all numbers to Python float
.
Thanks for the quick clarification. This is testing my knowledge of Python best practices, but my general impression is that tuples
would be a decent fit for this use case. Per the Python docs:
Tuples are immutable, and usually contain a heterogeneous sequence of elements that are accessed via unpacking (see later in this section) or indexing (or even by attribute in the case of namedtuples).
Colors seem to check that list. And the type definition, seems manageable (or at least acceptable by mypy
):
color_type = Tuple[Union[int, float], ...]
I'm less concerned about the fact that these tuples can be 1, 3, or 4-length, since every individual tuple's length is fixed. Ideally, getting color space information into pdfplumber
will resolve the need to check the length of these tuples, and I'm not sure other types (list
, etc.) would provide any advantage in this respect. But perhaps I'm overlooking a particular wrinkle?
In light of this I think it's reasonable to convert all numbers to Python float.
Perhaps I could have been clearer before: The reason for not converting isn't about how int
s/float
s are handled internally by PDF or Python, or what their specs say, but rather because whether the actual tokens in the PDF use int
/float
representation can be useful information, particularly when trying to discriminate between various objects in a PDF or between PDFs.
Colors seem to check that list. And the type definition, seems manageable (or at least acceptable by
mypy
):color_type = Tuple[Union[int, float], ...]
Yes, this is fine, and acceptable to mypy
, of course... It's just not a particularly useful type definition, since it won't prevent runtime errors in code that assumes a specific configuration of the tuple in question, and the general use case of tuples is for objects that can be expected to have a well-defined shape.
But it is better than the incorrect type definition in pdfminer.six
:-) and the important thing is just that the type definitions match the documentation, which matches reality!
Roger that. And thanks for the conversation here. I have some fixes in progress, and hope to push in the next version. If additional thoughts on the topic occur to you, feel to keep posting them here.
These changes have now landed in v0.10.0. I've also added a new docs page specifically re. colors: https://github.com/jsvine/pdfplumber/blob/stable/docs/colors.md
Closing this issue, but feel free to reopen or continue the discussion if you feel like the changes missed the mark. Thanks again for bringing this up, @dhdaines 👍
Thank you! This looks very good, nice documentation as well!
Describe the bug
The documentation claims that the
"stroking_color"
( and presumably also the"non_stroking_color"
) keys are:This is quite false. Depending on the document and the phase of the moon, they could either be
float
, anint
, atuple
, or alist
.This makes it needlessly complicated to handle them since one cannot simply check
isinstance(c["stroking_color"], tuple)
to know if you have a grayscale or colour value... eventually your program will encounter one that is alist
(why?) and, maybe, crash, or worse, just give a bogus result.Environment