jsvine / pdfplumber

Plumb a PDF for detailed information about each char, rectangle, line, et cetera — and easily extract text and tables.
MIT License
6.1k stars 625 forks source link

colours have inconsistent types #917

Closed dhdaines closed 1 year ago

dhdaines commented 1 year ago

Describe the bug

The documentation claims that the "stroking_color" ( and presumably also the "non_stroking_color") keys are:

expressed as a tuple or integer, depending on the “color space” used

This is quite false. Depending on the document and the phase of the moon, they could either be float, an int, a tuple, or a list.

This makes it needlessly complicated to handle them since one cannot simply check isinstance(c["stroking_color"], tuple) to know if you have a grayscale or colour value... eventually your program will encounter one that is a list (why?) and, maybe, crash, or worse, just give a bogus result.

Environment

jsvine commented 1 year ago

Thanks for flagging this, @dhdaines. You're right: That part of the documentation needs updating. There's a limit to what pdfplumber can do to here, since we're dependent on pdfminer.six's parsing, but I agree that there's room for improvement. Here are my observations, based on what I see in pdfminer.six's codebase:

pdfminer.six declares Color to have the following type signature:

Color = Union[
    float,  # Greyscale
    Tuple[float, float, float],  # R, G, B
    Tuple[float, float, float, float],
]  # C, M, Y, K

For the core stroking operations, pdfminer.six at first blush appears to follow that definition:

Method Color space Code snippet Link
do_G Grayscale cast(float, G) 🔗
do_RG RGB (cast(float, r), cast(float, g), cast(float, b)) 🔗
do_K CMYK (cast(float, c), cast(float, m), cast(float, y), cast(float, k)) 🔗

... except that cast(...) is the typing module's cast, which explicitly does nothing. ("This returns the value unchanged. To the type checker this signals that the return value has the designated type, but at runtime we intentionally don’t check anything (we want this to be as fast as possible).")

Moreover, there's another color-setting operation, do_SCN, that just pops tokens off the stack without converting that list of tokens to a tuple.

Together, I think this explains the variation you're seeing.

As for fixes:

Thoughts on this? Other suggestions for improvements?

dhdaines commented 1 year ago

Ah! Thanks for the information! Sorry it took a while to reply, long weekend...

So ultimately this is really a bug in pdfminer.six - I think casting everything to (possibly single-element) tuples of floats is the right way to go. This means that nobody has to ever call isinstance, which is generally something to be avoided.

To the extent that there are ints in there, they're probably bool values in disguise, since at least in theory, pdfminer.six only deals in floats: https://github.com/pdfminer/pdfminer.six/blob/5114acdda61205009221ce4ebf2c68c144fc4ee5/pdfminer/psparser.py#L539

dhdaines commented 1 year ago

(so, there's no possible information loss from forcing them to float)

jsvine commented 1 year ago

Thanks for following up. A few notes in response:

To the extent that there are ints in there, they're probably bool values in disguise, since at least in theory, pdfminer.six only deals in floats: https://github.com/pdfminer/pdfminer.six/blob/5114acdda61205009221ce4ebf2c68c144fc4ee5/pdfminer/psparser.py#L539

Hmm, that doesn't seem quite right to me, for a few reasons:

Or perhaps I misunderstand what you're suggesting?

(so, there's no possible information loss from forcing them to float)

I ought to have been clearer on this point. The information loss isn't so much that there's a quantitative difference between 1 and 1.0, but rather that this difference can be helpful for distinguishing between text/graphics/etc. that were produced by different software, or different subroutines of the same software, etc.

dhdaines commented 1 year ago

Right, the PDF spec (like PostScript) obviously supports both ints and floats! (unlike ahem certain popular programming languages ahem). And despite the type annotations pdfminer.six will clearly put an int on the stack: https://github.com/pdfminer/pdfminer.six/blob/5114acdda61205009221ce4ebf2c68c144fc4ee5/pdfminer/psparser.py#L610

I think we're in agreement that the problem here is the misunderstanding of Python typing by the pdfminer.six authors, and we should expect that both int and float could be there, and could be meaningful. I'm obviously thinking about things in a machine learning perspective where everything is going to end up as a bool or a float32 ;-)

It would definitely be easier from the user's point of view if everything were a tuple and you could just call len on it to determine what colour space you're dealing with, but I don't know if this strays too far from the PDF spec (which I don't claim to understand even remotely). But, minimally, it would make sense for the colour values to be either a tuple of ints, a tuple of floats, an int, or a float, and not to be a list!

dhdaines commented 1 year ago

It would definitely be easier from the user's point of view if everything were a tuple and you could just call len on it to determine what colour space you're dealing with

Oh no, it definitely isn't as simple as that! Because the various colour spaces have a variety of different ranges and components, actually it probably isn't a great idea to type them as tuple since they might contain a mixture of ints (sometimes in a range of [-100,100], sometimes [-128,127], etc, etc) and floats. So really they would be something like Union[int, float, Sequence[Union[int, float]]], unless we provide specific types for all of the possible colour spaces.

In this case it's really just a matter of updating the documentation for the moment, but really, it would be helpful to have access to the colour space from pdfplumber.

jsvine commented 1 year ago

Hm, I'm not quite sure I understand your latest note re. "actually it probably isn't a great idea to type them as tuple, since they might contain a mixture [...]". Is the argument that tuples should only contain homogenous types? Or that tuples are insufficient to represent the full set of color possibilities? Or perhaps something else?

In this case it's really just a matter of updating the documentation for the moment, but really, it would be helpful to have access to the colour space from pdfplumber.

Agreed. I've been working on some tweaks based on this thread. Right now, the only information pdfplumber can expose is the non-stroking color space for char objects. The rest (stroking color space for char objects and either color space for the other types of objects) will require a PR to pdfminer.six — something I plan to do eventually.


Also, in case you're curious, there's a further wrinkle, which is that colors can also be specified as "patterns" (see section 4.6 here). So I'm also working on separating out stroking/non-stroking patterns from numerically-specified colors.

dhdaines commented 1 year ago

Hm, I'm not quite sure I understand your latest note re. "actually it probably isn't a great idea to type them as tuple, since they might contain a mixture [...]". Is the argument that tuples should only contain homogenous types?

Ah, sorry, yes that wasn't clear. Yes, it's because a tuple needs to define the number and type of each of its elements (well, it doesn't need to do this, at least in Python, but it seems to be the general best practice). So given that any element of a colour might be an int or a float, and there might be 1, 3, 4, or ??? of them, the type definitions would get pretty ugly...

dhdaines commented 1 year ago

Ah. Actually, much like JavaScript, PDF considers integers and floats to be the same thing (and doesn't even specify what the range of integers is...). See https://ghostscript.com/~robin/pdf_reference17.pdf#page=52:

An integer is written as one or more decimal digits optionally preceded by a sign: 123 43445 +17 −98 0 The value is interpreted as a signed decimal integer and is converted to an integer object. If it exceeds the implementation limit for integers, it is converted to a real object.

Wherever a real number is expected, an integer may be used instead and is automatically converted to an equivalent real value. For example, it is not necessary to write the number 1.0 in real format; the integer 1 is sufficient.

In light of this I think it's reasonable to convert all numbers to Python float.

jsvine commented 1 year ago

Thanks for the quick clarification. This is testing my knowledge of Python best practices, but my general impression is that tuples would be a decent fit for this use case. Per the Python docs:

Tuples are immutable, and usually contain a heterogeneous sequence of elements that are accessed via unpacking (see later in this section) or indexing (or even by attribute in the case of namedtuples).

Colors seem to check that list. And the type definition, seems manageable (or at least acceptable by mypy):

color_type = Tuple[Union[int, float], ...]

I'm less concerned about the fact that these tuples can be 1, 3, or 4-length, since every individual tuple's length is fixed. Ideally, getting color space information into pdfplumber will resolve the need to check the length of these tuples, and I'm not sure other types (list, etc.) would provide any advantage in this respect. But perhaps I'm overlooking a particular wrinkle?


In light of this I think it's reasonable to convert all numbers to Python float.

Perhaps I could have been clearer before: The reason for not converting isn't about how ints/floats are handled internally by PDF or Python, or what their specs say, but rather because whether the actual tokens in the PDF use int/float representation can be useful information, particularly when trying to discriminate between various objects in a PDF or between PDFs.

dhdaines commented 1 year ago

Colors seem to check that list. And the type definition, seems manageable (or at least acceptable by mypy):

color_type = Tuple[Union[int, float], ...]

Yes, this is fine, and acceptable to mypy, of course... It's just not a particularly useful type definition, since it won't prevent runtime errors in code that assumes a specific configuration of the tuple in question, and the general use case of tuples is for objects that can be expected to have a well-defined shape.

But it is better than the incorrect type definition in pdfminer.six :-) and the important thing is just that the type definitions match the documentation, which matches reality!

jsvine commented 1 year ago

Roger that. And thanks for the conversation here. I have some fixes in progress, and hope to push in the next version. If additional thoughts on the topic occur to you, feel to keep posting them here.

jsvine commented 1 year ago

These changes have now landed in v0.10.0. I've also added a new docs page specifically re. colors: https://github.com/jsvine/pdfplumber/blob/stable/docs/colors.md

Closing this issue, but feel free to reopen or continue the discussion if you feel like the changes missed the mark. Thanks again for bringing this up, @dhdaines 👍

dhdaines commented 1 year ago

Thank you! This looks very good, nice documentation as well!