The cause is that << /ColorSpace @pgfcolorspaces >> is actually an invalid PDF object. With the current implementation, @ and pgfcolorspaces are each tokenized as a separate KWD. Ideally, the invalid value would be tokenized as a single token of a class other than KWD.
The invalid PDF object is delimited by << and >>, which marks it as a dictionary object (section 3.2.6). Dictionary keys are names (section 3.2.4) and the values can be any object. But the value in question (@pgfcolorspaces) is not a valid object, because no object type starts with an @:
- Booleans (3.2.1) are either true or false
- Numerics (3.2.2) are numbers with a potential leading +, - or .
- Literal strings (3.2.3) start with a (
- Hexadecimal strings (3.2.3) start with a single <
- Name objects (3.2.4) start with a /
- Array objects (3.2.5) start with a [
- Dictionary objects (3.2.6) start with a <<
- Stream objects (3.2.7) start with a dictionary
- Null objects (3.2.8) are simply null
- Indirect objects (3.2.9) start with a numeric object
So starting an object with an @ is not allowed by the PDF spec. The question is then: how should this unexpected object be tokenized?
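The rules above can be sketched as a small lookup on the leading bytes of an object. This is only an illustration of the spec rules, not code from pdfminer.six; the function name `object_kind` is made up for this example, and it deliberately looks at the leading bytes only.

```python
from typing import Optional


def object_kind(data: bytes) -> Optional[str]:
    """Guess the PDF object type from the leading bytes.

    Hypothetical helper for illustration. Returns None when no object
    type from sections 3.2.1 to 3.2.9 can start this way.
    """
    data = data.lstrip()
    if data.startswith((b"true", b"false")):
        return "boolean"
    if data.startswith(b"null"):
        return "null"
    if data.startswith(b"<<"):
        return "dictionary"  # a stream object also begins with a dictionary
    if data.startswith(b"<"):
        return "hexadecimal string"
    if data.startswith(b"("):
        return "literal string"
    if data.startswith(b"/"):
        return "name"
    if data.startswith(b"["):
        return "array"
    if data[:1] in (b"+", b"-", b".") or data[:1].isdigit():
        return "numeric"  # an indirect object also begins with a number
    return None


print(object_kind(b"/ColorSpace"))
print(object_kind(b"@pgfcolorspaces"))
```

The second call falls through every branch, which is the situation the tokenizer has to handle.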
Currently the tokenizer checks for a couple of special characters (%/-+.(<>) to recognize most objects (e.g. numerics, strings, names, etc.). If the token starts with an alphabetical character, it checks whether it is a boolean and otherwise assumes it is a multi-character keyword (ending at the first whitespace or special character from above). Every character that is neither special nor alphabetical is assumed to be a keyword on its own.
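To make the effect concrete, here is a heavily simplified model of that behaviour. It is not the pdfminer.six implementation: the regular expression, the delimiter set and the string labels (`NAME`, `NUM`, `BOOL`, `KWD`) are assumptions chosen for this sketch, and strings and comments are left out entirely.

```python
import re
from typing import List, Tuple

# One alternative per token class, tried in order. The last alternative models
# the fallback described above: any other single character becomes a keyword.
TOKEN = re.compile(
    r"""
    (?P<NAME>/[^\s%/()<>\[\]{}]*)            # name object
  | (?P<NUM>[-+]?(?:\d+\.?\d*|\.\d+))        # integer or real number
  | (?P<DICT><<|>>)                          # dictionary delimiters
  | (?P<WORD>[A-Za-z][^\s%/()<>\[\]{}]*)     # boolean or multi-character keyword
  | (?P<OTHER>\S)                            # anything else, one character at a time
    """,
    re.VERBOSE,
)


def tokenize(text: str) -> List[Tuple[str, str]]:
    """Return (class, value) pairs; whitespace between tokens is skipped."""
    tokens = []
    for match in TOKEN.finditer(text):
        kind, value = match.lastgroup, match.group()
        if kind == "WORD":
            kind = "BOOL" if value in ("true", "false") else "KWD"
        elif kind in ("DICT", "OTHER"):
            kind = "KWD"
        tokens.append((kind, value))
    return tokens


for token in tokenize("<< /ColorSpace @pgfcolorspaces >>"):
    print(token)
```

In this model the @ ends up in the catch-all branch and pgfcolorspaces in the keyword branch, so the single invalid value is split into two KWD tokens.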
This works well for PDFs with correct syntax. For example, the array keywords ([ and ]) are parsed correctly, and so are indirect objects (which use the keyword R). But for PDFs with incorrect syntax it puts each "unexpected" character on its own in a KWD. Ideally pdfminer.six would distinguish between expected, known keywords (using KWD) and other unexpected characters (using another class). I think that is the preferred solution here.
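One possible shape for that distinction, as a rough sketch only: the class names `Keyword` and `Unexpected` and the contents of `KNOWN_KEYWORDS` are hypothetical and do not exist in pdfminer.six, and the example assumes tokens are separated by whitespace, which real PDF syntax does not guarantee.

```python
import re
from dataclasses import dataclass
from typing import List, Union

# Assumption: a small illustrative set; a real tokenizer would know more keywords.
KNOWN_KEYWORDS = {"<<", ">>", "[", "]", "R", "obj", "endobj", "stream", "endstream"}
NUMBER = re.compile(r"[-+]?(?:\d+\.?\d*|\.\d+)")


@dataclass(frozen=True)
class Keyword:
    """An expected, known keyword (the role KWD plays today)."""
    name: str


@dataclass(frozen=True)
class Unexpected:
    """Hypothetical class for input that matches no object or keyword."""
    raw: str


Token = Union[str, float, Keyword, Unexpected]


def classify(word: str) -> Token:
    if word in KNOWN_KEYWORDS:
        return Keyword(word)
    if word.startswith("/"):
        return word  # name object, kept as a plain string in this sketch
    if NUMBER.fullmatch(word):
        return float(word)
    return Unexpected(word)  # kept whole instead of being split into keywords


def tokenize(text: str) -> List[Token]:
    return [classify(word) for word in text.split()]


for token in tokenize("<< /ColorSpace @pgfcolorspaces >>"):
    print(token)
```

With a separate class, code that consumes the tokens can decide explicitly whether to skip, warn about, or reject the unexpected value, instead of mistaking it for a keyword.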
From #947.
This PDF
raises