Gracefully tokenize invalid objects

From #947.

This PDF

<< /ColorSpace @pgfcolorspaces >>

raises

pdfminer.psparser.PSSyntaxError: Invalid dictionary construct: [/'ColorSpace', /b'@', /b'pgfcolorspaces']

The cause is that << /ColorSpace @pgfcolorspaces >> is actually an invalid PDF object. With the current implementation, @ and pgfcolorspaces are each tokenized as a separate KWD. Ideally this invalid object is tokenized as a single token with a different class from KWD.

The invalid PDF object has a << and >>. This indicates that it is a dictionary object (section 3.2.6). Dictionary keys are names (section 3.2.4) and the value can be any object. But the value in question (@pgfcolorspaces) is not a valid object because it starts with an @:

Booleans (3.2.1) are either true or false
Numerics (3.2.2) are numbers with a potential leading +, - or .
Literal strings (3.2.3) start with a (
Hexadecimal strings (3.2.3) start with a single <
Name objects (3.2.4) start with a /
Array objects (3.2.5) start with a [
Dictionary objects (3.2.6) start with a <<
Stream objects (3.2.7) start with a dictionary
Null objects (3.2.8) are simply null
And indirect objects (3.2.9) start with a numeric object.
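
The list above can be written down as a small check. This is only a sketch to illustrate the reasoning, not code from pdfminer.six: the function name classify_object_start is made up for this example, and it assumes the input is one whitespace-delimited chunk of the PDF.

```python
from typing import Optional


def classify_object_start(token: str) -> Optional[str]:
    """Return the kind of PDF object that `token` could start, or None.

    Hypothetical helper for illustration only. Streams start with a
    dictionary and indirect objects start with a numeric, so they are
    covered by the "dictionary" and "numeric" branches.
    """
    if token in ("true", "false"):
        return "boolean"
    if token == "null":
        return "null"
    first = token[:1]
    if first.isdigit() or first in ("+", "-", "."):
        return "numeric"
    if token.startswith("("):
        return "literal string"
    # Check "<<" before "<", otherwise a dictionary looks like a hex string.
    if token.startswith("<<"):
        return "dictionary"
    if token.startswith("<"):
        return "hexadecimal string"
    if token.startswith("/"):
        return "name"
    if token.startswith("["):
        return "array"
    # Nothing in the list above matches, e.g. a leading "@".
    return None
```

With this check, /ColorSpace is classified as a name, while @pgfcolorspaces matches none of the object types.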

Starting an object with an @ is therefore not an option in the PDF spec. So the question is: how should this unexpected object be tokenized?

Currently the tokenizer checks for a couple of special characters (%/-+.(<>) to recognize most objects (e.g. numerics, strings, names, etc.). If the token starts with an alphabetic character, it checks whether it is a boolean and otherwise assumes it is a multi-character keyword (ending at the first whitespace or special character from above). All non-special, non-alphabetic characters are assumed to be single-character keywords.
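
To make the behaviour described above concrete, here is a toy model of that dispatch. It is a simplified sketch and not the actual pdfminer.six tokenizer: toy_tokenize and the token kind strings are invented for this example, and numerics, strings and comments are left out.

```python
from typing import List, Tuple

Token = Tuple[str, str]  # (kind, text); the kind names are illustrative

# Special characters as listed in the description above.
SPECIAL = set("%/-+.(<>")


def _end_of_word(data: str, start: int) -> int:
    """Index of the first whitespace or special character at or after start."""
    j = start
    while j < len(data) and not data[j].isspace() and data[j] not in SPECIAL:
        j += 1
    return j


def toy_tokenize(data: str) -> List[Token]:
    """Toy model of the current dispatch (numerics, strings, comments omitted)."""
    tokens: List[Token] = []
    i = 0
    while i < len(data):
        ch = data[i]
        if ch.isspace():
            i += 1
        elif data.startswith("<<", i) or data.startswith(">>", i):
            tokens.append(("KWD", data[i:i + 2]))
            i += 2
        elif ch == "/":
            j = _end_of_word(data, i + 1)
            tokens.append(("NAME", data[i + 1:j]))
            i = j
        elif ch.isalpha():
            j = _end_of_word(data, i)
            word = data[i:j]
            tokens.append(("BOOL" if word in ("true", "false") else "KWD", word))
            i = j
        else:
            # Any other character becomes a keyword of its own.
            tokens.append(("KWD", ch))
            i += 1
    return tokens
```

In this model, toy_tokenize("<< /ColorSpace @pgfcolorspaces >>") yields a NAME followed by two separate KWD tokens for @ and pgfcolorspaces, which matches the three items shown in the error message.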

This works well for PDFs with correct syntax. For example, the array keywords ([ and ]) are parsed correctly, and so are indirect objects (which use the keyword R). But for PDFs with incorrect syntax it puts each "unexpected" character on its own in a KWD. Ideally pdfminer.six distinguishes between expected, known keywords (using KWD) and other unexpected characters (using another class). I think that is the preferred solution here.
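
A rough sketch of what that distinction could look like is below. All names here (Keyword, UnexpectedToken, KNOWN_KEYWORDS, classify_bare_token) are hypothetical and chosen for this example; they are not the classes used in pdfminer.six, and the keyword set is only an illustrative subset.

```python
from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True)
class Keyword:
    """A keyword that the PDF syntax actually defines (hypothetical class)."""
    name: str


@dataclass(frozen=True)
class UnexpectedToken:
    """Anything that is neither a valid object nor a known keyword (hypothetical class)."""
    text: str


# Illustrative subset only; a real implementation would need the full list.
KNOWN_KEYWORDS = frozenset({"<<", ">>", "[", "]", "R", "obj", "endobj"})


def classify_bare_token(text: str) -> Union[Keyword, UnexpectedToken]:
    """Classify a chunk that did not match any regular object type.

    Assumes `text` is the whole run of characters up to the next
    whitespace, so "@pgfcolorspaces" stays together as one token.
    """
    if text in KNOWN_KEYWORDS:
        return Keyword(text)
    return UnexpectedToken(text)
```

With this split, R is still a Keyword, while @pgfcolorspaces becomes a single UnexpectedToken that the dictionary parser can handle or skip deliberately instead of raising PSSyntaxError.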

Gracefully tokenize invalid objects #968