The buffering in `PSBaseParser` (and associated code contortions like the mess that was `PDFContentParser`) was fragile but also unnecessary when doing file I/O: CPython's `BufferedReader` implementation is considerably faster, so we can just reimplement the "parser" (really a lexer) as a character-based state machine.
But on its own this would lead to an overall slowdown, because in reality, most of the time we aren't parsing PDF data from a buffered file, but from a `BytesIO` wrapped around an in-memory buffer. In this case, the buffering is redundant but nonetheless faster in practice since it avoids the overhead of calling `BytesIO.read` repeatedly.
The obvious solution is to create a separate "parser" (really a lexer) using good old regular expressions the way our ancestors intended, and simply use this one when passed an in-memory buffer.
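
As a rough illustration of the regular-expression approach for in-memory data, here is a minimal sketch. The `TOKEN` pattern and the `lex` function are hypothetical names invented for this example, not the code in this PR, and it deliberately skips string and hex-string handling.

```python
import re
from typing import Iterator, Tuple

# Hypothetical token pattern: one named group per token class.
# PDF whitespace is NUL, HT, LF, FF, CR and SP.
TOKEN = re.compile(
    rb"""
    (?P<whitespace>[\x00\t\n\x0c\r ]+)
  | (?P<comment>%[^\r\n]*)
  | (?P<number>[-+]?(?:\d+\.\d*|\.\d+|\d+))
  | (?P<name>/[^\x00\t\n\x0c\r ()<>\[\]{}/%]*)
  | (?P<keyword>[^\x00\t\n\x0c\r ()<>\[\]{}/%]+)
  | (?P<delimiter><<|>>|[\[\]{}])
    """,
    re.VERBOSE,
)


def lex(data: bytes) -> Iterator[Tuple[str, bytes]]:
    """Yield (kind, raw_bytes) pairs from an in-memory buffer."""
    pos = 0
    while pos < len(data):
        m = TOKEN.match(data, pos)
        if m is None:
            # This sketch does not handle strings; skip the byte.
            pos += 1
            continue
        pos = m.end()
        if m.lastgroup not in ("whitespace", "comment"):
            yield m.lastgroup, m.group()


tokens = list(lex(b"/Type /Page 12 0 R << >> % note\n3.5"))
```

Because the whole buffer is already in memory, the regex engine can scan it directly and no `read` calls are involved.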
This also means that there is a bit less inheritance abuse in the code, as `PSStackParser` needs to delegate to the appropriate implementation.
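
The delegation could look roughly like the sketch below. The class names `InMemoryLexer`, `FileLexer` and `StackParser` are placeholders made up for this example, and the selection rule (bytes-like or `BytesIO` means in-memory) is an assumption rather than the exact logic in this PR.

```python
import io
from typing import BinaryIO, Union


class InMemoryLexer:
    """Placeholder for the regex-based lexer over a bytes buffer."""

    def __init__(self, data: bytes) -> None:
        self.data = data


class FileLexer:
    """Placeholder for the character-based state machine over a file."""

    def __init__(self, fp: BinaryIO) -> None:
        self.fp = fp


class StackParser:
    """Holds a lexer instead of inheriting from one."""

    def __init__(self, source: Union[bytes, BinaryIO]) -> None:
        if isinstance(source, (bytes, bytearray, memoryview)):
            self._lexer = InMemoryLexer(bytes(source))
        elif isinstance(source, io.BytesIO):
            # Uses the whole underlying buffer, ignoring the current position.
            self._lexer = InMemoryLexer(source.getvalue())
        else:
            self._lexer = FileLexer(source)


from_bytes = StackParser(b"1 2 add")
from_bytesio = StackParser(io.BytesIO(b"1 2 add"))
from_file = StackParser(io.BufferedReader(io.BytesIO(b"1 2 add")))
```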
Fixes: #885 and #1025
Also, some details of the PDF parsing were incorrect. Most notably, hex strings with odd length are supposed to be padded with a trailing zero, in big-endian fashion (i.e. `<abcde>` is equivalent to `<abcde0>`), but the existing code treated this as `<abcd0e>` instead.
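
To make the padding rule concrete, here is a small standalone sketch. `decode_hex_string` is a hypothetical helper written for this description, not a function from the codebase; it takes the text between `<` and `>`.

```python
import re

# PDF whitespace characters, which are ignored inside hex strings.
PDF_WHITESPACE = re.compile(rb"[\x00\t\n\x0c\r ]+")


def decode_hex_string(body: bytes) -> bytes:
    """Decode the contents of a PDF hex string (without the <> delimiters)."""
    digits = PDF_WHITESPACE.sub(b"", body)
    if len(digits) % 2:
        # Odd length: the missing final digit is assumed to be 0.
        digits += b"0"
    return bytes.fromhex(digits.decode("ascii"))


padded = decode_hex_string(b"abcde")
```

With this rule `<abcde>` decodes to the bytes `ab cd e0`, not `ab cd 0e`.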
Tested on the usual test suite with nox, profiled with cProfile and time.time.
Checklist