phpdave11 / gofpdi

Go Free PDF Document Importer
MIT License

gofpdi fails to correctly parse streams on some pdfs #53

Open napalu opened 2 years ago

napalu commented 2 years ago

When reading some PDFs (I have typically seen this when importing scanned-in PDFs), gofpdi fails to detect 'endstream' and panics with panic: Failed to get content: Failed to get page content: Failed to resolve object: Expected next token to be: endstream, got: dstream.

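When reading a PDF stream, the reader should start reading the stream data immediately after the first end-of-line marker (CRLF, or a single LF) that follows the 'stream' keyword. Instead, it skips all leading whitespace. If the stream data itself begins with whitespace bytes, the data start is shifted forward, so reading the declared length runs past the end of the data and into the 'endstream' token.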
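To illustrate the difference, here is a minimal sketch in Go. The helpers streamDataStart and greedyDataStart are hypothetical names written for this example and are not gofpdi's actual code; the sketch also assumes the position just past the 'stream' keyword and the stream length are already known.

```go
package main

import "fmt"

// streamDataStart is a hypothetical helper (not part of gofpdi).
// pos is the index just past the "stream" keyword in buf. It skips exactly
// one end-of-line marker (CRLF or a single LF) and returns the index of the
// first byte of stream data. Any other byte at pos is left untouched.
func streamDataStart(buf []byte, pos int) int {
	if pos+1 < len(buf) && buf[pos] == '\r' && buf[pos+1] == '\n' {
		return pos + 2
	}
	if pos < len(buf) && buf[pos] == '\n' {
		return pos + 1
	}
	return pos
}

// greedyDataStart mimics the behaviour described in this issue: it skips
// every leading whitespace byte, including bytes that belong to the data.
func greedyDataStart(buf []byte, pos int) int {
	for pos < len(buf) {
		switch buf[pos] {
		case ' ', '\t', '\r', '\n', '\f', 0:
			pos++
		default:
			return pos
		}
	}
	return pos
}

func main() {
	// Assumed example: a 5-byte stream whose data starts with LF and a space.
	raw := []byte("stream\r\n\n ABC\nendstream")
	const length = 5
	after := len("stream")

	ok := streamDataStart(raw, after)
	bad := greedyDataStart(raw, after)

	fmt.Printf("correct data: %q, rest: %q\n", raw[ok:ok+length], raw[ok+length:])
	fmt.Printf("greedy data:  %q, rest: %q\n", raw[bad:bad+length], raw[bad+length:])
}
```

In the greedy case the two whitespace bytes that belong to the data are skipped, so reading the declared length consumes the first byte of 'endstream' and the next token is truncated, which matches the kind of failure in the panic message above.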

Here's a test PDF that exhibits the described behaviour: BRW2C6FC94B5488_000827.pdf