Cannot extract text from PDF (internal font naming issue)

Hi. While trying to extract the text from the PDF test.pdf, I was only able to extract "\n" characters. Here is what my investigation found:

While parsing "commandsText" (PdfParser::getCommandsText), I ran into an interesting issue: given a line such as "/f-0-0 1 Tf", the parser tries to match it against two regular expressions, both of which assume that the font "id" never contains "-" symbols. In this file it does.

As a result, neither expression matches, parsing stops without any error message, and the output is incorrect.

My fix is (one more regexp):

elseif (preg_match(
    '/^\/([A-Z0-9\._,\+\-]+\s+[0-9]+)\s+([A-Z]+)\s*/si',
    substr($text_part, $offset),
    $matches
)) {
    // Font ids such as "f-0-0" contain "-", so the character class allows it.
    $operator = $matches[2];
    $command  = $matches[1];
    $offset  += strlen($matches[0]);
}
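
To illustrate the difference, here is a minimal standalone sketch that runs the pattern from the fix against the problematic line. The helper `matchFontCommand` and the `$withoutHyphen` pattern are hypothetical: the latter is only a stand-in for a pattern that does not allow "-", not a copy of the regexps currently in the library.

```php
<?php
// Pattern from the proposed fix: the character class allows "-" in the font id.
$withHyphen = '/^\/([A-Z0-9\._,\+\-]+\s+[0-9]+)\s+([A-Z]+)\s*/si';

// Hypothetical stand-in for a pattern that does not allow "-"
// (an assumption for illustration, not the library's actual regexp).
$withoutHyphen = '/^\/([A-Z0-9\._,\+]+\s+[0-9]+)\s+([A-Z]+)\s*/si';

$line = '/f-0-0 1 Tf';

// Hypothetical helper: returns the parsed parts, or null if the pattern does not match.
function matchFontCommand(string $pattern, string $line): ?array
{
    if (preg_match($pattern, $line, $matches) !== 1) {
        return null;
    }

    return [
        'command'  => $matches[1],
        'operator' => $matches[2],
        'length'   => strlen($matches[0]),
    ];
}

var_dump(matchFontCommand($withoutHyphen, $line)); // no match: NULL
var_dump(matchFontCommand($withHyphen, $line));    // command "f-0-0 1", operator "Tf"
```

With the hyphen allowed, the whole line is consumed, so `$offset` would advance past it and parsing could continue with the next command.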

A pull request is coming soon. Any remarks before I send it?

smalot / pdfparser

Cannot extract text from PDF (internal font naming issue) #145