smalot / pdfparser

PdfParser, a standalone PHP library, provides various tools to extract data from a PDF file.
GNU Lesser General Public License v3.0

Cannot extract text from PDF (internal font naming issue) #145

Closed (madmax-inc closed this issue 1 year ago)

madmax-inc commented 7 years ago

Hi. When I tried to extract the text from the PDF test.pdf, the only output I got was a series of "\n" characters. Here is what my investigation found:

While parsing "commandsText" (PdfParser::getCommandsText), there is an interesting issue: given a line such as "/f-0-0 1 Tf", the parser tries to match it with two regular expressions, both of which assume that the font "id" never contains the "-" symbol, although here it does.

As a result, parsing stops without any error message and the output is incorrect.

My fix adds one more regexp branch, which allows "-" in the font id:

    elseif (preg_match(
        '/^\/([A-Z0-9\._,\+\-]+\s+[0-9]+)\s+([A-Z]+)\s*/si',
        substr($text_part, $offset),
        $matches
    )) {
        $operator = $matches[2];
        $command  = $matches[1];
        $offset  += strlen($matches[0]);
    }
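To see the effect of the extra branch in isolation, here is a minimal standalone sketch. It assumes a hypothetical stricter pattern (`$withoutHyphen`) that differs from the proposed one only by lacking `\-` in the character class. That pattern is an illustrative stand-in, not a copy of either expression in the library. The variable names are also made up for this example.

```php
<?php
// Hypothetical stand-in for a pattern whose font-id character class lacks "-".
// NOT the library's actual expression; it only illustrates the failure mode.
$withoutHyphen = '/^\/([A-Z0-9\._,\+]+\s+[0-9]+)\s+([A-Z]+)\s*/si';

// Pattern from the proposed fix: same as above, plus "\-" in the class.
$withHyphen = '/^\/([A-Z0-9\._,\+\-]+\s+[0-9]+)\s+([A-Z]+)\s*/si';

// The command line from the report.
$line = '/f-0-0 1 Tf';

// Without "-" in the class, the font id stops at "f", the following
// whitespace requirement fails, and preg_match() returns 0.
$strictResult = preg_match($withoutHyphen, $line);

// With "-" allowed, the whole command is consumed.
$fixedResult = preg_match($withHyphen, $line, $matches);

$command  = $matches[1];          // font id plus size: "f-0-0 1"
$operator = $matches[2];          // operator: "Tf"
$consumed = strlen($matches[0]);  // number of characters to advance the offset by

var_dump($strictResult, $fixedResult, $command, $operator, $consumed);
```

Because the `i` modifier is set, the `[A-Z]` ranges also match lowercase letters, which is why the lowercase "f" in the font id and the mixed-case "Tf" operator are accepted.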

A pull request is coming soon. Any remarks in the meantime?

GreyWyvern commented 1 year ago

@madmax-inc PR #614 fixes this issue. May we use your test.pdf file in the PdfParser test suite?