Hi.
While trying to extract the text from the PDF test.pdf, I was only able to extract "\n"s.
My investigation results:
While parsing "commandsText" (PdfParser::getCommandsText), there is an interesting issue:
Given a line "/f-0-0 1 Tf", it tries to parse it with two regular expressions, both of which assume that the font "id" will not contain "-" symbols, while in this file it actually does.
As a result, it stops parsing without any error message and the output is incorrect.
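
To illustrate the failure mode, here is a minimal sketch in Python. It is not the library's actual code: both patterns and the `parse_tf` helper are hypothetical, made up only to show how a font-id pattern that excludes "-" silently fails on this line, and how a more permissive one would accept it.

```python
import re

# Hypothetical strict pattern: assumes the font id contains only word characters.
STRICT_TF = re.compile(r"^/(\w+)\s+(\d+(?:\.\d+)?)\s+Tf$")

# Hypothetical relaxed pattern: accepts any run of characters that are not
# whitespace or "/" (a simplification of what a PDF name may contain).
RELAXED_TF = re.compile(r"^/([^\s/]+)\s+(\d+(?:\.\d+)?)\s+Tf$")


def parse_tf(line, pattern):
    """Return (font_id, size) for a Tf command, or None if the pattern does not match."""
    match = pattern.match(line.strip())
    if match is None:
        # No exception is raised here, which mirrors the silent failure described above.
        return None
    return match.group(1), float(match.group(2))


line = "/f-0-0 1 Tf"
print(parse_tf(line, STRICT_TF))   # None
print(parse_tf(line, RELAXED_TF))  # ('f-0-0', 1.0)
```

Because the strict pattern simply returns no match, a caller that does not check for this case will skip the font selection without any warning.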
My fix is (one more regexp):
Pull request coming soon. Or maybe some remarks?