smalot / pdfparser

PdfParser, a standalone PHP library, provides various tools to extract data from a PDF file.
GNU Lesser General Public License v3.0
2.37k stars 538 forks source link

fix: Fail to identify fonts (name and size) #629 #630

Closed mbideau-atreal closed 1 year ago

mbideau-atreal commented 1 year ago

In order to solve #629 I identified that BT commands were resetting the font name and size previously defined by a Tf command... all the time. So I just removed the font (over)initialization from the BT code, and all the font names and size were showing up. The (first) initialization of $fontId and $fontSize is already done before the loop starts, BTW.

PS: I don't know anything about PDF standard, so please take my fix with caution.

Thanks for this great piece of code. Best regards.

k00ni commented 1 year ago

CC @GreyWyvern Just a hint, because this is related to fonts, which was the topic of one of recent your pull requests.

GreyWyvern commented 1 year ago

My own PR that I'm working on removes these same lines from Page.php as unnecessary as well. If font can be set outside BT ... ET blocks then there is no need to reset it each time we enter a new one.

k00ni commented 1 year ago

@mbideau-atreal I need a simple unit-test to prove the change is working and to avoid a regression in the future. Can you do that? If not, I can suggest one.

GreyWyvern commented 1 year ago

@mbideau-atreal May we use your sample PDF 20230803-160138-lettretype-arrete.pdf in the PdfParser test suite? Then we could add to the bottom of PageTest::testGetDataTm():

        // Check that BT and ET do not reset the font
        $config = new Config();
        $config->setDataTmFontInfoHasToBeIncluded(true);

        $filename = $this->rootDir.'/samples/bugs/Issue629.pdf';
        $parser = $this->getParserInstance($config);
        $document = $parser->parseFile($filename);
        $pages = $document->getPages();
        $page = end($pages);
        $dataTm = $page->getDataTm();

        $this->assertCount(4, $dataTm[0]);
        $this->assertEquals('F2', $dataTm[0][2]);

The first test should always be true if setDataTmFontInfoHasToBeIncluded is true. The second test is only true if $fontId and $fontSize lines are removed from Page.php.

I would also recommend removing the same lines from further down under case 'ET': as well.

mbideau-atreal commented 1 year ago

Sorry for my delay. Great work :+1: And yes, you can use my PDF in your test case... thank you very much for asking.

mbideau-atreal commented 1 year ago

Nothing left to do here, on my part :wink:

After the merge, if you can, I would really need some feedback on the issue #570 regarding the same PDF sample document :pray:

Thanks again.