smalot / pdfparser

PdfParser, a standalone PHP library, provides various tools to extract data from a PDF file.
GNU Lesser General Public License v3.0
2.3k stars 534 forks source link

Fix for two bugs related to Unicode translation support by Font objects #698

Closed unixnut closed 1 month ago

unixnut commented 2 months ago

Symptom was that some documents' contents was rendering as a bunch of control characters. These are the untranslated strings. This was happening because for two different reasons, these strings weren't being translated \Smalot\PdfParser\Font::decodeContent() in some circumstances.

First fix is to \Smalot\PdfParser\Font::loadTranslateTable():

Second fix is for documents that attach their Font objects to the Pages object instead of each Page object:

Type of pull request

About

Checklist for code / configuration changes

In case you changed the code/configuration, please read each of the following checkboxes as they contain valuable information:

unixnut commented 2 months ago

Unit tests have been requested by @k00ni but I lack the time to create them at this point. I think integration tests would be more useful given that the code deals with fonts attached to Pages objects, which only applies to some documents.

GreyWyvern commented 2 months ago

The first "fix" is very easy to understand; just making a regexp more robust.

The second fix is more difficult to comprehend at first. For sure I would want a unit test to provide a "broken-before", "fixed-after" demonstration. Does it even fix a specific reported issue? Is there an example PDF we can see that shows the error?

unixnut commented 2 months ago

@k00ni @GreyWyvern The PDFs that were experiencing this problem cannot be shared because of privacy reasons. I'm not sure how to generate an example document with fonts attached to the Pages object.

Can I please get some assistance in writing the integration tests for this issue?

k00ni commented 2 months ago

@unixnut sorry for the late response. I added two integration tests, but had to adapt Pages class a bit. Please have a look if its OK for you. If there are no objections, we are done here and ready to merge.

CC @GreyWyvern

GreyWyvern commented 2 months ago

After some review of the added tests, it doesn't seem like anything is actually being tested. The tests are using mock documents and setting them up in such a way that the PR code added runs successfully. However, it doesn't appear to be solving any actual issue.

The first test function tests the expectation that the added function setFonts() is called exactly once, which it is, because of the newly added code. Yet this isn't an error condition that existed before this PR was created. The data supplied for testing in a unit test should be able to cause an error in the unmodified PdfParser. In this case, the test is just testing that the added code is running, and not that it's performing some kind of fix for an issue.

The second test adds a font to $page1, then performs a bunch of seemingly unrelated actions (comments in the code needed?) before testing whether $page1->getFonts() returns the font that was added at the top of the function. It seems this should always be true? When I remove lines 106-117 of PagesTest.php the assertion still passes. What is the rest of the code testing for? The comment at the top of the function also seems duplicated from the function above.

I would really prefer that there be a test document added for this change that clearly shows a "this was not working before the PR, and now it's working" conclusion. As it stands, I still don't know what (at least the second part of) this PR is actually doing.

k00ni commented 2 months ago

@unixnut we need your feedback here, please comment.

@GreyWyvern thanks for your feedback, my comments are below:

The first test function tests the expectation that the added function setFonts() is called exactly once, which it is, because of the newly added code. Yet this isn't an error condition that existed before this PR was created. The data supplied for testing in a unit test should be able to cause an error in the unmodified PdfParser. In this case, the test is just testing that the added code is running, and not that it's performing some kind of fix for an issue.

I based my work on the assumption, that there is a PDF which fails to parse without this changes, as @unixnut said here https://github.com/smalot/pdfparser/pull/698#issuecomment-2065498305 . The PDF might be faulty in the first place, but I assume its valid and PDFParser is missing something for now.

Sorry if I am too picky here, but its an integration test not a unit test you are talking about. The reason is that I use a few actual classes together with a with mocks. Method setFonts is not called if $deep is false or Pages instance has no kids. I am a bit confused, without the new code this test would fail, because there is no setFonts function, or did I miss something in your comment?

About the last part of your comment, a test can be used for various things. One is to check if a functionality is implemented. It doesn't have to be a check for a fix.

The second test adds a font to $page1, then performs a bunch of seemingly unrelated actions (comments in the code needed?) before testing whether $page1->getFonts() returns the font that was added at the top of the function. It seems this should always be true? When I remove lines 106-117 of PagesTest.php the assertion still passes. What is the rest of the code testing for? The comment at the top of the function also seems duplicated from the function above.

Having thought about your comment and looking again on my code I see a few problems. The addition of a dummy class to pre-set fonts variable is not relevant here. It should be more about the conditions which lead to the setupFonts call (e.g. if $this->has('Kids') returns true or false, ref). I will propose something soon.

I would really prefer that there be a test document added for this change that clearly shows a "this was not working before the PR, and now it's working" conclusion. As it stands, I still don't know what (at least the second part of) this PR is actually doing.

@unixnut can you provide us the PDF in private?

k00ni commented 2 months ago

I've checked PagesTest again and added a few comments to make it more clear whats going on. Hope it helps, but if not, please comment.


I partly revise my previous comment about the test testPullRequest698DontOverride

First of all, PagesDummy was introduced to avoid the tedious work to setup fonts in Pages instance. Because of internal structure of PDFParser, you need to traverse over a couple of objects to get what you want. setFonts bypasses this.

If you remove line 106-117 the test still passes, because setupFonts works correctly. It must not override fonts of Page instances, if they are already set. Without its check the code in line 106-117 would lead to a replacement of these Font instances.

Having $font1 and $font2 to be an actual class and a mock is helpful in this case, because testing for equality leads to an error if they are not both of the same kind. The idea is to see if Page instances keep their fonts. I was thinking how to differentiate one from another, other than providing each one with a different bunch of dummy instances/mocks. I am glad I've found this, because it makes the test code leaner IMHO.

In the end all this code is obsolete if we can't confirm there is a PDF available which can not be parsed with the unfixed PDFParser.

GreyWyvern commented 2 months ago

If you remove line 106-117 the test still passes, because setupFonts works correctly. It must not override fonts of Page instances, if they are already set. Without its check the code in line 106-117 would lead to a replacement of these Font instances.

Okay, this functionality makes sense. My main concern is that the setFonts() method is added by this PR, and the Integration tests require its use for testing. This means that the tests can't be back-ported to an older version of PdfParser to see if they fail there and succeed here; they just cause an error for a missing function.

Is there a way we can test for the correct handling of fonts between Page and Pages objects without using any of the added functions?

k00ni commented 1 month ago

Okay, this functionality makes sense. My main concern is that the setFonts() method is added by this PR, and the Integration tests require its use for testing. This means that the tests can't be back-ported to an older version of PdfParser to see if they fail there and succeed here; they just cause an error for a missing function. Is there a way we can test for the correct handling of fonts between Page and Pages objects without using any of the added functions?

Good question. We could check if constructed instance(s) of Pages and Page contain correct Font instances, without explicitly use/check for setFonts. I have no complete plan right now, but will think about it.

To help constructing such a test, first, I would want know if there are tests which use the new functionality. I ran all tests but let the new method Page::setFonts threw an exception

public function setFonts($fonts)
{
    if (empty($this->fonts)) {
        if (0 < count($fonts)) { //                 <=====
            throw new Exception('fff');
        }

        $this->fonts = $fonts;
    }
}

and got the following list of "failing" tests:

A possible next step could be to check fonts in Pages + Page + the PDF itself. Then compare and see if the function works. Thats as far as I can go for now, but I hope I have further time soon.


Ping @unixnut

unixnut commented 1 month ago

Hi, @k00ni @GreyWyvern . I cannot provide the PDFs to anyone because they contain private medical data and I'm under NDA.

The second change allows PDFParser to open files that store fonts differently to what the author expected. These files open fine with Poppler-based tools.

I don't know which tool is being used to generate these files. All I know is that these changes are essential to be able to open them. Any assistance from anyone able to generate PDF files with various structures would be helpful.

Just FYI, I'm hoping we can wrap this PR up soon so I can make one for the working PDF decryption code I have written.

I am fine with your changes mentioned above, @k00ni .

k00ni commented 1 month ago

@GreyWyvern what do you think about this small test?

https://github.com/smalot/pdfparser/pull/698/files#diff-b13743b1a6f35b028e4476142bf872adb07a012dd619fc42f722aff19bb0152aR150-R187

I used the PDF of one of the tests which were shown in my little experiment here: https://github.com/smalot/pdfparser/pull/698#issuecomment-2095298642

Therefore one can assume the Font instances were created in Pages and later on passed down to each Page instance. The test doesn't depend on the setFonts method explicitly, but I am not sure though, if it wholly covers the new functionality and is what you meant before.

k00ni commented 1 month ago

@GreyWyvern I would like to merge this.

You asked if there is a way to test for the correct handling of fonts between Page and Pages objects without using any of the added functions? Does my (minimal) example test help you in this regard?

GreyWyvern commented 1 month ago

You asked if there is a way to test for the correct handling of fonts between Page and Pages objects without using any of the added functions? Does my (minimal) example test help you in this regard?

Sorry, I missed this. I don't think this really works because the test succeeds in the current version of PdfParser without this PR. Let me see if I can come up with something.

GreyWyvern commented 1 month ago

After looking at your original testPullRequest698DontOverride() it turns out I was wrong and this is actually almost what we need, @k00ni. The key change is that we shouldn't use the setFonts() method on the $page object (leave it empty) to be sure that the fonts are being passed from Pages to Page. We can use setFonts() on $pages because it's declared in the PagesDummy class earlier in the test file.

Here's what I came up with:

    public function testFontsArePassedFromPagesToPage(): void
    {
        // Create mock Document, Font and Page objects
        $document = $this->createMock(Document::class);
        $font1 = new Font($document);
        $page = new Page($document);

        // Create a Header object that indicates $page is a child
        $header = new Header([
            'Kids' => new ElementArray([
                $page,
            ]),
        ], $document);

        // Use this header to create a mock Pages object
        $pages = new PagesDummy($document, $header);

        // Apply $font1 as a Font object to this Pages object;
        // setFonts is used here as part of PagesDummy, only to access
        // the protected Pages::fonts variable; it is not a method
        // available in production
        $pages->setFonts([$font1]);

        // Trigger setupFonts method in $pages
        $pages->getPages(true);

        // Since the $page object font list is empty, $font1 from Pages
        // object must be passed to the Page object
        $this->assertEquals([$font1], $page->getFonts());

        // Create a second $font2 using a different method
        $font2 = $this->createMock(Font::class);

        // Update the fonts in $pages
        $pages->setFonts([$font1, $font2]);

        // Trigger setupFonts method in $pages
        $pages->getPages(true);

        // Now that $page already has a font, updates from $pages
        // should not overwrite it
        $this->assertEquals([$font1], $page->getFonts());
    }

At the first assertEquals() call, without the added code from this PR, $page->getFonts() returns an empty array. With the added code from the PR, the test succeeds.

Since in the current version of PdfParser fonts from Pages aren't passed from Pages to child Page objects at all, testing that existing fonts in Page objects aren't overwritten by those from the parent Pages object is not something we can say "wasn't working before and now it is" because it's essentially a new feature. The second assertEquals() call tests that updated fonts from Pages don't overwrite existing fonts in child Page objects.

This should be the only unit test needed.

k00ni commented 1 month ago

@GreyWyvern Thanks, yes that's better. I deployed your test and I think we have it. Any objections left or can we merge?

k00ni commented 1 month ago

Thank you for your effort and time here @unixnut @GreyWyvern.