smalot / pdfparser

PdfParser, a standalone PHP library, provides various tools to extract data from a PDF file.
GNU Lesser General Public License v3.0

Major Update to PDFObject.php + Ancillary #634

Closed: GreyWyvern closed this pull request 11 months ago

GreyWyvern commented 1 year ago

Okay, here it is! I fully expect this will take quite some time to review and merge and likely require many more commits before it's ready to go; please mark it as a draft if that fits better.

It fully passes all unit tests here (PHP 8.2.7), even a couple that were marked "linux only", from which I removed that restriction. Several existing unit test assertions have been altered simply because the update now parses document stream data differently, and thus generates arrays of commands differently. For example: BT commands (as well as several others) are now stored instead of discarded, and whitespace outside of (strings) and <<dictionaries>> is normalized.

However, since it parses document stream data in a completely new way, what I'm most interested in is whether or not it causes new errors in PDFs that aren't in the test suite. So I hope quite a few people decide to test drive this.


PDFObject.php

This is a major update to the PDFObject.php file. Where previously PdfParser would attempt to gather document stream data using a series of multiline regular expressions focused on BT ... ET blocks, this update adds a new function, formatContent(), which takes over from cleanContent() and considers the entire document stream, normalizing it in several steps so that it can be parsed line by line.

By using this system, it is much easier to examine and parse the document stream line by line, instead of relying on multiline PCRE extraction. getSectionsText() has been updated to do just this, stepping through the output of formatContent() line by line and returning an array of only the relevant commands needed to position and display text.
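
As a rough sketch of what the line-by-line approach looks like (the variable names and the operator filter below are simplified illustrations, not the PR's actual code):

    // Assumes $formatted holds the output of formatContent(): whitespace
    // normalized and one command (operator plus its operands) per line.
    $relevant = [];
    foreach (explode("\n", $formatted) as $line) {
        $line = trim($line);
        // Keep only the commands needed to position and display text.
        if (preg_match('/(?:BT|ET|Tm|Td|TD|Tf|Tj|TJ|cm|q|Q)$/', $line)) {
            $relevant[] = $line;
        }
    }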

The guts of getText() have been moved to getTextArray() to reduce code duplication. getTextArray() now takes into account both the current graphics position cm and the scaling factors of the text matrix Tm when adding \n, \t and space " " whitespace for positioning. Positioning is only taken into account at the point of inserting text, rather than whenever a Tm or Td/TD is found.
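
A minimal sketch of the kind of calculation this implies, assuming the text matrix Tm and the graphics matrix cm are each tracked as the usual six numbers [a b c d e f] (all variable names here are assumptions for illustration, not the PR's code):

    // Effective vertical position: the Tm translation combined with the
    // current graphics-state (cm) translation.
    $curY = $tm[5] + $cm[5];
    // Effective font size: the nominal Tf size scaled by the text matrix.
    $effectiveFontSize = $fontSize * abs($tm[3]);
    // Whitespace is only decided here, at the point text is inserted, not
    // every time a Tm or Td/TD operator is encountered.
    if (null !== $prevY && ($prevY - $curY) > $effectiveFontSize) {
        $text .= "\n";
    }
    $prevY = $curY;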

getTextArray() now also treats Q and q as a stack of stored states rather than a single stored state. Both font Tf and graphics positioning cm are stored.
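
In pseudo-PHP, that behaviour amounts to something like this (a simplified sketch; the variable names are placeholders and the stored state here contains only the two items mentioned above):

    $stateStack = [];
    $state = ['font' => $currentFont, 'cm' => $currentCm];

    switch ($command) {
        case 'q': // save: push a copy of the current state onto the stack
            $stateStack[] = $state;
            break;
        case 'Q': // restore: pop the most recently saved state, if any
            if ([] !== $stateStack) {
                $state = array_pop($stateStack);
            }
            break;
    }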

The presence of ActualText BDC commands is also taken into account: the contents of an ActualText entry replace the text that would otherwise be output, in both content and position. This requires the new parseDictionary() method to reliably extract such entries, as well as any others PdfParser may take into account in the future.
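
Purely as an illustration of what is being picked up here (the regular expression and the $chunk variable below are simplifications, not the PR's actual parseDictionary() logic):

    // A marked-content section such as
    //   /Span <</ActualText (4th)>> BDC  (4) Tj (t) Tj (h) Tj  EMC
    // should contribute "4th" instead of the individual Tj fragments.
    if (preg_match('/<<\s*\/ActualText\s*\((.*?)\)\s*>>\s*BDC/s', $chunk, $m)) {
        $replacement = $m[1]; // used in place of the enclosed Tj/TJ output
    }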

Font.php

decodeText() in Font.php now takes into account the current text matrix font size and scale when considering whether or not to define strings of text as "words" that require spaces between them.
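
Something in the spirit of the following heuristic (the 0.25 factor and all variable names are placeholders, not values taken from the PR):

    // Treat a horizontal gap between two runs of glyphs as a word boundary
    // only when it is large relative to the effective font size.
    $effectiveSize = $fontSize * $horizontalScale;
    if ($gapWidth > $effectiveSize * 0.25) {
        $decoded .= ' ';
    }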

decodeContent() in Font.php now checks whether the string to decode starts with the UTF-16BE BOM and, if so, decodes it directly as Unicode.
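
Conceptually the check amounts to this (a sketch assuming mb_convert_encoding() is acceptable here, not the exact code added to decodeContent()):

    // The UTF-16BE byte-order mark is 0xFE 0xFF; if present, the rest of the
    // string is already Unicode and can be converted directly.
    if ("\xFE\xFF" === substr($text, 0, 2)) {
        return mb_convert_encoding(substr($text, 2), 'UTF-8', 'UTF-16BE');
    }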

Page.php

In Page.php, the addition of a "fake" BT command has been removed, since BT commands are now preserved in the parsed content stream.

A check has been added in getDataTm() to see whether there are any remaining texts from PDFObject::getTextArray() before proceeding, which prevents "undefined array key" PHP errors.
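
The guard has roughly this shape (the array and index names are illustrative only):

    // Skip entries for which no extracted text remains; this avoids
    // "undefined array key" errors on malformed or unusual input.
    if (!isset($extractedTexts[$k])) {
        continue;
    }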

ET commands are also prevented from resetting the font Tf, as PR #629 already did for BT commands.

Issues affected

- Resolves #219.
- Resolves #353.
- Resolves #398. Current v2.7.0 fixes text direction, but this PR fixes all spacing issues.
- Resolves #464. Removes duplicated text by examining ActualText commands.
- Resolves #474.
- Resolves #508.
- Resolves #528. Fixes spacing issues.
- Resolves #537.
- Resolves #564. Current v2.7.0 fixes text extraction, but this PR fixes spacing issues.
- Resolves #568. Fixes spacing issues.
- Resolves #575.
- Resolves #576.
- Resolves #585.
- Resolves #608. Fixes headings by taking into account the graphics position cm.
- Resolves #628.
- Resolves #637.

[^1]: This prevents non-document-stream data, such as the JPEG data in file '12249.pdf' from https://github.com/smalot/pdfparser/issues/458, from being passed to formatContent().

[^2]: https://ia801001.us.archive.org/1/items/pdf1.7/pdf_reference_1-7.pdf - Appendix A; https://archive.org/download/pdf320002008/PDF32000_2008.pdf - Annex A

GreyWyvern commented 1 year ago

Hrms. It seems all the Windows PHP checks are failing because the Fileinfo functions haven't been installed. They're installed by default on Linux, but the Installation manual says: 'Windows users must include the bundled php_fileinfo.dll DLL file in php.ini to enable this extension.' So Windows users have the DLL file, but I guess it's just not enabled in a "default" install?

Edit: This has been resolved by using a different method to detect whether a content stream is binary and can be safely ignored.
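
For illustration only, one extension-free way to make such a check looks like the following; this may not be exactly what the PR ended up using:

    // Streams that are not valid text (for example embedded JPEG data) fail
    // a strict encoding check and can be skipped before any text parsing.
    if (false === mb_detect_encoding($content, 'UTF-8', true)) {
        return ''; // nothing text-like to extract from this stream
    }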

For "PHPUnit coverage (PHP 7.4)" it is failing an assertion where it tests to see if PDFParser's memory usage is higher than a checked value, but with my edits, the memory usage is actually lower than it expects, so that's probably actually a win? :D

https://github.com/smalot/pdfparser/blob/f3a5a3ea20f1f7d4f70f909c70a6c9c0a4a6fc63/tests/PHPUnit/Integration/ParserTest.php#L379C22-L379C22

Edit: This has been resolved by checking for a fixed numerical value for memory use above the $baselineMemory instead of a multiple of it in the ParserTest::testRetainImageContentImpact() method.
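
In other words, the assertion now has roughly this shape (the 2 MB allowance below is a placeholder, not the actual figure committed to ParserTest):

    // Allow a fixed amount of extra memory above the baseline instead of a
    // multiple of it, which breaks down when the baseline itself drops.
    self::assertLessThan($baselineMemory + 2 * 1024 * 1024, $memoryAfterParsing);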

Edit: It's very possible that the ~~Fileinfo MIME-type mb_detect_encoding() check is what's reducing the memory use, since non-document-stream data is rejected before the various interpreting functions in PDFObject.php and Font.php try to work on it. As an example, there are 4 (I think?) streams that get passed to getSectionsText() from 12249.pdf in #458 and one of them is actually binary JPEG data.~~

k00ni commented 1 year ago

@GreyWyvern Thank you very much for all your work. I will see that I can give more feedback soon, and I hope that our community has time to comment too. Surprisingly, I can't mark this PR as a draft.

k00ni commented 1 year ago

However, since it parses document stream data in a completely new way, what I'm most interested in is whether or not it causes new errors in PDFs that aren't in the test suite. So I hope quite a few people decide to test drive this.

Can you elaborate a bit about which kind of PDFs are not there yet? We could "reproduce" some missing things by using unit tests instead.

GreyWyvern commented 1 year ago

Can you elaborate a bit about which kind of PDFs are not there yet? We could "reproduce" some missing things by using unit tests instead.

I just meant unknown PDFs "in-the-wild" in general. I run it on my own org's collection of PDFs (~400) for searching and it works fine. Almost all of them worked in v2.7.0 too, but those are PDFs all generated by one entity. There's just no telling, with a change this large, if there might be a PDF out there that works in v2.7.0 but not with this PR.

We can't really tell without people actually using it, so I think putting out a release candidate is a very good idea.

k00ni commented 1 year ago

I just meant unknown PDFs "in-the-wild" in general. I run it on my own org's collection of PDFs (~400) for searching and it works fine. Almost all of them worked in v2.7.0 too, but those are PDFs all generated by one entity. There's just no telling, with a change this large, if there might be a PDF out there that works in v2.7.0 but not with this PR.

I created a few PDFs using the following tools:

I found that in the Chromium-generated one there are too many white spaces added in the result of getText(). I marked it: https://github.com/smalot/pdfparser/pull/634/files#diff-50498d428341deceb2ce9d29159c86631d951c9260817d46c4f333fe7feecfabR311-R312

Does this help? I am not sure if that approach is going to work, but we could generate more PDFs using other PDF generators and check their results.

GreyWyvern commented 1 year ago

Does this help?

Oh yeah, this helps a lot! Thanks! Definitely some of the numbers need tweaking. I'll see if I can generate a couple myself.

GreyWyvern commented 1 year ago

I feel like this is getting close, but we should still keep throwing more PDFs at it for a while.

k00ni commented 1 year ago

What other PDF generators are worth using to generate PDFs?

GreyWyvern commented 1 year ago

This latest commit (5a3a39b) solves a spacing issue I found with a PDF generated by "macOS Version 10.14.6 (Build 18G3020) Quartz PDFContext", but I don't have macOS to generate such a file for /samples/. If someone could do that, especially one with large font sizes (like 20pt+), several newlines between titles and paragraphs, and superscript 'th' or 'rd' as in 4th or 3rd, that would be appreciated!

Reqrefusion commented 1 year ago

Wow, this is truly a 2.0. If you are looking for PDFs created in different ways, I recommend public documents (laws, regulations, announcements, etc.). Since I use PdfParser for these, I can guarantee that they are all created in completely different ways that you cannot imagine. Public announcements especially are a complete mess, coming from many different sources. Moreover, they are not subject to any copyright, which makes it possible to add more complex and diverse samples as tests.

k00ni commented 1 year ago

@Reqrefusion thank you for the hint and it's good to see you around. This update could definitely be a version 2.0, but we don't have to increase the major version, which is great, because all users using ^2 will get it and don't have to change their composer.json.

I made a few adaptations and additions:

Hope it helps.

k00ni commented 1 year ago

@GreyWyvern I took another look at some of your new functions. Can you mark all functions @internal which are only meant for internal use?

GreyWyvern commented 1 year ago

I am reasonably certain I haven't touched any of the image extraction code. And yes, these two tests also fail (expected: 2, actual: 0) in the currently released v2.7.0. They can probably be removed or worked into another image-focused PR.

Ah! This is due to an oversight in my previous PR #627 where it does not check far enough ahead for escaped backslashes. The PDF string in the document at this point is '\000\\\000'.

Current:

    /**
     * Decode string with octal-decoded chunks.
     */
    public static function decodeOctal(string $text): string
    {
        $parts = preg_split('/(?<!\\\\)(\\\\[0-7]{1,3})/s', $text, -1, \PREG_SPLIT_NO_EMPTY | \PREG_SPLIT_DELIM_CAPTURE);
        $text = '';

Fixed:

    /**
     * Decode string with octal-decoded chunks.
     */
    public static function decodeOctal(string $text): string
    {
        $parts = preg_split('/(?<![^\\\\]\\\\)(\\\\[0-7]{1,3})/s', $text, -1, \PREG_SPLIT_NO_EMPTY | \PREG_SPLIT_DELIM_CAPTURE);
        $text = '';
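
To see the difference on the string quoted above, a quick demonstration (the variables below are only for illustration):

    // Raw PDF string content: \000 (octal), \\ (escaped backslash), \000 (octal).
    $raw = '\000' . '\\\\' . '\000';

    $old = '/(?<!\\\\)(\\\\[0-7]{1,3})/s';
    $new = '/(?<![^\\\\]\\\\)(\\\\[0-7]{1,3})/s';

    // Old pattern: the second \000 stays glued to the escaped backslash and is
    // never decoded -> ['\000', '\\\000']
    print_r(preg_split($old, $raw, -1, \PREG_SPLIT_NO_EMPTY | \PREG_SPLIT_DELIM_CAPTURE));

    // Fixed pattern: both octal chunks are captured as delimiters
    // -> ['\000', '\\', '\000']
    print_r(preg_split($new, $raw, -1, \PREG_SPLIT_NO_EMPTY | \PREG_SPLIT_DELIM_CAPTURE));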

I can bundle it into this PR, or I can make a new PR for it if you like. For now, I've just slipped the fix into this PR. ;)

Hope it helps.

Sure, thanks! I will take a look at your other notes (unnecessary tabs and such) and see if they're fixable.

@GreyWyvern I took another look at some of your new functions. Can you mark all functions @internal which are only meant for internal use?

I created two major new functions for this PR: formatContent() and parseDictionary(). I actually think both could be useful for end users to know about, but OTOH, you do need the raw PDF document stream output in order to use either of them, so any use would be pretty niche. I'll rely on your judgement @k00ni, if you think they should be marked as @internal I'll do that.

k00ni commented 1 year ago

@GreyWyvern

I'll rely on your judgement @k00ni, if you think they should be marked as @internal I'll do that.

Please only mark the functions you added in this PR with @internal. If you are unsure, leave it as it is. I would like to keep our public API as small as possible for now, or at least not let it expand uncontrolled.

About the images: just remove the test code then, if you didn't change any image-related code.


I thought about the size of this PR and would like to know if there is a reasonable way to split it up, so we can start to integrate it iteratively. There is no rush here, but on the other hand there is only feedback if people can try it out in the wild (using release candidates). It is up to you @GreyWyvern, it's just a suggestion.

GreyWyvern commented 1 year ago

Please only mark the functions you added in this PR with @internal. If you are unsure, leave it as it is. I would like to keep our public API as small as possible for now, or at least not let it expand uncontrolled.

Will do! These functions are not really something a regular user would use, so @internal makes enough sense.

About the images: just remove the test code then, if you didn't change any image-related code.

Removed.

I thought about the size of this PR and would like to know if there is a reasonable way to split it up, so we can start to integrate it iteratively. There is no rush here, but on the other hand there is only feedback if people can try it out in the wild (using release candidates). It is up to you @GreyWyvern, it's just a suggestion.

Well... the main change in functionality is in the two functions formatContent() and getSectionsText(). Especially with the latter, this is an all-or-nothing switch. Without the new way the document stream is formatted by these two functions, the rest of the changes (especially getTextArray()) won't really work.

The BOM-related changes in Font.php, and all the changes in Page.php (except removing lines 403-404) are really the only changes that are essentially entirely unrelated to the meat of this PR. I will revert my change to Font::decodeOctal() in prep for the other PR I've made.

I think at most I would be able to split this into two PRs, the first of which would be all the additions that don't actually change execution flow (like adding formatContent() and parseDictionary() but not calling them anywhere). And then the ones that do affect execution flow second. Personally, that just feels like unnecessary extra paperwork to me. I would rather it be merged all at once, and then we do release candidates instead of a full release.

I'm not in a hurry. I'm happy to allow you all the time you need to review :)

k00ni commented 1 year ago

Personally, that just feels like unnecessary extra paperwork to me. I would rather it be merged all at once, and then we do release candidates instead of a full release.

Alright. If there are no further objections, we can merge it all at once.

What is your timetable for this PR?

CC @j0k3r @smalot

k00ni commented 11 months ago

@GreyWyvern A quick follow up: Are you planning any updates on this one in the near future? I won't have that much time in the next weeks/months most likely. I remember your suggestion to "throw further PDFs" at the code.

I suggest the following next steps:

This way your work gets out to more people and we can observe whether there are any remaining bugs in your code. I can help organize further steps regarding releases or issues.

WDYT?

GreyWyvern commented 11 months ago

WDYT?

I think this is a good plan. I have actually found a couple more PDFs that PdfParser has trouble with since I last visited, but they relate to embedded images rather than text extraction.

Putting out a release candidate will certainly help get the new code more testing. The most important thing to find are PDFs that v2.7.0 parses "correctly" while this new version does not. That's where the most tweaking information will come from. :)

I will mark this PR as Ready for review.

k00ni commented 11 months ago

@j0k3r any objections? I would merge it right away and proceed as suggested.

j0k3r commented 11 months ago

@k00ni This looks good to me.

k00ni commented 11 months ago

Big thanks to @GreyWyvern and all commenters.