smalot / pdfparser

PdfParser, a standalone PHP library, provides various tools to extract data from a PDF file.
GNU Lesser General Public License v3.0

Major Update to PDFObject.php + Ancillary #634

Closed: GreyWyvern closed this pull request 11 months ago

GreyWyvern commented 1 year ago

Okay, here it is! I fully expect this will take quite some time to review and merge and likely require many more commits before it's ready to go; please mark it as a draft if that fits better.

It fully passes all unit tests here (PHP 8.2.7), even a couple that were marked "linux only", from which I removed that restriction. Several existing unit test assertions have been altered simply because the update now parses document stream data differently, and thus generates arrays of commands differently. For example: BT commands (as well as several others) are now stored instead of discarded, and whitespace outside of (strings) and <<dictionaries>> is normalized.

However, since it parses document stream data in a completely new way, what I'm most interested in is whether or not it causes new errors in PDFs that aren't in the test suite. So I hope quite a few people decide to test drive this.


PDFObject.php

This is a major update to the PDFObject.php file. Where previously PdfParser would attempt to gather document stream data using a series of multiline regular expressions focused on BT ... ET blocks, this update adds a new function, formatContent(), which takes over from cleanContent() and considers the entire document stream, normalizing it in several steps so that it can be parsed line by line.

By using this system, it is much easier to examine and parse the document stream line by line, instead of relying on multiline PCRE extraction. getSectionsText() has been updated to do just this, stepping through the output of formatContent() line by line and returning an array of only the relevant commands needed to position and display text.
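
As a rough sketch of what the line-by-line approach looks like (the variable names and the operator filter below are simplified illustrations, not the PR's actual code):

    // Assumes $formatted holds the output of formatContent(): whitespace
    // normalized and one command (operator plus its operands) per line.
    $relevant = [];
    foreach (explode("\n", $formatted) as $line) {
        $line = trim($line);
        // Keep only the commands needed to position and display text.
        if (preg_match('/(?:BT|ET|Tm|Td|TD|Tf|Tj|TJ|cm|q|Q)$/', $line)) {
            $relevant[] = $line;
        }
    }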

The guts of getText() have been moved to getTextArray() to reduce code duplication. getTextArray() now takes into account both the current graphics position cm and the scaling factors of the text matrix Tm when adding \n, \t and space " " whitespace for positioning. Positioning is only taken into account at the point of inserting text, rather than whenever a Tm or Td/TD is found.
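
A minimal sketch of the kind of calculation this implies, assuming the text matrix Tm and the graphics matrix cm are each tracked as the usual six numbers [a b c d e f] (all variable names here are assumptions for illustration, not the PR's code):

    // Effective vertical position: the Tm translation combined with the
    // current graphics-state (cm) translation.
    $curY = $tm[5] + $cm[5];
    // Effective font size: the nominal Tf size scaled by the text matrix.
    $effectiveFontSize = $fontSize * abs($tm[3]);
    // Whitespace is only decided here, at the point text is inserted, not
    // every time a Tm or Td/TD operator is encountered.
    if (null !== $prevY && ($prevY - $curY) > $effectiveFontSize) {
        $text .= "\n";
    }
    $prevY = $curY;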

getTextArray() now also treats Q and q as a stack of stored states rather than a single stored state. Both font Tf and graphics positioning cm are stored.
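
In pseudo-PHP, that behaviour amounts to something like this (a simplified sketch; the variable names are placeholders and the stored state here contains only the two items mentioned above):

    $stateStack = [];
    $state = ['font' => $currentFont, 'cm' => $currentCm];

    switch ($command) {
        case 'q': // save: push a copy of the current state onto the stack
            $stateStack[] = $state;
            break;
        case 'Q': // restore: pop the most recently saved state, if any
            if ([] !== $stateStack) {
                $state = array_pop($stateStack);
            }
            break;
    }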

The presence of ActualText BDC commands is also taken into account: the contents of an ActualText entry replace the text that would otherwise be output, in both content and position. This requires the new parseDictionary() method to reliably extract such entries, as well as any others PdfParser may take into account in the future.
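
Purely as an illustration of what is being picked up here (the regular expression and the $chunk variable below are simplifications, not the PR's actual parseDictionary() logic):

    // A marked-content section such as
    //   /Span <</ActualText (4th)>> BDC  (4) Tj (t) Tj (h) Tj  EMC
    // should contribute "4th" instead of the individual Tj fragments.
    if (preg_match('/<<\s*\/ActualText\s*\((.*?)\)\s*>>\s*BDC/s', $chunk, $m)) {
        $replacement = $m[1]; // used in place of the enclosed Tj/TJ output
    }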

Font.php

decodeText() in Font.php now takes into account the current text matrix font size and scale when considering whether or not to define strings of text as "words" that require spaces between them.
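
Something in the spirit of the following heuristic (the 0.25 factor and all variable names are placeholders, not values taken from the PR):

    // Treat a horizontal gap between two runs of glyphs as a word boundary
    // only when it is large relative to the effective font size.
    $effectiveSize = $fontSize * $horizontalScale;
    if ($gapWidth > $effectiveSize * 0.25) {
        $decoded .= ' ';
    }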

decodeContent() in Font.php now checks whether the string to decode starts with the UTF-16BE BOM and, if so, decodes it directly as Unicode.
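
Conceptually the check amounts to this (a sketch assuming mb_convert_encoding() is acceptable here, not the exact code added to decodeContent()):

    // The UTF-16BE byte-order mark is 0xFE 0xFF; if present, the rest of the
    // string is already Unicode and can be converted directly.
    if ("\xFE\xFF" === substr($text, 0, 2)) {
        return mb_convert_encoding(substr($text, 2), 'UTF-8', 'UTF-16BE');
    }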

Page.php

In Page.php, the addition of a "fake" BT command has been removed, since BT commands are now preserved in the parsed content stream.

A check has been added in getDataTm() to see whether there are any remaining texts from PDFObject::getTextArray() before proceeding, which prevents "undefined array key" PHP errors.
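
The guard has roughly this shape (the array and index names are illustrative only):

    // Skip entries for which no extracted text remains; this avoids
    // "undefined array key" errors on malformed or unusual input.
    if (!isset($extractedTexts[$k])) {
        continue;
    }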

ET commands are also prevented from resetting the font Tf, as PR #629 already did for BT commands.

Issues affected

- Resolves #219.
- Resolves #353.
- Resolves #398. Current v2.7.0 fixes text direction, but this PR fixes all spacing issues.
- Resolves #464. Removes duplicated text by examining ActualText commands.
- Resolves #474.
- Resolves #508.
- Resolves #528. Fixes spacing issues.
- Resolves #537.
- Resolves #564. Current v2.7.0 fixes text extraction, but this PR fixes spacing issues.
- Resolves #568. Fixes spacing issues.
- Resolves #575.
- Resolves #576.
- Resolves #585.
- Resolves #608. Fixes headings by taking into account the graphics position cm.
- Resolves #628.
- Resolves #637.

[^1]: This prevents non-document-stream data, such as the JPEG data in file '12249.pdf' from https://github.com/smalot/pdfparser/issues/458, from being passed to formatContent().

[^2]: https://ia801001.us.archive.org/1/items/pdf1.7/pdf_reference_1-7.pdf - Appendix A; https://archive.org/download/pdf320002008/PDF32000_2008.pdf - Annex A

GreyWyvern commented 1 year ago

Hrms. It seems all the Windows PHP checks are failing because the Fileinfo functions haven't been installed. They're installed by default on Linux, but the Installation manual says: 'Windows users must include the bundled php_fileinfo.dll DLL file in php.ini to enable this extension.' So Windows users have the DLL file, but I guess it's just not enabled in a "default" install?

Edit: This has been resolved by using a different method to detect whether a content stream is binary and can be safely ignored.
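
For illustration only, one extension-free way to make such a check looks like the following; this may not be exactly what the PR ended up using:

    // Streams that are not valid text (for example embedded JPEG data) fail
    // a strict encoding check and can be skipped before any text parsing.
    if (false === mb_detect_encoding($content, 'UTF-8', true)) {
        return ''; // nothing text-like to extract from this stream
    }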

For "PHPUnit coverage (PHP 7.4)" it is failing an assertion where it tests to see if PDFParser's memory usage is higher than a checked value, but with my edits, the memory usage is actually lower than it expects, so that's probably actually a win? :D

https://github.com/smalot/pdfparser/blob/f3a5a3ea20f1f7d4f70f909c70a6c9c0a4a6fc63/tests/PHPUnit/Integration/ParserTest.php#L379C22-L379C22

Edit: This has been resolved by checking for a fixed numerical value for memory use above the $baselineMemory instead of a multiple of it in the ParserTest::testRetainImageContentImpact() method.
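
In other words, the assertion now has roughly this shape (the 2 MB allowance below is a placeholder, not the actual figure committed to ParserTest):

    // Allow a fixed amount of extra memory above the baseline instead of a
    // multiple of it, which breaks down when the baseline itself drops.
    self::assertLessThan($baselineMemory + 2 * 1024 * 1024, $memoryAfterParsing);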

Edit: It's very possible that the ~~Fileinfo MIME-type mb_detect_encoding() check is what's reducing the memory use, since non-document-stream data is rejected before the various interpreting functions in PDFObject.php and Font.php try to work on it. As an example, there are 4 (I think?) streams that get passed to getSectionsText() from 12249.pdf in #458 and one of them is actually binary JPEG data.~~

k00ni commented 1 year ago

@GreyWyvern Thank you very much for all your work. I will see that I can give more feedback soon, and I hope that our community has time to comment too. Surprisingly, I can't mark this PR as a draft.

k00ni commented 1 year ago

However, since it parses document stream data in a completely new way, what I'm most interested in is whether or not it causes new errors in PDFs that aren't in the test suite. So I hope quite a few people decide to test drive this.

Can you elaborate a bit about which kind of PDFs are not there yet? We could "reproduce" some missing things by using unit tests instead.

GreyWyvern commented 1 year ago

Can you elaborate a bit about which kind of PDFs are not there yet? We could "reproduce" some missing things by using unit tests instead.

I just meant unknown PDFs "in-the-wild" in general. I run it on my own org's collection of PDFs (~400) for searching and it works fine. Almost all of them worked in v2.7.0 too, but those are PDFs all generated by one entity. There's just no telling, with a change this large, if there might be a PDF out there that works in v2.7.0 but not with this PR.

We can't really tell without people actually using it, so I think putting out a release candidate is a very good idea.

k00ni commented 1 year ago

I just meant unknown PDFs "in-the-wild" in general. I run it on my own org's collection of PDFs (~400) for searching and it works fine. Almost all of them worked in v2.7.0 too, but those are PDFs all generated by one entity. There's just no telling, with a change this large, if there might be a PDF out there that works in v2.7.0 but not with this PR.

I created a few PDFs using the following tools:

I found that in the Chromium-generated one there are too many white spaces added in the result of getText(). I marked it: https://github.com/smalot/pdfparser/pull/634/files#diff-50498d428341deceb2ce9d29159c86631d951c9260817d46c4f333fe7feecfabR311-R312

Does this help? I am not sure if that approach is going to work, but we could generate more PDFs using other PDF generators and check their results.

GreyWyvern commented 1 year ago

Does this help?

Oh yeah, this helps a lot! Thanks! Definitely some of the numbers need tweaking. I'll see if I can generate a couple myself.

GreyWyvern commented 1 year ago

I feel like this is getting close, but we should still keep throwing more PDFs at it for a while.

k00ni commented 1 year ago

What other PDF generators are worth using to generate PDFs?

GreyWyvern commented 1 year ago

This latest commit (5a3a39b) solves a spacing issue I found with a PDF generated by "macOS Version 10.14.6 (Build 18G3020) Quartz PDFContext", but I don't have macOS to generate such a file for /samples/. If someone could do that, especially one with large font sizes (like 20pt+), several newlines between titles and paragraphs, and superscript 'th' or 'rd' as in 4th or 3rd, that would be appreciated!

Reqrefusion commented 1 year ago

Wow, this is truly a 2.0. If you are looking for PDFs created in different ways, I recommend public documents (laws, regulations, announcements, etc.). Since I use PdfParser for these, I can guarantee that they are all created in completely different ways that you cannot imagine. Public announcements especially are a complete mess, coming from many different sources. Moreover, they are not subject to any copyright, which makes it possible to add more complex and diverse samples as tests.

k00ni commented 1 year ago

@Reqrefusion thank you for the hint and it's good to see you around. This update could definitely be a version 2.0, but we don't have to increase the major version, which is great, because all users using ^2 will get it and don't have to change their composer.json.

I made a few adaptations and additions:

Hope it helps.

k00ni commented 1 year ago

@GreyWyvern I took another look at some of your new functions. Can you mark all functions @internal which are only meant for internal use?

GreyWyvern commented 1 year ago

I am reasonably certain I haven't touched any of the image extraction code. And yes, these two tests also fail (expected: 2, actual: 0) in the currently released v2.7.0. They can probably be removed or worked into another image-focused PR.

Ah! This is due to an oversight in my previous PR #627 where it does not check far enough ahead for escaped backslashes. The PDF string in the document at this point is '\000\\\000'.

Current:

    /**
     * Decode string with octal-decoded chunks.
     */
    public static function decodeOctal(string $text): string
    {
        $parts = preg_split('/(?<!\\\\)(\\\\[0-7]{1,3})/s', $text, -1, \PREG_SPLIT_NO_EMPTY | \PREG_SPLIT_DELIM_CAPTURE);
        $text = '';

Fixed:

    /**
     * Decode string with octal-decoded chunks.
     */
    public static function decodeOctal(string $text): string
    {
        $parts = preg_split('/(?<![^\\\\]\\\\)(\\\\[0-7]{1,3})/s', $text, -1, \PREG_SPLIT_NO_EMPTY | \PREG_SPLIT_DELIM_CAPTURE);
        $text = '';
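
To see the difference on the string quoted above, a quick demonstration (the variables below are only for illustration):

    // Raw PDF string content: \000 (octal), \\ (escaped backslash), \000 (octal).
    $raw = '\000' . '\\\\' . '\000';

    $old = '/(?<!\\\\)(\\\\[0-7]{1,3})/s';
    $new = '/(?<![^\\\\]\\\\)(\\\\[0-7]{1,3})/s';

    // Old pattern: the second \000 stays glued to the escaped backslash and is
    // never decoded -> ['\000', '\\\000']
    print_r(preg_split($old, $raw, -1, \PREG_SPLIT_NO_EMPTY | \PREG_SPLIT_DELIM_CAPTURE));

    // Fixed pattern: both octal chunks are captured as delimiters
    // -> ['\000', '\\', '\000']
    print_r(preg_split($new, $raw, -1, \PREG_SPLIT_NO_EMPTY | \PREG_SPLIT_DELIM_CAPTURE));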

I can bundle it into this PR, or I can make a new PR for it if you like. For now, I've just slipped the fix into this PR. ;)

Hope it helps.

Sure, thanks! I will take a look at your other notes (unnecessary tabs and such) and see if they're fixable.

@GreyWyvern I took another look at some of your new functions. Can you mark all functions @internal which are only meant for internal use?

I created two major new functions for this PR: formatContent() and parseDictionary(). I actually think both could be useful for end users to know about, but OTOH, you do need the raw PDF document stream output in order to use either of them, so any use would be pretty niche. I'll rely on your judgement @k00ni, if you think they should be marked as @internal I'll do that.

k00ni commented 1 year ago

@GreyWyvern

I'll rely on your judgement @k00ni, if you think they should be marked as @internal I'll do that.

Please only mark the functions you added in this PR with @internal. If you are unsure, leave it as it is. I would like to keep our public API as small as possible for now, or at least not let it expand uncontrolled.

About the images: just remove the test code then, if you didn't change any image-related code.


I thought about the size of this PR and would like to know if there is a reasonable way to split it up, so we can start to integrate it iteratively. There is no rush here, but on the other hand there is only feedback if people can try it out in the wild (using release candidates). It is up to you @GreyWyvern, it's just a suggestion.

GreyWyvern commented 1 year ago

Please only mark the functions you added in this PR with @internal. If you are unsure, leave it as it is. I would like to keep our public API as small as possible for now, or at least not let it expand uncontrolled.

Will do! These functions are not really something a regular user would use, so @internal makes enough sense.

About the images: just remove the test code then, if you didn't change any image-related code.

Removed.

I thought about the size of this PR and would like to know if there is a reasonable way to split it up, so we can start to integrate it iteratively. There is no rush here, but on the other hand there is only feedback if people can try it out in the wild (using release candidates). It is up to you @GreyWyvern, it's just a suggestion.

Well... the main change in functionality is in the two functions formatContent() and getSectionsText(). Especially with the latter, this is an all-or-nothing switch. Without the new way the document stream is formatted by these two functions, the rest of the changes (especially getTextArray()) won't really work.

The BOM-related changes in Font.php, and all the changes in Page.php (except removing lines 403-404) are really the only changes that are essentially entirely unrelated to the meat of this PR. I will revert my change to Font::decodeOctal() in prep for the other PR I've made.

I think at most I would be able to split this into two PRs, the first of which would be all the additions that don't actually change execution flow (like adding formatContent() and parseDictionary() but not calling them anywhere). And then the ones that do affect execution flow second. Personally, that just feels like unnecessary extra paperwork to me. I would rather it be merged all at once, and then we do release candidates instead of a full release.

I'm not in a hurry. I'm happy to allow you all the time you need to review :)

k00ni commented 1 year ago

Personally, that just feels like unnecessary extra paperwork to me. I would rather it be merged all at once, and then we do release candidates instead of a full release.

Alright. If there are no further objections, we can merge it all at once.

What is your timetable for this PR?

CC @j0k3r @smalot

k00ni commented 11 months ago

@GreyWyvern A quick follow up: Are you planning any updates on this one in the near future? I won't have that much time in the next weeks/months most likely. I remember your suggestion to "throw further PDFs" at the code.

I suggest the following next steps:

This way your work gets out to more people and we can observe whether there are any remaining bugs in your code. I can help organize further steps regarding releases or issues.

WDYT?

GreyWyvern commented 11 months ago

WDYT?

I think this is a good plan. I have actually found a couple more PDFs that PdfParser has trouble with since I last visited, but they relate to embedded images rather than text extraction.

Putting out a release candidate will certainly help get the new code more testing. The most important thing to find are PDFs that v2.7.0 parses "correctly" while this new version does not. That's where the most tweaking information will come from. :)

I will mark this PR as Ready for review.

k00ni commented 11 months ago

@j0k3r any objections? I would merge it right away and proceed as suggested.

j0k3r commented 11 months ago

@k00ni This looks good to me.

k00ni commented 11 months ago

Big thanks to @GreyWyvern and all commenters.