Closed GreyWyvern closed 11 months ago
Hrms. It seems all the Windows PHP checks are failing because the Fileinfo
functions haven't been installed. They're installed by default on Linux, but the Installation manual says: 'Windows users must include the bundled php_fileinfo.dll
DLL file in php.ini
to enable this extension.' So Windows users have the DLL file, but I guess it's just not enabled in a "default" install?
Edit: This has been resolved by using a different method to detect whether a content stream is binary and can be safely ignored.
For "PHPUnit coverage (PHP 7.4)" it is failing an assertion where it tests to see if PDFParser's memory usage is higher than a checked value, but with my edits, the memory usage is actually lower than it expects, so that's probably actually a win? :D
Edit: This has been resolved by checking for a fixed numerical value for memory use above the $baselineMemory
instead of a multiple of it in the ParserTest::testRetainImageContentImpact()
method.
Edit: It's very possible that the ~~ Fileinfo
MIME-typemb_detect_encoding()
check is what's reducing the memory use since non-document-stream data is rejected before the various interpreting functions in PHPObject.php and Font.php tries to work on it. As an example, There are 4 (I think?) streams that get passed to getSectionsText()
from 12249.pdf in #458 and one of them is actually binary JPEG data.~~
@GreyWyvern Thank you very much for all your work. I will see how I manage to give more feedback soon, but I hope that our community has time to comment too. Surprisingly I can't mark this PR as draft.
However, since it parses document stream data in a completely new way, what I'm most interested in is whether or not it causes new errors in PDFs that aren't in the test suite. So I hope quite a few people decide to test drive this.
Can you elaborate a bit about which kind of PDFs are not there yet? We could "reproduce" some missing things by using unit tests instead.
Can you elaborate a bit about which kind of PDFs are not there yet? We could "reproduce" some missing things by using unit tests instead.
I just meant unknown PDFs "in-the-wild" in general. I run it on my own org's collection of PDFs (~400) for searching and it works fine. Almost all of them worked in v2.7.0 too, but those are PDFs all generated by one entity. There's just no telling, with a change this large, if there might be a PDF out there that works in v2.7.0 but not with this PR.
We can't really tell without people actually using it, so I think putting out a release candidate is a very good idea.
I just meant unknown PDFs "in-the-wild" in general. I run it on my own org's collection of PDFs (~400) for searching and it works fine. Almost all of them worked in v2.7.0 too, but those are PDFs all generated by one entity. There's just no telling, with a change this large, if there might be a PDF out there that works in v2.7.0 but not with this PR.
I created a few PDFs using the following tools:
I found in the Chromium one that there are too many white spaces added in the result of getText
. I marked it: https://github.com/smalot/pdfparser/pull/634/files#diff-50498d428341deceb2ce9d29159c86631d951c9260817d46c4f333fe7feecfabR311-R312
Does this help? I am not sure if that approach gonna work, but we could generate more PDFs using other PDF generators and check their results.
Does this help?
Oh yeah, this helps a lot! Thanks! Definitely some of the numbers need tweaking. I'll see if I can generate a couple myself.
I feel like this is getting close, but we should still keep throwing more PDFs at it for a while.
What other PDF generators are worth using to generate PDFs?
This latest commit (5a3a39b) solves a spacing issue I found with a PDF generated by "macOS Version 10.14.6 (Build 18G3020) Quartz PDFContext" but I don't have macOS to generate such a file for /samples/. If someone could do that, especially one with large font-sizes (like 20pt+) several newlines between titles and paragraphs and superscript 'th' or 'rd' as in 4th or 3rd, etc. that would be appreciated!
Wow this is truly 2.0. If you are looking for PDFs created in different ways, I recommend public documents (Laws, regulations, announcements, etc.). Since I use PDFparser for these, I can guarantee that all of them are created in completely different ways that you cannot imagine. Especially public announcements are a complete mess from different sources. Moreover, they are not the subject of any copyright. This makes it possible to add more complex and diverse samples as tests.
@Reqrefusion thank you for the hint and its good to see you around. This update could definitely be a version 2.0, but we don't have to increase major version, which is great, because all users using ^2
will get it and don't have to change their composer.json
.
I made a few adaptions and additions:
grouped-by-generator
Hope it helps.
@GreyWyvern I took another look at some of your new functions. Can you mark all functions @internal
which are only meant for internal use?
Added further PDFs, generated by generators we might not cover yet. Also added tests for each of these PDFs. I let a few tests fail to outline certain areas in question (images + encoding).
- Two tests will fail because of images, which were not found (like https://github.com/smalot/pdfparser/pull/634/files#diff-4a0191f73e0276217a40bb93e3ce61b37d440b108b90f6d6e61f5bf53d00cb08R140-R143)
I am reasonably certain I haven't touched any of the image extraction code. And yes, these two tests also fail (expected: 2, actual: 0) in the currently released v2.7.0. They can probably be removed or worked into another image-focused PR.
- One test will fail because of some decoding/encoding issue (I assume) - https://github.com/smalot/pdfparser/pull/634/files#diff-4a0191f73e0276217a40bb93e3ce61b37d440b108b90f6d6e61f5bf53d00cb08R227
Ah! This is due to an oversight in my previous PR #627 where it does not check far enough ahead for escaped backslashes. The PDF string in the document at this point is '\000\\\000
'.
Current:
/**
* Decode string with octal-decoded chunks.
*/
public static function decodeOctal(string $text): string
{
$parts = preg_split('/(?<!\\\\)(\\\\[0-7]{1,3})/s', $text, -1, \PREG_SPLIT_NO_EMPTY | \PREG_SPLIT_DELIM_CAPTURE);
$text = '';
Fixed:
/**
* Decode string with octal-decoded chunks.
*/
public static function decodeOctal(string $text): string
{
$parts = preg_split('/(?<![^\\\\]\\\\)(\\\\[0-7]{1,3})/s', $text, -1, \PREG_SPLIT_NO_EMPTY | \PREG_SPLIT_DELIM_CAPTURE);
$text = '';
I can bundle it into this PR, or I can make a new PR for it if you like. I just slipped the fix into this PR. ;)
Hope it helps.
Sure, thanks! I will take a look at your other notes (unnecessary tabs and such) and see if they're fixable.
@GreyWyvern I took another look at some of your new functions. Can you mark all functions
@internal
which are only meant for internal use?
I created two major new functions for this PR: formatContent()
and parseDictionary()
. I actually think both could be useful for end users to know about, but OTOH, you do need the raw PDF document stream output in order to use either of them, so any use would be pretty niche. I'll rely on your judgement @k00ni, if you think they should be marked as @internal
I'll do that.
@GreyWyvern
I'll rely on your judgement @k00ni, if you think they should be marked as
@internal
I'll do that.
Please only mark the functions you added in this PR with @internal
. If you are unsure, leave it as it is. I would like to keep our public API as small as possible for now or at least not let it expand it uncontrolled.
About the images: Just remove test-code then, if you didn't change any image related code.
I thought about the size of this PR and would like to know, if there is a reasonable way to split this PR up, so we can start to integrate it iteratively? There is no rush here, but on the other hand there is only feedback if people can try it out in the wild (using release candidates). It is up to you @GreyWyvern, its just a suggestion.
Please only mark the functions you added in this PR with
@internal
. If you are unsure, leave it as it is. I would like to keep our public API as small as possible for now or at least not let it expand it uncontrolled.
Will do! These functions are not really something a regular user would use, so @internal
makes enough sense.
About the images: Just remove test-code then, if you didn't change any image related code.
Removed.
I thought about the size of this PR and would like to know, if there is a reasonable way to split this PR up, so we can start to integrate it iteratively? There is no rush here, but on the other hand there is only feedback if people can try it out in the wild (using release candidates). It is up to you @GreyWyvern, its just a suggestion.
Well... the main change of function is in the two functions formatContent()
and getSectionsText()
. Especially with the latter, this is an all-or-nothing switch. Without the new way the document stream is formatted by these two functions, the rest of the changes (especially getTextArray()
) won't really work.
The BOM-related changes in Font.php, and all the changes in Page.php (except removing lines 403-404) are really the only changes that are essentially entirely unrelated to the meat of this PR. I will revert my change to Font::decodeOctal()
in prep for the other PR I've made.
I think at most I would be able to split this into two PRs, the first of which would be all the additions that don't actually change execution flow (like adding formatContent()
and parseDictionary()
but not calling them anywhere). And then the ones that do affect execution flow second. Personally, that just feels like unnecessary extra paperwork to me. I would rather it be merged all at once, and then we do release candidates instead of a full release.
I'm not in a hurry. I'm happy to allow you all the time you need to review :)
Personally, that just feels like unnecessary extra paperwork to me. I would rather it be merged all at once, and then we do release candidates instead of a full release.
Alright. If there are no further objections, we can merge it all at once.
How is your timetable for this PR?
CC @j0k3r @smalot
@GreyWyvern A quick follow up: Are you planning any updates on this one in the near future? I won't have that much time in the next weeks/months most likely. I remember your suggestion to "throw further PDFs" at the code.
I suggest the following next steps:
This way your work gets out to more people and we can observe, if there are any remaining bugs in your code. I can help organize further steps regarding releases or issues.
WDYT?
WDYT?
I think this is a good plan. I have actually found a couple more PDFs that PdfParser has trouble with since I last visited, but related to embedded images rather than text-extraction.
Putting out a release candidate will certainly help get the new code more testing. The most important thing to find are PDFs that v2.7.0 parses "correctly" while this new version does not. That's where the most tweaking information will come from. :)
I will mark this PR as Ready for review.
@j0k3r any objections? I would merge it right away and proceed as suggested.
@k00ni This is good to me
Big thanks to @GreyWyvern and all commentators.
Okay, here it is! I fully expect this will take quite some time to review and merge and likely require many more commits before it's ready to go; please mark it as a draft if that fits better.
It fully passes all unit tests here (PHP 8.2.7), even a couple that were marked "linux only" from which I removed that criteria. Several existing unit test assertions have been altered simply because of the way the update now parses document stream data differently, and thus generates arrays of commands differently. As examples:
BT
commands (as well as several others) are now stored instead of discarded, and outside of(strings)
and<<dictionaries>>
whitespace is normalized.However, since it parses document stream data in a completely new way, what I'm most interested in is whether or not it causes new errors in PDFs that aren't in the test suite. So I hope quite a few people decide to test drive this.
PHPObject.php
This is a major update to the PHPObject.php file. Where previously PdfParser would attempt to gather document stream data using a series of multiline regular expressions focusing on
BT ... ET
blocks, this updatechanges the behaviour ofadds a new functioncleanContent()
formatContent()
that considers the entire document stream. It takes the following steps:(strings)
<<dictionaries>>
<<dictionaries>>
(strings)
as\n
and\r
and restore them as wellBy using this system, it is then much easier to examine and parse the document stream in a line-by-line manner, instead of multiline PCRE extraction.
getSectionsText()
has been updated to do just this, stepping through the output ofcleanContent()
formatContent()
line-by-line and returning an array of only the relevant commands needed to position and display text.The guts of
getText()
have been moved togetTextArray()
to reduce code duplication.getTextArray()
now takes into account both the current graphics positioncm
and the scaling factors of the text matrixTm
when adding \n, \t and space " " whitespace for positioning. Positioning is only taken into account at the point of inserting text, rather than whenever aTm
orTd
/TD
is found.getTextArray()
now also treatsQ
andq
as a stack of stored states rather than a single stored state. Both fontTf
and graphics positioningcm
are stored.The presence of
ActualText
BDC
commands is also taken into account and the contents of theActualText
will replace the formerly output text in both content and position. This requires the newparseDictionary()
method to reliably extract such commands as well as any others PdfParser may take into account in the future.Font.php
decodeText()
in Font.php now takes into account the currenttext matrixfont size and scale when considering whether or not to define strings of text as "words" that require spaces between them.In
decodeContent()
in Font.php add a check to see if the string to decode has the UTF-16BE BOM and decode it directly as Unicode if so.Page.php
In Page.php remove the addition of a "fake" BT command as the content stream now records them.
Add a check to see if there are remaining texts to use from
PDFObject::getTextArray()
before proceeding ingetDataTm()
which prevents "undefined array key" PHP errors.Also prevent
ET
commands from resetting the fontTf
as PR #629 did forBT
commands.Issues affected
Resolves #219. Resolves #353. Resolves #398. Current v2.7.0 fixes text direction, but this PR fixes all spacing issues. Resolves #464. Removes duplicated text by examining
ActualText
commands. Resolves #474. Resolves #508. Resolves #528. Fixes spacing issues. Resolves #537. Resolves #564. Current v2.7.0 fixes text extraction, but this PR fixes spacing issues. Resolves #568. Fixes spacing issues. Resolves #575. Resolves #576. Resolves #585. Resolves #608. Fixes headings by taking into account the graphics positioncm
. Resolves #628. Resolves #637.[^1]: This prevents non-document-stream data from being passed to
cleanContent()
formatContent()
such as JPEG data in file '12249.pdf' from https://github.com/smalot/pdfparser/issues/458 [^2]: https://ia801001.us.archive.org/1/items/pdf1.7/pdf_reference_1-7.pdf - Appendix A; https://archive.org/download/pdf320002008/PDF32000_2008.pdf - Annex A