completely different output for table data (2.7.0 vs 2.8.0)

andus4n commented 7 months ago

PHP Version: 7.4 / 8.2 (same output)
PDFParser Version: 2.7.0 vs 2.8.0

Description:

I've been using this library for more than a year now, and until version 2.8.0 I didn't have a single issue with it. After updating to 2.8.0, I'm getting completely different output for the same PDF file. Unfortunately, this output can't be parsed to extract the data I'm interested in.

PDF input

Expected output & actual output

2.7.0 (this is ok and can easily be parsed)

2.8.0 (this can't be parsed)

Code

file_put_contents('./test2.dat', (new \Smalot\PdfParser\Parser())->parseFile('./invoice.pdf')->getText());

andus4n commented 7 months ago

Never mind, I found a way to make it work with 2.8.0, but this should still be investigated.

k00ni commented 7 months ago

CC @GreyWyvern you may be interested in this.

GreyWyvern commented 7 months ago

Unfortunately, this results from the new algorithm in 2.8.0 being more exact about spacing and line feeds. It improves the text extracted from ordinary paragraphs, but the "logic" of text in tables suffers. :|

You can see that 2.8.0 is putting newlines in the output exactly where it sees them, and when the document moves the cursor back up to the line above, but the next cell over, it also interprets this to be where a newline should be added.

It's only because 2.7.0 was very lenient with spacing (the newlines in the cells are not enough to trigger a newline in the output) that the resulting text appears more "logical". I'm not sure how this would be fixed definitively, but we could:

- Offer a user setting that makes detection of newlines more like 2.7.0; however, this would affect text outside of tables as well.
- Could we possibly detect if we're in a table? If so, we could change the spacing rules for text encountered in there. This is probably a long shot though.
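
To make the table problem concrete, here is a minimal sketch of the kind of post-processing a caller could do on their side. It is not the 2.7.0 or 2.8.0 algorithm, and `groupFragmentsIntoRows()`, the `x`/`y`/`text` array shape, and the `$tolerance` parameter are all hypothetical names chosen for illustration. It assumes you already have text fragments with page coordinates (the library's `Page::getDataTm()` may be a source for these, but that is an assumption not verified here). Fragments whose vertical positions differ by less than the tolerance are treated as one row, regardless of the order in which the PDF emitted them.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical helper, NOT part of smalot/pdfparser.
 * Groups positioned text fragments into rows by their y coordinate.
 *
 * @param array<int, array{x: float|int, y: float|int, text: string}> $fragments
 * @param float $tolerance Maximum vertical distance from the first fragment
 *                         of a row for a fragment to count as the same row.
 * @return string[] One tab-separated line per detected row.
 */
function groupFragmentsIntoRows(array $fragments, float $tolerance = 2.0): array
{
    // PDF y coordinates grow upward, so sort descending to go top-to-bottom.
    usort($fragments, static fn (array $a, array $b): int => $b['y'] <=> $a['y']);

    $rows = [];
    $current = [];
    $rowY = null;
    foreach ($fragments as $fragment) {
        // A vertical jump larger than the tolerance closes the current row.
        if ($rowY !== null && abs($rowY - $fragment['y']) > $tolerance) {
            $rows[] = $current;
            $current = [];
        }
        if ($current === []) {
            $rowY = $fragment['y']; // the first fragment anchors the row
        }
        $current[] = $fragment;
    }
    if ($current !== []) {
        $rows[] = $current;
    }

    $lines = [];
    foreach ($rows as $row) {
        // Within a row, order cells left-to-right.
        usort($row, static fn (array $a, array $b): int => $a['x'] <=> $b['x']);
        $lines[] = implode("\t", array_column($row, 'text'));
    }

    return $lines;
}

// Fragments in the order a column-by-column table might emit them.
$lines = groupFragmentsIntoRows([
    ['x' => 10,  'y' => 700, 'text' => 'Item'],
    ['x' => 10,  'y' => 680, 'text' => 'Widget'],
    ['x' => 100, 'y' => 700, 'text' => 'Price'],
    ['x' => 100, 'y' => 679, 'text' => '9.99'],
]);
echo implode("\n", $lines), "\n";
```

A fixed tolerance is a blunt instrument: cells that wrap onto several lines would still be split into several rows, which is part of why detecting tables reliably is hard.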

andus4n commented 7 months ago

> Could we possibly detect if we're in a table? If so, we could change the spacing rules for text encountered in there. This is probably a long-shot though.

This sounds pretty good, but is it even possible with PDFs? Also, I have a slightly off-topic suggestion: it would be great if you could implement some kind of line-by-line stream (like a generator) for getText, so that the whole text doesn't have to be loaded into memory at once.
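
As a rough illustration of the generator idea, the sketch below yields lines one page at a time instead of building a single large string. `iterateTextLines()` is a hypothetical helper, not something the library provides. It accepts any iterable of objects that expose a `getText(): string` method; with this library that would presumably be the page objects returned by `$pdf->getPages()`, though that pairing is an assumption here. The PDF itself is still parsed in full, so only the text assembly becomes incremental.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical helper, NOT part of smalot/pdfparser.
 * Lazily yields text lines, holding only one page's text at a time.
 *
 * @param iterable<object> $pages Objects exposing getText(): string.
 * @return Generator<int, string>
 */
function iterateTextLines(iterable $pages): Generator
{
    foreach ($pages as $page) {
        $text = $page->getText();
        $length = strlen($text);
        $offset = 0;

        // Walk the string with strpos instead of explode() to avoid
        // building an array of all lines for the page.
        while ($offset < $length) {
            $end = strpos($text, "\n", $offset);
            if ($end === false) {
                $end = $length; // last line without a trailing newline
            }
            // Strip a trailing "\r" so CRLF line endings are handled too.
            yield rtrim(substr($text, $offset, $end - $offset), "\r");
            $offset = $end + 1;
        }
    }
}

// Stand-in page objects; real usage would pass the parsed document's pages.
$pages = [
    new class { public function getText(): string { return "a\nb\n"; } },
    new class { public function getText(): string { return "c\r\n\nd"; } },
];

foreach (iterateTextLines($pages) as $line) {
    echo $line, "\n";
}
```

Empty lines are preserved, which matters if blank lines are used as separators when parsing the output.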
