smalot / pdfparser

PdfParser, a standalone PHP library, provides various tools to extract data from a PDF file.
GNU Lesser General Public License v3.0

completely different output for table data (2.7.0 vs 2.8.0) #674

Open andus4n opened 7 months ago

andus4n commented 7 months ago

Description:

I've been using this library for more than a year, and until version 2.8.0 I didn't have a single issue with it. After updating to 2.8.0, I'm getting completely different output for the same PDF file. Unfortunately, this output can't be parsed to extract the data I'm interested in.

PDF input

c0

Expected output & actual output

2.7.0 (this is ok and can easily be parsed)

c1

2.8.0 (this can't be parsed)

c2

Code

file_put_contents('./test2.dat', (new \Smalot\PdfParser\Parser())->parseFile('./invoice.pdf')->getText());

andus4n commented 7 months ago

Never mind, I found logic that makes it work with 2.8.0, but this should still be investigated.

k00ni commented 7 months ago

CC @GreyWyvern you may be interested in this.

GreyWyvern commented 7 months ago

Unfortunately this results from the new algorithm in 2.8.0 being more exact about spacing and line-feeds. It helps make normally extracted text from paragraphs better, but the "logic" of text in tables suffers. :|

You can see that 2.8.0 is putting newlines in the output exactly where it sees them, and when the document moves the cursor back up to the line above, but the next cell over, it also interprets this to be where a newline should be added.

It's only because 2.7.0 was very lenient with spacing (the newlines in the cells are not enough to trigger a newline in the output) that the resulting text appears more "logical". I'm not sure how this would be fixed definitively, but we could:

andus4n commented 7 months ago

> Could we possibly detect if we're in a table? If so, we could change the spacing rules for text encountered in there. This is probably a long-shot though.
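For anyone who needs table rows back in the meantime, one possible workaround is to skip the plain-text output and regroup positioned text fragments by their vertical coordinate. The sketch below is only an illustration, not how the library handles spacing internally. It assumes each fragment has the shape `[[a, b, c, d, x, y], 'text']`, which is, as far as I know, roughly what pdfparser's `Page::getDataTm()` returns. The function name `groupFragmentsIntoRows`, the `$tolerance` default and the sample data are all made up for the example (PHP 7.4+ because of the arrow functions).

```php
<?php

/**
 * Group positioned text fragments into rows by their y coordinate.
 *
 * Assumption: each fragment looks like [[a, b, c, d, x, y], 'text'],
 * i.e. a six-value text matrix followed by the string it draws.
 *
 * @param array $fragments positioned fragments, in drawing order
 * @param float $tolerance largest y difference still treated as the same row
 *                         (hypothetical default, tune it per document)
 * @return string[] one tab-separated string per row, top to bottom
 */
function groupFragmentsIntoRows(array $fragments, float $tolerance = 2.0): array
{
    $rows = []; // each row: ['y' => float, 'cells' => [[x, text], ...]]

    foreach ($fragments as [$matrix, $text]) {
        $x = (float) $matrix[4];
        $y = (float) $matrix[5];

        // Attach the fragment to the first existing row that is close enough.
        $placed = false;
        foreach ($rows as &$row) {
            if (abs($row['y'] - $y) <= $tolerance) {
                $row['cells'][] = [$x, $text];
                $placed = true;
                break;
            }
        }
        unset($row);

        if (!$placed) {
            $rows[] = ['y' => $y, 'cells' => [[$x, $text]]];
        }
    }

    // PDF user space has its origin at the bottom left, so a larger y is higher up.
    usort($rows, fn (array $a, array $b) => $b['y'] <=> $a['y']);

    $lines = [];
    foreach ($rows as $row) {
        // Within a row, order the cells from left to right.
        usort($row['cells'], fn (array $a, array $b) => $a[0] <=> $b[0]);
        $lines[] = implode("\t", array_column($row['cells'], 1));
    }

    return $lines;
}

// Made-up fragments: a two-column table drawn column by column.
$fragments = [
    [[1, 0, 0, 1, 50, 700], 'Item'],
    [[1, 0, 0, 1, 50, 685], 'Widget'],
    [[1, 0, 0, 1, 200, 700], 'Price'],
    [[1, 0, 0, 1, 200, 685.5], '9.99'],
];

print_r(groupFragmentsIntoRows($fragments));
```

Because rows are rebuilt from coordinates, the order in which the PDF draws the cells no longer matters. The catch is the tolerance: too small and one visual row splits in two, too large and neighbouring rows merge.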

This sounds pretty good, but is it even possible with PDFs? Also, I have a slightly off-topic suggestion: it'd be great if you could implement some kind of line-by-line stream (like a generator) for getText, so that the whole text isn't loaded into memory at once.
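To make the generator idea concrete, here is a minimal sketch of what a line-by-line stream could look like in user code. It is not an existing pdfparser API: `iterateLines` is a hypothetical helper that takes any iterable of per-page strings. With the library, one could presumably feed it a generator that yields `$page->getText()` for each page from `$pdf->getPages()`. The sample page strings are invented.

```php
<?php

/**
 * Lazily yield text one line at a time from an iterable of per-page strings.
 *
 * Hypothetical helper, not part of pdfparser.
 *
 * @param iterable $pageTexts text of each page, ideally produced on demand
 * @return Generator yields one line (string) at a time
 */
function iterateLines(iterable $pageTexts): Generator
{
    foreach ($pageTexts as $pageText) {
        // Split on any common line ending; only one page's lines exist at a time.
        foreach (preg_split('/\r\n|\r|\n/', $pageText) as $line) {
            yield $line;
        }
    }
}

// Stand-in for real page text: a generator, so pages are produced one by one.
$pages = (function (): Generator {
    yield "Invoice 42\nItem\tPrice";
    yield "Widget\t9.99";
})();

foreach (iterateLines($pages) as $line) {
    echo $line, PHP_EOL;
}
```

This only avoids building one large string for the whole document. The parsed document object itself would still sit in memory, so a real streaming `getText` would need support inside the library.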