smalot / pdfparser

PdfParser, a standalone PHP library, provides various tools to extract data from a PDF file.
GNU Lesser General Public License v3.0

layout Maintain (as best as possible) the original physical layout #193

Open TnCoders opened 6 years ago

TnCoders commented 6 years ago

Hi, can you tell me how I can maintain (as best as possible) the original physical layout of the text, so that tabulation etc. is kept?

antwal commented 5 years ago

Hi, I have a problem with a PDF: the parser does not return the rows of a valid table separated by tabs, but returns them separated by whitespace instead.

rubenvanerk commented 4 years ago

For me, changing this value helped to separate the columns: https://github.com/smalot/pdfparser/blob/7f2d319eab7c5b198611cf1a3de13e0ac1dd8288/src/Smalot/PdfParser/PDFObject.php#L287

k00ni commented 4 years ago

Hi @rubenvanerk, what exactly has been changed here?

The code $text .= ' '; already exists in the current master branch: https://github.com/smalot/pdfparser/blob/master/src/Smalot/PdfParser/PDFObject.php#L287

rubenvanerk commented 4 years ago

@k00ni Ideally it would be nice to programmatically modify this value.

When parsing this file, the fourth line results in:

7 1 NZL/12 1988 Kent, Steven New Zealand 0:45,13 0.0 Q

It is difficult to separate the name Kent, Steven from the country New Zealand. Now, when I change the horizontal offset to, for example, "\t", I get:

7---1---NZL/12---1988---Kent, Steven---New Zealand---0:45,13---0.0---Q

Note: I replaced "\t" with --- in this example for demonstration purposes.

It suddenly becomes a lot easier to separate the columns.
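To illustrate the difference, here is a minimal sketch in plain PHP that splits both variants of the sample line. It does not call the library; the tab-separated string is reconstructed by hand from the example above (with --- turned back into "\t"), and the variable names are only illustrative.

```php
<?php
// Sample line from the comment above: once as currently extracted (single
// spaces), once as it would look with "\t" as the horizontal offset.
// The tabbed string is reconstructed by hand for illustration.
$spaced = '7 1 NZL/12 1988 Kent, Steven New Zealand 0:45,13 0.0 Q';
$tabbed = "7\t1\tNZL/12\t1988\tKent, Steven\tNew Zealand\t0:45,13\t0.0\tQ";

// Splitting on spaces also breaks up multi-word cells such as the name
// and the country, so column boundaries are lost.
$spacedColumns = explode(' ', $spaced);

// Splitting on tabs keeps multi-word cells intact.
$tabbedColumns = explode("\t", $tabbed);

echo count($spacedColumns), PHP_EOL; // 11 fragments
echo count($tabbedColumns), PHP_EOL; // 9 columns
echo $tabbedColumns[4], ' | ', $tabbedColumns[5], PHP_EOL; // Kent, Steven | New Zealand
```

With spaces, the name and the country end up spread over four fragments that cannot be reassembled reliably; with tabs, each cell is one array element.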

Overall, I think there should be some options to modify the behavior of the parser. I've also seen some issues (https://github.com/smalot/pdfparser/issues/132, https://github.com/smalot/pdfparser/issues/82) recommending manually changing this value: https://github.com/smalot/pdfparser/blob/master/src/Smalot/PdfParser/Font.php#L341

This has already been suggested here: https://github.com/smalot/pdfparser/issues/259

k00ni commented 4 years ago

I can't take care of it right now, sorry.

But I like the idea of overriding the value of getFontSpaceLimit, as mentioned in #259. If someone wants to suggest something via a pull request, I would be happy to help find a solution.
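For anyone wondering why such a limit matters, here is a rough, simplified sketch of the general idea. It is not the library's code. It assumes that text fragments alternate with numeric position adjustments, and that an adjustment below some threshold is treated as a word gap. The function joinFragments, its parameters and the sample values are all hypothetical, and the default of -50 is only an assumed value for illustration.

```php
<?php
// Hypothetical helper, NOT part of smalot/pdfparser: joins text fragments
// and inserts a separator wherever the numeric adjustment between two
// fragments is below the given limit (i.e. a large enough gap).
function joinFragments(array $elements, float $spaceLimit = -50.0, string $separator = ' '): string
{
    $text = '';
    foreach ($elements as $element) {
        if (is_string($element)) {
            $text .= $element;          // plain text fragment
        } elseif ($element < $spaceLimit) {
            $text .= $separator;        // gap large enough to count as a space
        }
        // smaller adjustments (kerning) are ignored
    }

    return $text;
}

// Made-up input: -20 stands for kerning inside a word, -300 for a word gap.
$elements = ['K', -20, 'ent,', -300, 'Steven'];

echo joinFragments($elements), PHP_EOL;            // limit -50: "Kent, Steven"
echo joinFragments($elements, -10.0), PHP_EOL;     // limit -10: "K ent, Steven"
echo joinFragments($elements, -50.0, "\t"), PHP_EOL; // same gaps, tab as separator
```

In this sketch, a limit that is too close to zero splits words at kerning adjustments, while the separator argument plays the role of the configurable horizontal offset discussed earlier in the thread.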