Open TnCoders opened 6 years ago
Hi, I have a problem with a PDF: the parser does not return the rows of a valid table separated by tabs, but returns lines separated by plain whitespace instead.
For me, changing this value helped separating columns: https://github.com/smalot/pdfparser/blob/7f2d319eab7c5b198611cf1a3de13e0ac1dd8288/src/Smalot/PdfParser/PDFObject.php#L287
Hi @rubenvanerk, what exactly has been changed here? The code $text .= ' '; already exists in the current master branch:
https://github.com/smalot/pdfparser/blob/master/src/Smalot/PdfParser/PDFObject.php#L287
@k00ni Ideally it would be nice to programmatically modify this value.
When parsing this file, the fourth line would result in:
7 1 NZL/12 1988 Kent, Steven New Zealand 0:45,13 0.0 Q
It is difficult to separate the name Kent, Steven from the country New Zealand. Now, when I change the horizontal offset to, for example, "\t":
7---1---NZL/12---1988---Kent, Steven---New Zealand---0:45,13---0.0---Q
Note: I replaced "\t" with --- in this example for demonstration purposes.
It suddenly becomes a lot easier to separate the columns.
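As a minimal sketch of why this helps, assuming the parser really emitted a "\t" between cells: the row can then be split with a plain explode. The helper name splitColumns is hypothetical and not part of smalot/pdfparser.

```php
<?php
// Hypothetical helper (not part of smalot/pdfparser): split one line of
// extracted text into columns, assuming the parser emitted "\t" between cells.
function splitColumns(string $line): array
{
    // Trim each cell so stray spaces around a tab do not leak into the values.
    return array_map('trim', explode("\t", $line));
}

// The row from the example above, with real tab characters between the cells.
$line = "7\t1\tNZL/12\t1988\tKent, Steven\tNew Zealand\t0:45,13\t0.0\tQ";
$columns = splitColumns($line);

echo $columns[4], PHP_EOL; // Kent, Steven
echo $columns[5], PHP_EOL; // New Zealand
```

With single spaces as the separator, the same split would break "Kent, Steven" and "New Zealand" into two cells each, which is exactly the ambiguity described above.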
Overall, I think there should be some options to modify the behavior of the parser. I've also seen some issues (https://github.com/smalot/pdfparser/issues/132, https://github.com/smalot/pdfparser/issues/82) that recommend manually changing this value: https://github.com/smalot/pdfparser/blob/master/src/Smalot/PdfParser/Font.php#L341 (this has already been suggested here: https://github.com/smalot/pdfparser/issues/259).
I can't take care of it right now, sorry.
But I like the idea of making the value of getFontSpaceLimit overridable, as mentioned in #259. If someone wants to suggest something via a pull request, I would be happy to help find a solution.
Hi, can you tell me how I can maintain (as well as possible) the original physical layout of the text, i.e. keep the tabulation etc.?