smalot / pdfparser

PdfParser, a standalone PHP library, provides various tools to extract data from a PDF file.
GNU Lesser General Public License v3.0

Large White Space between words #604

Open Bhushannn opened 1 year ago

Bhushannn commented 1 year ago

Description:

I am getting large spaces between words and also within words in the extracted text. Please help me fix this.

Demo output = "project for building a Fo od tech mobile application."

k00ni commented 1 year ago

You should check out https://github.com/smalot/pdfparser/blob/master/doc/CustomConfig.md and experiment with the parameters.
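For readers unfamiliar with the custom config, here is a minimal sketch of how the whitespace-related settings can be passed to the parser. The file name `document.pdf` is hypothetical, and the values shown are illustrative starting points for experimentation, not recommended settings; see CustomConfig.md for what each option does.

```php
<?php
// Assumes the library was installed via Composer.
require 'vendor/autoload.php';

use Smalot\PdfParser\Config;
use Smalot\PdfParser\Parser;

$config = new Config();

// Illustrative values only: tune these against your own PDF.
// Font space limit influences when a gap between glyphs is treated as a space.
$config->setFontSpaceLimit(-60);
// Horizontal offset is the string inserted for horizontal positioning commands.
$config->setHorizontalOffset('');

// Pass the config as the second constructor argument.
$parser = new Parser([], $config);

// 'document.pdf' is a hypothetical path used for this sketch.
$pdf = $parser->parseFile('document.pdf');

echo $pdf->getText();
```

Changing one setting at a time and comparing the output makes it easier to tell which option, if any, affects a given PDF.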

yosbeck commented 1 year ago

I have read everything on the forum about the white space issue and tried all the recommended settings (setHorizontalOffset, setFontSpaceLimit, etc.). I have tried these settings both through the config and directly in the code. Nothing seems to have any effect, and the white spaces still appear in the output.

My PDFs contain regular English text and were created by saving an MS Word document as PDF. The font used in Word is Calibri 11. I use PdfParser v2.5.0.

The white spaces appear at random places, between random letters. The exact same word is extracted correctly in many places and has a white space in another. I can't detect any pattern that would explain why it occurs in the specific places that it does. Any other ideas?

k00ni commented 1 year ago

Sorry to hear that. It seems to be another instance of the white space related problems in the parser (a couple of which have already been reported). I have no further ideas on how to solve it.