smalot / pdfparser

PdfParser, a standalone PHP library, provides various tools to extract data from a PDF file.
GNU Lesser General Public License v3.0

No line endings #175

Open kurdtpage opened 7 years ago

kurdtpage commented 7 years ago

The output of pdfparser is a string consisting of one long line of text. There are no line endings (CR, LF, \r, \n, etc.), even when the PDF has clear line terminations.

am3000 commented 6 years ago

+1

SteveThePest commented 6 years ago

+1

luigif commented 6 years ago

This happens to me with PDF files generated with MS Word. A dirty fix is to change one line in the getText function of the Object class:

if ($current_position_tm['y'] !== false) {
    $delta = abs(floatval($y) - floatval($current_position_tm['y']));
    if ($delta > 10) {
        $text .= "\n";
    }
}

After some debugging, I found that $delta was sometimes 0 and sometimes greater than 7, so changing the test to ($delta > 7) correctly adds the newlines.

I imagine this is due to font-specific issues, so the correct number might vary and this is not a permanent fix, but it might help you convert Word-generated PDFs.
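
To illustrate the idea behind the workaround in isolation, here is a minimal standalone sketch. It is not pdfparser code: the function name joinFragmentsWithLineBreaks, the fragment array shape ('y' and 'text' keys), and the $threshold parameter are all hypothetical, chosen only to show how a vertical-position delta can decide where a newline goes. Making the threshold a parameter reflects the point above that the right value probably depends on the font and the producing application.

```php
<?php
// Hypothetical helper, NOT part of pdfparser. It mimics the delta test from
// the workaround: a newline is emitted whenever the vertical position jumps
// by more than $threshold between two consecutive text fragments.
function joinFragmentsWithLineBreaks(array $fragments, float $threshold = 7.0): string
{
    $text = '';
    $previousY = null; // no previous position before the first fragment

    foreach ($fragments as $fragment) {
        $y = (float) $fragment['y'];

        // Same comparison as in the workaround, with a configurable limit.
        if ($previousY !== null && abs($y - $previousY) > $threshold) {
            $text .= "\n";
        }

        $text .= $fragment['text'];
        $previousY = $y;
    }

    return $text;
}

// Made-up sample data: the third fragment sits 7.5 units below the first two.
$fragments = [
    ['y' => 700.0, 'text' => 'Hello '],
    ['y' => 700.0, 'text' => 'world'],
    ['y' => 692.5, 'text' => 'Second line'],
];

echo joinFragmentsWithLineBreaks($fragments, 7.0);
```

With a threshold of 7 the 7.5-unit jump produces a line break before "Second line"; with the original threshold of 10 the same input would come out as a single line, which matches the behaviour described in this issue.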