pyPdf.pdf.PageObject.extractText() incorrectly concatenates words across line breaks

ke1g commented 11 years ago

At least in the PDF I'm looking at, the TD operator is used to move from the end of one line to the start of the next. extractText() ignores this operator, so if one line ends with the last letter of a word and the next line begins with the first letter of a word, the two characters end up immediately adjacent in the resulting text, producing a new "word" that is not present in the document.

A specific case I'm seeing is a line ending with "phase" followed by a line beginning with "insufficiency", so the resulting string contains "phaseinsufficiency", a non-word that does not occur in the document. I'm using the result for full-text search, so this is a problem: a search for "phase", for "insufficiency", or for "phase insufficiency" will fail.

I have a patch (if needed) that adds "TD" to the operators extractText() processes. It checks whether the y operand (operands[1]) is non-zero, whether text is non-empty, and whether text ends with a non-whitespace character. If all of these are true, a newline is appended to text. This works and is sufficient for my needs.

Since this is a change in behavior, I have also added an argument to extractText() called split_on_y_change with a default value of False, making the default behavior the old behavior. One could do something similar for x changes and vertical languages, but I don't know enough about such languages to propose the details.
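
As a rough sketch of the logic described above (not the actual patch): the function below is a hypothetical stand-in for extractText() that works on plain (operands, operator) pairs instead of a real pyPdf content stream, and it handles only the Tj, T* and TD operators. The name extract_text and the sample operations are assumptions made for illustration.

```python
def extract_text(operations, split_on_y_change=False):
    """Hypothetical stand-in for extractText().

    `operations` is a list of (operands, operator) pairs. Only the
    operators needed to illustrate the TD handling are covered.
    """
    text = ""
    for operands, operator in operations:
        if operator == "Tj":
            # Show a text string.
            text += operands[0]
        elif operator == "T*":
            # Move to the start of the next line.
            text += "\n"
        elif operator == "TD" and split_on_y_change:
            # Operands are [tx, ty]. A non-zero ty means the text
            # position moved vertically, i.e. onto another line.
            y_changed = operands[1] != 0
            if y_changed and text and not text[-1].isspace():
                text += "\n"
    return text


# Illustrative operations: two lines joined only by a TD move.
ops = [
    (["phase"], "Tj"),
    ([0, -14], "TD"),
    (["insufficiency"], "Tj"),
]

print(repr(extract_text(ops)))                          # 'phaseinsufficiency'
print(repr(extract_text(ops, split_on_y_change=True)))  # 'phase\ninsufficiency'
```

With the flag left at its default of False, the TD operator is skipped and the old concatenated output is preserved. Passing split_on_y_change=True inserts the newline only when the text so far does not already end in whitespace.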

Let me know if you want my patch, and whether you can accept a unified diff somewhere, or whether you need a pull request.

Bill

mfenniak commented 11 years ago

Hi Bill. I'm no longer maintaining pyPdf, but the project has been forked as PyPDF2 (https://github.com/knowah/PyPDF2/) and is being maintained under that new name. Perhaps the new maintainer can help you out.

QCTW commented 5 years ago

Found the same issue in PyPDF2.

lauramsfernandes commented 3 years ago

Me too. Did you find any solution?

mfenniak / pyPdf

pyPdf.pdf.PageObject.extractText() incorrectly concatenates words across line breaks #46