pdfminer / pdfminer.six

Community maintained fork of pdfminer - we fathom PDF
https://pdfminersix.readthedocs.io
MIT License

extract text from rotated pdf does not work as expected #592

Status: Open. jugmac00 opened this issue 3 years ago.

jugmac00 commented 3 years ago

Bug report

When I extract text from a rotated PDF, I get single characters interleaved with lots of control characters (mostly newlines), where I would expect just readable text.

from pdfminer.high_level import extract_text
print(bytes(extract_text("test.pdf"), "utf-8"))

gives

b')\ns\nr\ne\ns\nu\n6\n1\n(\n \n\n \n\nX\nL\nN\nh\n\n \n\nt\ni\n\nw\n\n \n\n \nr\ne\nv\nr\ne\nS\ne\nc\nn\ne\nc\nL\n\ni\n\n)\n\nM\nL\n(\n \n\nA\nE\nM\nF\n-\nQ\n\nI\n \n\nI\n\nS\nP\nA\n\n0\n-\n2\n0\n0\n8\n1\ne\nc\nn\ne\nc\nL\n\n \n\ni\n\n\x0c'

compared to...

from PyPDF2 import PdfFileReader
reader = PdfFileReader("test.pdf")
page = reader.getPage(0)
print(bytes(page.extractText(), "utf-8"))

gives

b'Licence 18002-0\nAPIS IQ-FMEA (LM)\nLicence Server with NLX (16 users)\n'

I converted the output to bytes so that the control characters are visible and the two results are easier to compare.
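Looking at the pdfminer.six output more closely, it appears to be roughly the expected text in reverse order, one character per line. The following is only a diagnostic sketch to illustrate that observation, not a fix: the helper name `undo_char_per_line` is made up for this example, and the input is the byte string reported above, pasted in verbatim.

```python
# Output reported above for the pdfminer.six run, copied verbatim.
garbled = (
    b')\ns\nr\ne\ns\nu\n6\n1\n(\n \n\n \n\n'
    b'X\nL\nN\nh\n\n \n\nt\ni\n\nw\n\n \n\n \n'
    b'r\ne\nv\nr\ne\nS\ne\nc\nn\ne\nc\nL\n\ni\n\n'
    b')\n\nM\nL\n(\n \n\nA\nE\nM\nF\n-\nQ\n\nI\n \n\nI\n\n'
    b'S\nP\nA\n\n0\n-\n2\n0\n0\n8\n1\ne\nc\nn\ne\nc\nL\n\n \n\ni\n\n\x0c'
).decode("utf-8")


def undo_char_per_line(text: str) -> str:
    """Hypothetical helper: drop newlines and form feeds, then reverse the rest."""
    chars = [c for c in text if c not in "\n\x0c"]
    return "".join(reversed(chars))


recovered = undo_char_per_line(garbled)
print(recovered)
```

The result contains recognisable fragments such as "18002-0", "IQ-FMEA (LM)" and "Server", although some characters and spaces end up displaced, so this cannot serve as a general workaround. It does suggest that the rotated glyphs are being emitted as separate one-character lines. pdfminer.six's `LAParams` has a `detect_vertical` option that may be worth trying here; I have not verified it against test.pdf.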

test.pdf

blackelk commented 3 years ago

+1