Open jugmac00 opened 3 years ago
Bug report
When I extract text from a rotated PDF, I get single characters interleaved with lots of control characters, where I would expect plain readable text.
from pdfminer.high_level import extract_text
print(bytes(extract_text("test.pdf"), "utf-8"))
gives
b')\ns\nr\ne\ns\nu\n6\n1\n(\n \n\n \n\nX\nL\nN\nh\n\n \n\nt\ni\n\nw\n\n \n\n \nr\ne\nv\nr\ne\nS\ne\nc\nn\ne\nc\nL\n\ni\n\n)\n\nM\nL\n(\n \n\nA\nE\nM\nF\n-\nQ\n\nI\n \n\nI\n\nS\nP\nA\n\n0\n-\n2\n0\n0\n8\n1\ne\nc\nn\ne\nc\nL\n\n \n\ni\n\n\x0c'
compared to...
from PyPDF2 import PdfFileReader
reader = PdfFileReader("test.pdf")
page = reader.getPage(0)
print(bytes(page.extractText(), "utf-8"))
gives
b'Licence 18002-0\nAPIS IQ-FMEA (LM)\nLicence Server with NLX (16 users)\n'
I converted the output to bytes so it is easier to read.
test.pdf
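To make the pattern in the pdfminer output easier to see, here is a small diagnostic sketch. It only uses a fragment copied from the start of the output above; `undo_vertical_reversed` is a hypothetical helper name, not part of pdfminer. It drops the line breaks and form feeds and reverses the remaining characters, which suggests the rotated text comes out one character per line and roughly back to front.

```python
# Fragment copied from the start of the pdfminer output reported above.
raw = ")\ns\nr\ne\ns\nu\n6\n1\n(\n \n\n \n\nX\nL\nN\n"


def undo_vertical_reversed(text: str) -> str:
    """Drop line breaks and form feeds, then reverse the character order.

    Hypothetical helper for inspecting the output; it is not a pdfminer API.
    """
    chars = [c for c in text if c not in "\n\x0c"]
    return "".join(reversed(chars))


print(repr(undo_vertical_reversed(raw)))
# prints 'NLX  (16users)'
```

This is not a workaround. Spacing is lost (compare "(16 users)" in the PyPDF2 output), and further into the output the order is not an exact reversal: "ecnecL\n\ni" puts the "i" of "Licence" in the wrong place.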