jsvine / pdfplumber

Plumb a PDF for detailed information about each char, rectangle, line, et cetera — and easily extract text and tables.
MIT License
6.1k stars, 625 forks

Doesn't work for rotated page #848

Open Tobeabellwether opened 1 year ago

Tobeabellwether commented 1 year ago

Describe the bug

When I use page.extract_text() to extract text from a page rotated 90 degrees, the result is just garbled words.

jsvine commented 1 year ago

Thanks for flagging this @Tobeabellwether. That makes sense, given the approach pdfplumber takes to extracting text. I think adding support for rotated pages would be a good addition to the library.

OrianeN commented 1 year ago

I have a similar issue where parts of the text are rotated 90 degrees (on a portrait page), as shown in the attached image.

Copy-pasting the text manually works fine, but the .extract_text() method returns it in reversed order and badly segmented:

OHW
A door-to-door polio vaccination
©
campaign in Yemen :otohP

I'll find a workaround, but I agree this would be a great new feature for the library!
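One possible workaround for the rotated-credit case above, offered as a rough sketch rather than anything the library provides: collect the characters that are not upright, group them into vertical columns by x position, and read each column from bottom to top. It assumes the char dicts carry the `text`, `x0`, `top`, and `upright` keys that pdfplumber documents for `page.chars`. The helper name `read_vertical_text`, the `x_tolerance` default, and the bottom-to-top reading direction are assumptions made for illustration.

```python
def read_vertical_text(chars, x_tolerance=3.0, bottom_to_top=True):
    """Join non-upright chars into strings, one string per vertical column.

    `chars` is a list of dicts shaped like pdfplumber's `page.chars`
    (keys used here: "text", "x0", "top", "upright").
    This is a hypothetical helper, not part of pdfplumber.
    """
    # Keep only characters flagged as not upright (i.e. rotated text).
    rotated = [c for c in chars if not c.get("upright", True)]

    # Cluster characters into columns: neighbours whose x0 values differ
    # by at most x_tolerance are treated as the same vertical line.
    rotated.sort(key=lambda c: c["x0"])
    columns = []
    for c in rotated:
        if columns and abs(c["x0"] - columns[-1][-1]["x0"]) <= x_tolerance:
            columns[-1].append(c)
        else:
            columns.append([c])

    # "top" grows downward on the page, so bottom-to-top reading order
    # means sorting by "top" in descending order.
    lines = []
    for col in columns:
        col.sort(key=lambda c: c["top"], reverse=bottom_to_top)
        lines.append("".join(c["text"] for c in col))
    return lines


# Synthetic sample: "Photo" and "WHO" written bottom-to-top in two columns,
# plus one upright character that should be ignored.
sample_chars = [
    {"text": "o", "x0": 10.0, "top": 76.0, "upright": False},
    {"text": "t", "x0": 10.0, "top": 82.0, "upright": False},
    {"text": "o", "x0": 10.0, "top": 88.0, "upright": False},
    {"text": "h", "x0": 10.0, "top": 94.0, "upright": False},
    {"text": "P", "x0": 10.0, "top": 100.0, "upright": False},
    {"text": "O", "x0": 30.0, "top": 48.0, "upright": False},
    {"text": "H", "x0": 30.0, "top": 54.0, "upright": False},
    {"text": "W", "x0": 30.0, "top": 60.0, "upright": False},
    {"text": "A", "x0": 50.0, "top": 10.0, "upright": True},
]

if __name__ == "__main__":
    print(read_vertical_text(sample_chars))  # ['Photo', 'WHO']
```

With a real file, the same helper could be fed `pdf.pages[0].chars` after `pdfplumber.open(...)`. Note that the sketch only recovers spaces if the PDF stores them as explicit characters, and text rotated in the other direction would need `bottom_to_top=False`.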