How to extract columns separately

jsvine / pdfplumber

Plumb a PDF for detailed information about each char, rectangle, line, et cetera — and easily extract text and tables.

MIT License


How to extract columns separately #228

Closed lilygrier closed 4 years ago

lilygrier commented 4 years ago

My apologies if this is addressed elsewhere (I'm somewhat new to deciphering documentation). I'm working with PDFs like this one (http://www.fao.org/ag/locusts/common/ecg/2536/en/DL498e.pdf) that have text laid out in two columns. When I try to extract the text, the two columns get merged into one. Would the solution be to crop the page down the middle and read in each side separately? If so, how would I determine the location of the middle of the page? Thanks so much!

jsvine commented 4 years ago

Hi @lilygrier, and thanks for your interest in this library. Cropping the page in half sounds like a reasonable approach. If the page is truly split down the exact middle, you should be able to determine the location via page.width // 2. If it's not exactly in the middle, hopefully it's in a fairly consistent place, in which case you could hardcode that x-value.
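For illustration, here is a minimal sketch of that cropping approach. It is not code from the thread: the helper names `column_bboxes` and `extract_columns` and the `split_x` parameter are hypothetical, and it assumes a simple two-column page with a single vertical split. It relies on pdfplumber's `page.bbox`, `page.crop(...)` and `extract_text()`, with bounding boxes given as `(x0, top, x1, bottom)`.

```python
def column_bboxes(page_bbox, split_x=None):
    """Split a page bounding box into left and right halves.

    page_bbox is (x0, top, x1, bottom), the layout pdfplumber uses.
    split_x is a hypothetical override for pages whose gutter is not
    exactly in the middle; by default the horizontal midpoint is used.
    """
    x0, top, x1, bottom = page_bbox
    if split_x is None:
        split_x = (x0 + x1) / 2
    if not x0 < split_x < x1:
        raise ValueError("split_x must fall strictly inside the page")
    left = (x0, top, split_x, bottom)
    right = (split_x, top, x1, bottom)
    return left, right


def extract_columns(path, page_number=0, split_x=None):
    """Return (left_text, right_text) for one page of a two-column PDF."""
    import pdfplumber  # third-party; imported here so the helper above has no dependencies

    with pdfplumber.open(path) as pdf:
        page = pdf.pages[page_number]
        left_bbox, right_bbox = column_bboxes(page.bbox, split_x)
        # extract_text() may return None for an empty region in some versions
        left_text = page.crop(left_bbox).extract_text() or ""
        right_text = page.crop(right_bbox).extract_text() or ""
    return left_text, right_text
```

Usage would look something like `left, right = extract_columns("DL498e.pdf")`, where the filename is a placeholder for a local copy of the PDF. If the gutter is off-center but consistent across pages, passing a hardcoded `split_x` covers that case. Full-width elements such as a page title would be cut in two by this split, so they may need separate handling.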