jsvine / pdfplumber

Plumb a PDF for detailed information about each char, rectangle, line, et cetera — and easily extract text and tables.
MIT License

How to extract data from rectangles? #333

Closed mugiwara85 closed 3 years ago

mugiwara85 commented 3 years ago

Hi!

I have a PDF (unfortunately I can't share it). It contains multiple rectangles in a cascaded style. Something like this:

In each rectangle is the text I need. I can find all rectangles on a given page like this:

    for page in range(pagecount):
        current_page = pdf.pages[page]
        print("rectangles=", current_page.rects)

But how can I extract the text from them? extract_text() extracts text from the whole page, but I only need the text inside the rectangles.

Thanks in advance!

samkit-jain commented 3 years ago

Hi @mugiwara85, thanks for your interest in the library. If you have the coordinates of the rectangles, you can use the page.crop(...) method to crop the page and then run .extract_text(...) on the cropped page, which will give you the text inside the rectangle.
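
A minimal sketch of that approach, assuming each entry in page.rects carries the x0, top, x1 and bottom keys that pdfplumber gives its rectangle objects. The helper names (rect_to_bbox, text_in_rects, text_in_rects_for_pdf) are hypothetical and not part of the library:

```python
def rect_to_bbox(rect):
    """Turn a pdfplumber rect dict into an (x0, top, x1, bottom) tuple."""
    return (rect["x0"], rect["top"], rect["x1"], rect["bottom"])


def text_in_rects(page):
    """Return the text found inside each rectangle, in page.rects order."""
    texts = []
    for rect in page.rects:
        # crop() returns a new page object limited to the bounding box.
        cropped = page.crop(rect_to_bbox(rect))
        texts.append(cropped.extract_text())
    return texts


def text_in_rects_for_pdf(path):
    """Map each page number to the list of texts inside its rectangles."""
    # Third-party dependency, imported here so the helpers above
    # only rely on the page object they are given.
    import pdfplumber

    with pdfplumber.open(path) as pdf:
        return {page.page_number: text_in_rects(page) for page in pdf.pages}
```

Two caveats probably apply to a cascaded layout. Cropping to an outer rectangle also returns the text of any rectangles nested inside it, so the results may need de-duplicating. Also, crop() keeps objects that fall only partly inside the box; page.within_bbox(...) takes the same bounding box but keeps only objects that lie entirely within it, which may suit tightly packed rectangles better.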