jsvine / pdfplumber

Plumb a PDF for detailed information about each char, rectangle, line, et cetera — and easily extract text and tables.
MIT License

How to extract a table from a picture #323

Closed a417886 closed 3 years ago

a417886 commented 3 years ago

How can I extract a table from a picture? I want to use my own bbox.

samkit-jain commented 3 years ago

Hi @a417886, I appreciate your interest in the library. I don't see any attached image. Could you please edit your message to add more details and a PDF?

a417886 commented 3 years ago

I have used an OCR engine to get the text and bboxes of my table (a picture, not a PDF), and I want to recognize the table structure. I'm not sure whether pdfplumber can do this.

samkit-jain commented 3 years ago

pdfplumber does not work on scanned PDFs. If you are able to get the coordinates of every character as well from your OCR engine, you can perhaps try the following:

  1. Add a text layer on top of the scanned PDF based on the coordinates provided by OCR.
  2. Crop the PDF per the bounding box of the table given by OCR.
  3. Run table extraction with pdfplumber using the "text" strategy:
    {
        "horizontal_strategy": "text",
        "vertical_strategy": "text"
    }
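
Steps 2 and 3 could be sketched roughly as follows. This is a minimal sketch, not a tested recipe: it assumes step 1 is already done (the PDF has a text layer), that the OCR boxes are already in pdfplumber's `(x0, top, x1, bottom)` convention in PDF points, and the names `union_bbox`, `extract_table_in_bbox`, and the file name in the usage note are hypothetical.

```python
# Settings from the reply above: infer row and column boundaries from text alignment.
TABLE_SETTINGS = {
    "horizontal_strategy": "text",
    "vertical_strategy": "text",
}


def union_bbox(boxes):
    """Return the smallest (x0, top, x1, bottom) box enclosing all given boxes."""
    x0s, tops, x1s, bottoms = zip(*boxes)
    return (min(x0s), min(tops), max(x1s), max(bottoms))


def extract_table_in_bbox(pdf_path, bbox, page_index=0):
    """Crop one page to bbox and run text-strategy table extraction on the crop."""
    # Third-party dependency, imported lazily so union_bbox works without it.
    import pdfplumber

    with pdfplumber.open(pdf_path) as pdf:
        page = pdf.pages[page_index]
        cropped = page.crop(bbox)  # step 2: keep only the table region
        # Step 3: returns a list of rows (lists of cell strings), or None if no table is found.
        return cropped.extract_table(table_settings=TABLE_SETTINGS)
```

For example, `extract_table_in_bbox("ocr_with_text_layer.pdf", union_bbox(word_boxes))` would crop to the area covered by the OCR word boxes, where `word_boxes` is whatever list of boxes your OCR engine produced. If the OCR coordinates are in image pixels, they would first need to be scaled to PDF points.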