jsvine / pdfplumber

Plumb a PDF for detailed information about each char, rectangle, line, et cetera — and easily extract text and tables.
MIT License
6.57k stars 659 forks

How to extract tables in scientific articles #438

Closed: Roy-Kid closed 3 years ago

Roy-Kid commented 3 years ago

Hi, I want to extract the data tables from a published scientific paper, but it doesn't seem to work because the tables don't have separators. Here is what the tables look like: [image] [image]

I have attached the test PDF: Bis(pyrrolidene) Schhiff Base Aluminum Comlexes as isoselective biased initiators for the controlled.pdf Can this feature be supported in the future?

samkit-jain commented 3 years ago

Hi @Roy-Kid, I appreciate your interest in the library. For tables that don't have vertical lines separating the columns, there are a couple of things you can try:

  1. Use the text strategy
  2. Explicitly specify the coordinates of the columns

You can learn more about the various table extraction strategies at https://github.com/jsvine/pdfplumber#table-extraction-settings Note that for the kind of PDF you have shared, the text strategy may not yield the best results when applied to the full page, because the other text on the page can interfere. The best results come from cropping the page to the table region first.
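As a rough illustration of the two options above, here is a minimal sketch. The helper names (`text_strategy_settings`, `explicit_columns_settings`, `extract_table_from_region`) and any coordinates you pass in are hypothetical; only the `table_settings` keys, `page.crop`, and `page.extract_table` come from pdfplumber itself.

```python
def text_strategy_settings():
    """Option 1: infer both column and row boundaries from text alignment."""
    return {"vertical_strategy": "text", "horizontal_strategy": "text"}


def explicit_columns_settings(column_xs):
    """Option 2: supply the x-coordinates of the column boundaries yourself.

    column_xs is a list of x positions in PDF points; the values are
    something you would measure for your own document.
    """
    if len(column_xs) < 2:
        raise ValueError("need at least two x-coordinates to bound a column")
    return {
        "vertical_strategy": "explicit",
        "explicit_vertical_lines": sorted(column_xs),
        "horizontal_strategy": "text",
    }


def extract_table_from_region(pdf_path, page_number, settings, bbox=None):
    """Run table extraction on one page, optionally cropped to bbox.

    bbox is (x0, top, x1, bottom); page_number is 1-based.
    """
    import pdfplumber  # imported lazily so the helpers above work without it

    with pdfplumber.open(pdf_path) as pdf:
        page = pdf.pages[page_number - 1]
        if bbox is not None:
            page = page.crop(bbox)
        return page.extract_table(table_settings=settings)
```

For example, `extract_table_from_region("paper.pdf", 3, text_strategy_settings(), bbox=(40, 120, 560, 300))` would try the text strategy on a cropped part of page 3; the file name and bounding box here are placeholders, not values taken from the attached PDF.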

For the cropping, you will have to come up with logic to first identify where the tabular regions are in the PDF. Using pdfplumber, one approach could be:

  1. Find all the rect objects on a page and filter out the ones that are too small to be the header of a table.
  2. For each remaining rect object:
    1. Crop the page to keep only the bottom portion, leaving out the leftmost and rightmost areas as well.
    2. Find the first curve object. This will act as the bottom of the table. (If none is found, assume the bottom of the page is the bottom of the table.)
    3. Crop the page again to remove the portion below the bottom of the table.
    4. Run text-based table extraction on the final cropped page.
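The steps above could be sketched roughly as follows. This is an interpretation, not the exact code used for the screenshots: the thresholds `min_header_width` and `side_margin` are made-up defaults you would need to tune, "bottom portion" is taken to mean the area below the header rect, the curves are filtered by position on the full page rather than by re-reading them from a cropped page, and the page's bounding box is assumed to start at (0, 0).

```python
def find_table_bboxes(rects, curves, page_width, page_height,
                      min_header_width=100.0, side_margin=20.0):
    """Return (x0, top, x1, bottom) boxes for likely table bodies.

    rects and curves are lists of dicts with "x0", "x1", "top" and
    "bottom" keys, as found in pdfplumber's page.rects and page.curves.
    """
    x0, x1 = side_margin, page_width - side_margin  # trim left/right areas
    bboxes = []
    for rect in rects:
        # Step 1: skip rects too narrow to be a table header.
        if rect["x1"] - rect["x0"] < min_header_width:
            continue
        top = rect["bottom"]  # step 2.1: keep only what lies below the header
        # Step 2.2: the first (topmost) curve below the header ends the table.
        below = [c["top"] for c in curves
                 if c["top"] >= top and c["x1"] > x0 and c["x0"] < x1]
        # Fall back to the bottom of the page when no curve is found.
        bottom = min(below) if below else page_height
        if bottom > top:  # step 2.3: the final crop box
            bboxes.append((x0, top, x1, bottom))
    return bboxes


def extract_cropped_tables(pdf_path, page_number):
    """Step 2.4: text-based extraction on each cropped region (1-based page)."""
    import pdfplumber  # imported lazily; find_table_bboxes does not need it

    settings = {"vertical_strategy": "text", "horizontal_strategy": "text"}
    with pdfplumber.open(pdf_path) as pdf:
        page = pdf.pages[page_number - 1]
        bboxes = find_table_bboxes(page.rects, page.curves,
                                   page.width, page.height)
        return [page.crop(b).extract_table(table_settings=settings)
                for b in bboxes]
```

Keeping the box computation separate from the pdfplumber calls makes it easy to check the heuristic on hand-written rect and curve dicts before running it on a real page.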

For page 3 of the PDF, the rect and curve objects would be [image] and [image].

This is not perfect, but it covers the majority of the tables in the PDF you shared.

Roy-Kid commented 3 years ago

Thanks for your kind help! As far as you know, is there a library that can do this perfectly? If not, how hard would it be to develop one with OpenCV? I need to develop a paper inspection tool to mine the data automatically.

samkit-jain commented 3 years ago

You can try camelot, tabula-py, and pdftables.