Closed Roy-Kid closed 3 years ago
Hi @Roy-Kid, I appreciate your interest in the library. For tables that don't have vertical lines separating the columns, you can try the `text` strategy. You can read more about the various table extraction strategies at https://github.com/jsvine/pdfplumber#table-extraction-settings

Using the `text` strategy on the full page may not yield the best results for the kind of PDF you have shared, though, because the other text on the page can interfere. You get the best results when the page is first cropped to the table region.
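To illustrate the crop-then-extract idea, here is a minimal sketch. It assumes you already know the bounding box of the table; the helper name `extract_text_table` and the choice of the `text` strategy for both directions are illustrative assumptions, not a recommended configuration.

```python
# Settings that infer both column and row boundaries from word positions
# rather than from drawn lines (an assumption suited to borderless tables).
TEXT_TABLE_SETTINGS = {
    "vertical_strategy": "text",
    "horizontal_strategy": "text",
}


def extract_text_table(pdf_path, page_number, bbox):
    """Crop one page to bbox and extract a table using the text strategy.

    bbox is (x0, top, x1, bottom) in page coordinates.
    page_number is 1-based. This helper is hypothetical, not part of pdfplumber.
    """
    import pdfplumber  # imported lazily so the settings can be reused without it

    with pdfplumber.open(pdf_path) as pdf:
        page = pdf.pages[page_number - 1]
        cropped = page.crop(bbox)
        return cropped.extract_table(table_settings=TEXT_TABLE_SETTINGS)
```

Cropping first means that only words inside the table region contribute to the inferred column and row boundaries.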
For the cropping, you will have to come up with some logic to first identify where the tabular regions are in the PDF. Using `pdfplumber`, one approach could be:

1. Find the `rect` objects on a page and filter out the ones that are too small to be the header of a table.
2. Find the `curve` object. This will act as the bottom of the table. (If not found, assume the bottom of the page to be the bottom of the table.)

For page 3 in the PDF, the `rect` and `curve` objects would be [image] and [image].
This is not perfect, but it covers the majority of the tables in the PDF you shared.
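The cropping logic described above could be sketched roughly as follows. This is only an approximation under stated assumptions: the `min_header_width` threshold of 200 points is a guess, the function names are hypothetical, and taking the nearest `curve` below each header as the table bottom is one interpretation of the approach.

```python
def find_table_bboxes(rects, curves, page_height, min_header_width=200.0):
    """Guess table regions from header rects and the curves below them.

    rects and curves are lists of dicts with pdfplumber-style keys
    (x0, x1, top, bottom, width). Returns (x0, top, x1, bottom) tuples.
    """
    # Keep only rects wide enough to plausibly be a table header.
    headers = [r for r in rects if r["width"] >= min_header_width]
    headers.sort(key=lambda r: r["top"])

    bboxes = []
    for header in headers:
        # Curves that start at or below the bottom edge of the header.
        below = [c["bottom"] for c in curves if c["top"] >= header["bottom"]]
        # The nearest such curve closes the table; otherwise use the page bottom.
        bottom = min(below) if below else page_height
        bboxes.append((header["x0"], header["top"], header["x1"], bottom))
    return bboxes


def table_bboxes_for_page(pdf_path, page_number):
    """Apply find_table_bboxes to one page of a PDF (page_number is 1-based)."""
    import pdfplumber  # imported lazily; only needed when reading a real PDF

    with pdfplumber.open(pdf_path) as pdf:
        page = pdf.pages[page_number - 1]
        return find_table_bboxes(page.rects, page.curves, page.height)
```

Each returned box can then be passed to `page.crop(...)` before table extraction. The width threshold will likely need tuning per document.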
Thanks for your kind help! As far as you know, is there a library that can do this perfectly? If not, how hard would it be to develop one with OpenCV? I need to develop paper-inspection software that mines the data automatically.
Hi, I want to extract the data tables from published scientific papers, but it does not seem to work because the tables don't have separators. Here is what the table looks like:
[image]
I have attached the test PDF: Bis(pyrrolidene) Schhiff Base Aluminum Comlexes as isoselective biased initiators for the controlled.pdf. Could this feature be supported in the future?