jsvine / pdfplumber

Plumb a PDF for detailed information about each char, rectangle, line, et cetera — and easily extract text and tables.
MIT License
6.57k stars 659 forks

How can I correctly extract a table without left and right vertical borders, so that the columns do not change in extract_table? #492

Closed youpengbo2018 closed 3 years ago

youpengbo2018 commented 3 years ago

Attachments: 2020.06月度数据(区).pdf and an image of the page.

This is the page that I want to extract, but the table on it has no vertical edges. I use extract_table's vertical_strategy: text to let the library find the edges. It does extract the data I want, but it also drops the blank cells in the other columns. I want to get a CSV file that looks the same as the picture (the table needs to show the blank cells). This is my code:

import pdfplumber
import pandas as pd

if __name__ == '__main__':
    list = []
    with pdfplumber.open(r'F:\work\南京\2020.06月度数据(区).pdf') as pdf:
        page = pdf.pages[8]
        for table in page.extract_tables(table_settings={"vertical_strategy": "lines", "horizontal_strategy": "lines", "keep_blank_chars": "False"}):
            tb = pd.DataFrame(table[1:], columns=table[0], index=None)
            print(tb)
            tb.to_csv(r'F:\work\南京\南京\test3.csv', index=False)

samkit-jain commented 3 years ago

Hi @youpengbo2018, I appreciate your interest in the library. You can use explicit_vertical_lines in combination with vertical_strategy=lines to specify the coordinates of the vertical line separators explicitly. For example, you can use the following table settings:

{
    "vertical_strategy": "lines",
    "horizontal_strategy": "lines",
    "snap_tolerance": 7,
    "explicit_vertical_lines": [Decimal(p.width) * Decimal('0.07'), Decimal(p.width) * Decimal('0.93')],
}

In explicit_vertical_lines, I have placed the first and last vertical line separators at 7% and 93% of the page's width (p here is the page object). If the page were 100 units wide, the X coordinates of the first and last vertical line separators would be 7 units and 93 units respectively.
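The percentage arithmetic can be sketched as a small helper. This is only an illustration: the function name edge_settings and its left/right parameters are hypothetical and not part of pdfplumber; only the settings keys come from the example above.

```python
from decimal import Decimal


def edge_settings(page_width, left=Decimal("0.07"), right=Decimal("0.93")):
    """Build table settings with explicit left and right vertical edges.

    edge_settings is a hypothetical helper, not a pdfplumber function.
    left and right are fractions of the page width.
    """
    # Go through str() so a float width does not carry binary rounding noise.
    width = Decimal(str(page_width))
    return {
        "vertical_strategy": "lines",
        "horizontal_strategy": "lines",
        "snap_tolerance": 7,
        "explicit_vertical_lines": [width * left, width * right],
    }


# For a page 100 units wide, the edges land at 7 and 93 units.
settings = edge_settings(100)
print(settings["explicit_vertical_lines"])
```

With a real page you would pass the page's width and hand the resulting dictionary to extract_table or extract_tables as table_settings. If your pdfplumber version reports the width as a float rather than a Decimal (worth checking against your install), plain float arithmetic may be the better fit, and the fractions may need tuning for other documents.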

Add from decimal import Decimal at the top to import Decimal. The output is shown in the attached image (page1).

youpengbo2018 commented 3 years ago

thank you! it works