How can I correctly extract a table that has no left and right vertical borders, so that the columns do not change in extract_table?

jsvine / pdfplumber

Plumb a PDF for detailed information about each char, rectangle, line, et cetera — and easily extract text and tables.

MIT License

6.57k stars 659 forks

2020.06月度数据（区）.pdf

This is the page that I want to extract, but the page doesn't have vertical edges. I used extract_table's vertical_strategy: text to let the system find the edges. It does extract the data I want, but it also ignores the blank cells in the other columns. I want to get a CSV file of the table that looks the same as the picture (the table needs to show the blanks). This is my code:

import pdfplumber
import pandas as pd

if __name__ == '__main__':
    with pdfplumber.open(r'F:\work\南京\2020.06月度数据（区）.pdf') as pdf:
        page = pdf.pages[8]
        table_settings = {
            "vertical_strategy": "lines",
            "horizontal_strategy": "lines",
            # Must be the boolean False; the string "False" is truthy.
            "keep_blank_chars": False,
        }
        for table in page.extract_tables(table_settings=table_settings):
            # The first row is the header; the remaining rows are the data.
            tb = pd.DataFrame(table[1:], columns=table[0], index=None)
            print(tb)
            tb.to_csv(r'F:\work\南京\南京\test3.csv', index=False)

Hi @youpengbo2018, I appreciate your interest in the library. You can use explicit_vertical_lines in combination with vertical_strategy "lines" to explicitly specify the x-coordinates of the vertical line separators. You can use the following table extraction settings as an example:

{
    "vertical_strategy": "lines",
    "horizontal_strategy": "lines",
    "snap_tolerance": 7,
    "explicit_vertical_lines": [Decimal(p.width) * Decimal('0.07'), Decimal(p.width) * Decimal('0.93')],
}

In explicit_vertical_lines, I have placed the first and last vertical line separators at 7% and 93% of the page's width, where p is the page object. If the page were 100 units wide, the x-coordinates of the first and last vertical line separators would be 7 units and 93 units respectively.
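As a rough sketch of the same idea, the settings can be built from the page width by a small helper. The names build_table_settings and extract_tables_with_borders are hypothetical and not part of pdfplumber. The sketch also assumes a recent pdfplumber release in which page.width is a plain float; on older releases that use Decimal coordinates, the Decimal arithmetic shown in the settings above may be needed instead.

```python
def build_table_settings(page_width, left_frac=0.07, right_frac=0.93,
                         snap_tolerance=7):
    """Build table settings with two explicit outer vertical lines.

    Hypothetical helper: left_frac and right_frac are fractions of the
    page width (0.07 and 0.93 mirror the 7% and 93% used in the reply).
    """
    if not 0 <= left_frac < right_frac <= 1:
        raise ValueError("fractions must satisfy 0 <= left < right <= 1")
    width = float(page_width)  # assumes float coordinates (newer pdfplumber)
    return {
        "vertical_strategy": "lines",
        "horizontal_strategy": "lines",
        "snap_tolerance": snap_tolerance,
        # x-coordinates of the missing left and right table borders
        "explicit_vertical_lines": [width * left_frac, width * right_frac],
    }


def extract_tables_with_borders(pdf_path, page_index):
    """Open a PDF and extract the tables on one page with the settings above."""
    import pdfplumber  # third-party; imported here so the helper above stays stdlib-only

    with pdfplumber.open(pdf_path) as pdf:
        page = pdf.pages[page_index]
        settings = build_table_settings(page.width)
        return page.extract_tables(table_settings=settings)
```

The 7% and 93% fractions only suit pages whose table actually spans that range, so they will likely need adjusting per document.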

Add from decimal import Decimal at the top of your script to import Decimal. The output will be

How can I extract table without left and right vertical border correctly,and the columns can not change in the extract_table #492