jsvine / pdfplumber

Plumb a PDF for detailed information about each char, rectangle, line, et cetera — and easily extract text and tables.
MIT License

problem when extracting table without horizontal line #339

Closed lijuanLin closed 3 years ago

lijuanLin commented 3 years ago

What are you trying to do?

I am trying to extract tables from PDF files, but when a table has no horizontal line above its first row, the content of that row is missing from the output.

What code are you using to do it?

import pdfplumber

pdf = pdfplumber.open("0.pdf")
p = pdf.pages[5]
ts = {"vertical_strategy": "lines", "horizontal_strategy": "lines"}
tables = p.extract_tables(table_settings=ts)
print(tables[0][0])

PDF file

http://static.cninfo.com.cn/finalpage/2021-01-22/1209162828.PDF

Expected behavior

['23', '华泰柏瑞基金管理有限\n公司', '华泰柏瑞中证红利低波动交易型\n开放式指数证券投资基金', 'D890156894', '800', '1,147', '5,528.54', 'A']

Actual behavior

ts = {"vertical_strategy": "lines", "horizontal_strategy": "lines"}
output: ['24', '华泰柏瑞基金管理有限\n公司', '华泰柏瑞MSCI中国A股国际通交\n易型开放式指数证券投资基金联\n接基金', 'D890151501', '800', '1,147', '5,528.54', 'A']

ts = {"vertical_strategy": "lines", "horizontal_strategy": "text"}
output: ['华泰柏瑞基金管理有限', '华泰柏瑞中证红利低波动交易型', '', '', '', '']

ts = {"vertical_strategy": "lines", "horizontal_strategy": "text", "intersection_y_tolerance": 15}
output: ['华泰柏瑞基金管理有限', '华泰柏瑞中证红利低波动交易型', '', '', '', '']

I am new to pdfplumber and hope you can help me solve this.

Thank you in advance!

jsvine commented 3 years ago

Hi @lijuanLin, and thanks for your interest in this library. In cases like this, I typically calculate the topmost position of all available lines to determine where the top horizontal line should be, and then pass that positioning via the explicit_horizontal_lines setting. For instance, for your example:

tables = p.extract_tables(table_settings={
    "vertical_strategy": "lines",
    "horizontal_strategy": "lines",
    "explicit_horizontal_lines": [ min(x["top"] for x in p.edges) ],
})
print(tables[0][0])

(Here we use p.edges instead of p.lines because the things that look like lines in your PDF are actually just very thin rectangles, which p.edges will separate into their constituent lines.)
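To illustrate why a thin rectangle still yields usable horizontal positions, here is a rough sketch of splitting one rectangle into four border edges. It is not pdfplumber's own implementation; the function name `rect_to_edges_sketch` and the sample coordinates are made up for this example, and only the `x0`, `x1`, `top`, and `bottom` keys are assumed.

```python
def rect_to_edges_sketch(rect):
    """Split one rectangle into its four border edges (illustrative only)."""
    x0, x1 = rect["x0"], rect["x1"]
    top, bottom = rect["top"], rect["bottom"]
    return [
        # Two horizontal edges: the rectangle's top and bottom sides.
        {"orientation": "h", "x0": x0, "x1": x1, "top": top, "bottom": top},
        {"orientation": "h", "x0": x0, "x1": x1, "top": bottom, "bottom": bottom},
        # Two vertical edges: the rectangle's left and right sides.
        {"orientation": "v", "x0": x0, "x1": x0, "top": top, "bottom": bottom},
        {"orientation": "v", "x0": x1, "x1": x1, "top": top, "bottom": bottom},
    ]


# A hypothetical 0.5pt-tall rectangle that renders as a horizontal rule.
thin_rect = {"x0": 50.0, "x1": 550.0, "top": 100.0, "bottom": 100.5}
edges = rect_to_edges_sketch(thin_rect)
horizontal_tops = sorted(e["top"] for e in edges if e["orientation"] == "h")
print(horizontal_tops)
```

Because every edge carries a "top" value, taking the minimum over all edges gives the highest point that any ruled element reaches, whether it is a horizontal or a vertical edge.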

As long as the only lines/edges/rectangles on the page belong to the table, this should work. But if there are additional lines/edges/rectangles on the page, then you will have to adjust the approach slightly to exclude them.
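One possible way to exclude unrelated edges is to consider only those that fall inside a region you know contains the table. The sketch below works on plain dicts shaped like the entries of p.edges (only `x0`, `x1`, `top`, and `bottom` are used); the helper name `table_top`, the sample edges, and the region coordinates are all hypothetical and would need adjusting for a real page.

```python
def table_top(edges, region=None):
    """Return the smallest "top" among edges lying fully inside `region`.

    edges:  list of dicts with "x0", "x1", "top", "bottom" keys.
    region: optional (x0, top, x1, bottom) tuple; None means use all edges.
    """
    def inside(edge):
        if region is None:
            return True
        rx0, rtop, rx1, rbottom = region
        return (
            edge["x0"] >= rx0 and edge["x1"] <= rx1
            and edge["top"] >= rtop and edge["bottom"] <= rbottom
        )

    candidates = [e for e in edges if inside(e)]
    if not candidates:
        raise ValueError("no edges found inside the given region")
    return min(e["top"] for e in candidates)


# Made-up data: a page-header rule plus two edges belonging to the table.
sample_edges = [
    {"x0": 40.0, "x1": 560.0, "top": 30.0, "bottom": 30.0},    # header rule
    {"x0": 50.0, "x1": 50.0, "top": 120.0, "bottom": 700.0},   # table left border
    {"x0": 50.0, "x1": 550.0, "top": 180.0, "bottom": 180.0},  # first ruled row
]
table_region = (0.0, 100.0, 600.0, 800.0)  # assumed area below the page header

print(table_top(sample_edges))                # header rule is included
print(table_top(sample_edges, table_region))  # header rule is excluded
```

With a real page, the result of table_top(p.edges, region) could be passed as the single entry of "explicit_horizontal_lines" in place of min(x["top"] for x in p.edges).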

lijuanLin commented 3 years ago

Hi jsvine, thanks for your help. Your code works well for me, so I will close this issue. Thank you.