Closed lijuanLin closed 3 years ago
Hi @lijuanLin, and thanks for your interest in this library. In cases like this, I typically calculate the topmost position of all available lines to determine where the top horizontal line should be, and then pass that positioning via the explicit_horizontal_lines setting. For instance, for your example:
tables = p.extract_tables(table_settings={
    "vertical_strategy": "lines",
    "horizontal_strategy": "lines",
    "explicit_horizontal_lines": [min(x["top"] for x in p.edges)],
})
print(tables[0][0])
(Here we use p.edges instead of p.lines because the things that look like lines in your PDF are actually just very thin rectangles, which p.edges will separate into their constituent lines.)
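To illustrate the idea of a thin rectangle being separated into constituent lines, here is a simplified sketch. The helper name rect_to_edge_dicts and the sample coordinates are made up for illustration, and this is not pdfplumber's actual implementation; it only assumes that objects are dicts with x0, x1, top, and bottom keys.

```python
def rect_to_edge_dicts(rect):
    """Split one rectangle into its four boundary edges (simplified illustration)."""
    x0, x1, top, bottom = rect["x0"], rect["x1"], rect["top"], rect["bottom"]
    return [
        # Two horizontal edges: the rectangle's top and bottom sides.
        {"x0": x0, "x1": x1, "top": top, "bottom": top, "orientation": "h"},
        {"x0": x0, "x1": x1, "top": bottom, "bottom": bottom, "orientation": "h"},
        # Two vertical edges: the rectangle's left and right sides.
        {"x0": x0, "x1": x0, "top": top, "bottom": bottom, "orientation": "v"},
        {"x0": x1, "x1": x1, "top": top, "bottom": bottom, "orientation": "v"},
    ]


# A hypothetical "line" that is really a rectangle only 0.5 units tall.
thin_rect = {"x0": 50.0, "x1": 550.0, "top": 100.0, "bottom": 100.5}
edges = rect_to_edge_dicts(thin_rect)
horizontal_tops = [e["top"] for e in edges if e["orientation"] == "h"]
```

With edges in this form, taking the minimum "top" value gives the upper boundary of the drawn shape, which is the value the snippet above passes as an explicit horizontal line.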
As long as the only lines/edges/rectangles on the page belong to the table, this should work. But if there are additional lines/edges/rectangles on the page, then you will have to adjust the approach slightly to exclude them.
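One possible way to exclude unrelated lines is to keep only the edges that fall inside a region you know contains the table, and take the topmost position among those. The sketch below is only an illustration: topmost_edge_in_region is a hypothetical helper, and the region and sample coordinates are invented. It assumes edges are dicts with x0, x1, top, and bottom keys.

```python
def topmost_edge_in_region(edges, region):
    """Return the smallest "top" among edges lying fully inside region.

    region is (x0, top, x1, bottom); this helper is hypothetical, not part of pdfplumber.
    """
    x0, top, x1, bottom = region
    inside = [
        e for e in edges
        if e["x0"] >= x0 and e["x1"] <= x1 and e["top"] >= top and e["bottom"] <= bottom
    ]
    if not inside:
        raise ValueError("no edges fall inside the given region")
    return min(e["top"] for e in inside)


# Invented sample data: one rule in the page header plus two table edges.
sample_edges = [
    {"x0": 40.0, "x1": 560.0, "top": 30.0, "bottom": 30.0},    # header rule, outside the region
    {"x0": 50.0, "x1": 550.0, "top": 120.0, "bottom": 120.0},  # first horizontal rule in the table
    {"x0": 50.0, "x1": 50.0, "top": 100.0, "bottom": 400.0},   # table's left border
]
table_region = (45.0, 90.0, 555.0, 420.0)  # assumed area containing the table
top_line = topmost_edge_in_region(sample_edges, table_region)
```

The resulting value would then go into the settings as "explicit_horizontal_lines": [top_line], with p.edges supplied in place of sample_edges. How you pick the region depends on the page layout.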
Hi jsvine, thanks for your help. Your code works well for me, so I will close this issue. Thank you.
What are you trying to do?
I am trying to extract tables from PDF files, but when a table has no horizontal line above its first row, the content of that first row is missing.
What code are you using to do it?
import pdfplumber
pdf = pdfplumber.open("0.pdf")
p = pdf.pages[5]
ts = {"vertical_strategy": "lines", "horizontal_strategy": "lines"}
tables = p.extract_tables(table_settings=ts)
print(tables[0][0])
PDF file
http://static.cninfo.com.cn/finalpage/2021-01-22/1209162828.PDF
Expected behavior
['23', '华泰柏瑞基金管理有限\n公司', '华泰柏瑞中证红利低波动交易型\n开放式指数证券投资基金', 'D890156894', '800', '1,147', '5,528.54', 'A']
Actual behavior
ts = {"vertical_strategy": "lines", "horizontal_strategy": "lines"}
output: ['24', '华泰柏瑞基金管理有限\n公司', '华泰柏瑞MSCI中国A股国际通交\n易型开放式指数证券投资基金联\n接基金', 'D890151501', '800', '1,147', '5,528.54', 'A']
ts = {"vertical_strategy": "lines", "horizontal_strategy": "text"}
output: ['华泰柏瑞基金管理有限', '华泰柏瑞中证红利低波动交易型', '', '', '', '']
ts = {"vertical_strategy": "lines", "horizontal_strategy": "text", "intersection_y_tolerance": 15}
output: ['华泰柏瑞基金管理有限', '华泰柏瑞中证红利低波动交易型', '', '', '', '']
I am new to pdfplumber and hope you can help me solve this.
Thank you in advance!