jsvine / pdfplumber

Plumb a PDF for detailed information about each char, rectangle, line, et cetera — and easily extract text and tables.
MIT License
6.78k stars 678 forks

Extract table merged cells #979

Open John-Peter-R opened 1 year ago

John-Peter-R commented 1 year ago

Please describe, in as much detail as possible, your proposal and how it would improve your experience with pdfplumber.

When extracting tables from a PDF, some PDFs contain merged cells. In that case, the table extraction method fails to extract the merged cells in merged form. The quality of merged-cell extraction needs to be improved.

samkit-jain commented 1 year ago

Hi @John-Peter-R, I appreciate your interest in the library. Could you please provide an example PDF, the output you are getting, and the output you expected?

John-Peter-R commented 1 year ago

Thanks for your response. I am looking for a generic way to extract tables from PDFs regardless of table structure, since a PDF may contain many different kinds of merged cells. So I am researching a generic approach.

samkit-jain commented 1 year ago

One thing you could try as a generic way of handling merged cells in tables:

  1. Find a table.
  2. Reject all horizontal lines that don't span the table's full width and all vertical lines that don't span its full height. That way, you'll discard the lines that subdivide a merged cell, and instead of getting 2 cells, you'll get a single cell.

If I have misunderstood your requirement, please provide additional information with examples.
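The two steps above could be sketched roughly as follows. This is only an illustration, not pdfplumber's own method: `full_span_separators` is a hypothetical helper name, the 3-point tolerance is an arbitrary assumption, and the input is assumed to be a list of dicts with `x0`, `x1`, `top`, and `bottom` keys, which is the shape pdfplumber uses for entries in `page.lines` and `page.edges`.

```python
def full_span_separators(edges, table_bbox, tolerance=3.0):
    """Keep only rulings that cross the whole table (hypothetical helper).

    edges: iterable of dicts with x0, x1, top, bottom keys.
    table_bbox: (x0, top, x1, bottom) of the table found in step 1.
    Returns (xs, ys): sorted x positions of full-height vertical lines
    and sorted y positions of full-width horizontal lines.
    """
    tx0, ttop, tx1, tbottom = table_bbox
    xs, ys = set(), set()
    for e in edges:
        width = e["x1"] - e["x0"]
        height = e["bottom"] - e["top"]
        if height >= width:
            # Treated as vertical: must lie within the table horizontally
            # and reach from the table's top to its bottom.
            inside = tx0 - tolerance <= e["x0"] <= tx1 + tolerance
            spans = e["top"] <= ttop + tolerance and e["bottom"] >= tbottom - tolerance
            if inside and spans:
                xs.add(round(e["x0"], 1))
        else:
            # Treated as horizontal: must lie within the table vertically
            # and reach from the table's left edge to its right edge.
            inside = ttop - tolerance <= e["top"] <= tbottom + tolerance
            spans = e["x0"] <= tx0 + tolerance and e["x1"] >= tx1 - tolerance
            if inside and spans:
                ys.add(round(e["top"], 1))
    return sorted(xs), sorted(ys)
```

The resulting coordinates could then be tried with the explicit strategies, e.g. `page.extract_table({"vertical_strategy": "explicit", "horizontal_strategy": "explicit", "explicit_vertical_lines": xs, "explicit_horizontal_lines": ys})`, with `table_bbox` taken from `page.find_tables()[0].bbox`. Note that this keeps only the coarsest grid: every partial ruling is dropped, not just the ones inside a merged cell.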

Pk13055 commented 1 year ago

@John-Peter-R As far as I have tested, the library, in its current state, is already able to extract tables with merged cells.

yoursock commented 4 months ago

H2_AN202404251631316496_1.pdf

Here's an example; you can use page 8 as a test. The picture is here: [image]

There are only 9 columns in this table, but 15 columns were extracted instead. The extracted table is:

[[['持股5%以上股东、前10名股东及前10名无限售流通股股东参与转融通业务出借股份情况', None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None], ['股东名称\n(全称)', '', '期初普通账户、信用账户持', None, None, '', '', '期初转融通出借股份', None, None, None, '', '', '期末普通账户、信用账户', None, None, '', '', '期末转融通出借股份', None, None, None, ''], [None, None, '股', None, None, None, None, '且尚未归还', None, None, None, None, None, '持股', None, None, None, None, '且尚未归还', None, None, None, None], [None, '数量合计', None, '', '占总股本的', '', '', '数量', '', '', '占总股本', '', '数量合计', None, '', '占总股本', '', '', '数量', '', '', '占总股本', ''], [None, None, None, None, '比例', None, None, '合计', None, None, '的比例', None, None, None, None, '的比例', None, None, '合计', None, None, '的比例', None], ['华润东阿阿\n胶有限公司', '151,351,731', None, '23.50%', None, None, '0', None, None, '0.00%', None, None, '151,351,731', None, '23.50%', None, None, '0', None, None, '0.00%', None, None], ['香港中央结\n算有限公司', '72,926,439', None, '11.32%', None, None, '0', None, None, '0.00%', None, None, '63,067,676', None, '9.79%', None, None, '0', None, None, '0.00%', None, None], ['华润医药投\n资有限公司', '57,935,116', None, '9.00%', None, None, '0', None, None, '0.00%', None, None, '57,935,116', None, '9.00%', None, None, '0', None, None, '0.00%', None, None], ['中国工商银\n行股份有限\n公司-中欧\n医疗健康混\n合型证券投\n资基金', '11,823,465', None, '1.84%', None, None, '0', None, None, '0.00%', None, None, '21,508,141', None, '3.34%', None, None, '0', None, None, '0.00%', None, None], ['中国建设银\n行股份有限\n公司-工银\n瑞信前沿医\n疗股票型证\n券投资基金', '10,000,022', None, '1.55%', None, None, '0', None, None, '0.00%', None, None, '11,300,020', None, '1.75%', None, None, '0', None, None, '0.00%', None, None], ['中国农业银\n行股份有限\n公司-易方\n达消费行业\n股票型证券\n投资基金', '12,798,173', None, '1.99%', None, None, '0', None, None, '0.00%', None, None, '8,887,373', None, '1.38%', None, None, '0', None, None, '0.00%', None, None], ['张弦', '8,232,033', None, '1.28%', None, None, '0', None, None, '0.00%', None, None, '8,232,033', None, '1.28%', None, None, '0', None, None, '0.00%', None, None], ['中国农业银\n行股份有限\n公司-嘉实\n新兴产业股\n票型证券投\n资基金', '7,113,293', None, '1.10%', None, None, '0', None, None, '0.00%', None, None, '7,677,893', None, '1.19%', None, None, '0', None, None, '0.00%', None, None], ['中国农业银\n行股份有限\n公司-嘉实\n核心成长混\n合型证券投\n资基金', '5,824,900', None, '0.90%', None, None, '0', None, None, '0.00%', None, None, '6,206,300', None, '0.96%', None, None, '0', None, None, '0.00%', None, None], ['中国农业银\n行股份有限\n公司-中证\n500交易型开\n放式指数证\n券投资基金', '2,514,400', None, '0.39%', None, None, '730,900', None, None, '0.11%', None, None, '5,124,095', None, '0.80%', None, None, '513,900', None, None, '0.08%', None, None]], [['股东名称'], ['(全称)']]]

jsvine commented 3 months ago

Hi @yoursock, if you run page.to_image().debug_tablefinder(...), you'll see that there are some hidden lines in the header:

[image: tmp]
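For anyone who wants to reproduce that kind of debug image, a minimal sketch might look like the following. `save_tablefinder_debug` is a hypothetical wrapper, and the default resolution of 150 and the output filename are assumptions; the only pdfplumber calls it relies on are `page.to_image()`, `debug_tablefinder()`, and `save()` on the resulting image.

```python
def save_tablefinder_debug(page, out_path, table_settings=None, resolution=150):
    """Render a page with the table finder's lines and cells overlaid.

    page: a pdfplumber page object (anything exposing to_image()).
    out_path: where to write the PNG.
    table_settings: the same dict you would pass to extract_tables().
    """
    im = page.to_image(resolution=resolution)
    # Draws the detected edges, intersections, and cells on the image.
    im.debug_tablefinder(table_settings or {})
    im.save(out_path)
    return out_path


# Possible usage, assuming pdfplumber is installed and the PDF from the
# comment above is available locally (page 8 is index 7):
#
#   import pdfplumber
#   with pdfplumber.open("H2_AN202404251631316496_1.pdf") as pdf:
#       save_tablefinder_debug(pdf.pages[7], "page8_debug.png")
```

Running it again with different `table_settings` makes it easy to compare how each setting changes the detected lines.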

You can use some of the strategies described here to deal with this issue: