jsvine / pdfplumber

Plumb a PDF for detailed information about each char, rectangle, line, et cetera — and easily extract text and tables.
MIT License

Unable to extract text from table #431

Closed xhdavid closed 3 years ago

xhdavid commented 3 years ago

Describe the bug

The table is detected correctly in the visual debugging output, but the text inside the table is not extracted.

Code to reproduce the problem

The code is as follows:

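import pandas as pd
import pdfplumber

pdf = pdfplumber.open(file_path)
p0 = pdf.pages[0]
# extract_tables() returns a list of tables; each table is a list of rows
tables = p0.extract_tables()
table = tables[0]
# Use the first row as the header and the remaining rows as data
pd.DataFrame(table[1:], columns=table[0])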
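To confirm the symptom without relying on screenshots, one option is to check whether any extracted cell actually contains text. The following is a minimal sketch, not part of the original report: `table_has_text` and `summarize_tables` are hypothetical helper names, and it assumes each table has the shape returned by `extract_tables()` (a list of rows whose cells are strings or `None`).

```python
def table_has_text(table):
    """Return True if any cell in a table (a list of rows) holds non-blank text."""
    return any(
        isinstance(cell, str) and cell.strip()
        for row in table
        for cell in row
    )


def summarize_tables(file_path, page_index=0):
    """Print the size of each table on a page and whether any cell has text."""
    # Imported here so table_has_text can be used without pdfplumber installed.
    import pdfplumber

    with pdfplumber.open(file_path) as pdf:
        page = pdf.pages[page_index]
        for i, table in enumerate(page.extract_tables()):
            n_cols = max((len(row) for row in table), default=0)
            print(
                f"table {i}: {len(table)} rows x {n_cols} cols, "
                f"has text: {table_has_text(table)}"
            )
```

Calling `summarize_tables(file_path)` on the attached PDF should then show directly whether the table grid was found while its cells came back empty.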

PDF file

Order.pdf

Actual behavior

image

Screenshots

image image

Environment

Additional context

I don't know what the problem is. Looking forward to your help.

samkit-jain commented 3 years ago

Hi @xhdavid, I appreciate your interest in the library. This PDF uses the STSONG font, and the issue appears to be a duplicate of #332. If the text isn't recognised when you run pdfminer.six's pdf2txt as described in https://pdfminersix.readthedocs.io/en/latest/tutorial/commandline.html#pdf2txt-py, I would recommend raising an issue over at https://github.com/pdfminer/pdfminer.six/issues. If it is recognised, then please reopen this issue.

xhdavid commented 3 years ago

@samkit-jain Hi, I just tested with pdf2txt.py, and the text is recognized.

image

samkit-jain commented 3 years ago

@xhdavid That's weird. Can you tell me how you ran the script and how you installed pdfminer.six? When I run the script on my machine, I get the following result:

正常紧急交货日期 Delivery Date:窗口时间 Window Time:交货道口 Delivery Dock:物流协调员 Follow Up:YFAI电话 Telephone:YFAI传真 Fax:Supply Portal包装UCs零件数Quantity包装UCs零件数Quantity包装UCs零件数Quantity实际到货时间:YFAI发单人签字:供应商签字:Actual Delivery Time:YF Order Issuer Signature:Supplier Signature:1. 供应商交货后,请务必在要货单上签字,以确认实收数量,检查并带走YFAI收货单; YFAI保留开箱点数索赔权利。2. YFAI收货单是被YFAI承认的采购收货对帐凭证,请供应商妥善保存。承运商 Shipper:注:“打勾”为本车零件,未“打勾”为其他车辆零件。到车时间 Arrival Time: * 我已阅读YFAI的安全告知!离开时间 Departure Time:实收 Received批号/备注Lot/Remarks 包装合计:需求 Request系统号 SysCode:版本号 Version:No.零件号Item Code供应商零件号Supplier Ref.描述Description单位Unit单包装UC发货数 Issued供应商联系人 Contact:供应商电话 Telephone:备注 Remarks:供应商代码 Supplier Code:供应商名称 Supplier Name:供应商地址 Address:客户采购单号:ASN 号:创建时间 Create Time:YFAISH-MS04-07-01送货单Delivery Notes

which is the same as what pdfplumber extracts.
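The comparison discussed in this thread (pdfminer.six output versus pdfplumber output) can also be run from Python rather than by eye. This is only a sketch under assumptions: `missing_chars` and `compare_extractors` are hypothetical helper names, the comparison ignores character order and whitespace because the two tools may lay text out differently, and it uses `extract_text` from `pdfminer.high_level` instead of the pdf2txt script mentioned above.

```python
def missing_chars(reference, candidate):
    """Return the non-whitespace characters found in reference but not in candidate."""
    present = set(candidate or "")
    return sorted(
        {ch for ch in (reference or "") if not ch.isspace() and ch not in present}
    )


def compare_extractors(file_path):
    """Report characters that one extractor finds and the other does not."""
    # Imported here so missing_chars can be used without either library installed.
    from pdfminer.high_level import extract_text
    import pdfplumber

    miner_text = extract_text(file_path)
    with pdfplumber.open(file_path) as pdf:
        # extract_text() may yield nothing for a page, so fall back to "".
        plumber_text = "\n".join(page.extract_text() or "" for page in pdf.pages)

    return {
        "only_in_pdfminer": missing_chars(miner_text, plumber_text),
        "only_in_pdfplumber": missing_chars(plumber_text, miner_text),
    }
```

If both lists come back empty, the two tools recognise the same set of characters, which would point to the underlying pdfminer.six installation or version rather than to pdfplumber.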