jsvine / pdfplumber

Plumb a PDF for detailed information about each char, rectangle, line, et cetera — and easily extract text and tables.
MIT License

extract_words() does not return some of the text #956

Open fangjiyuan opened 1 year ago

fangjiyuan commented 1 year ago

Describe the bug

When I use extract_words(), some text is missing from the result, for example the phone number.

Have you tried repairing the PDF?

Yes, but it did not work.

PDF file

* https __order.crm.hcp.telecom.sd_rest_order_crm_BSS3_prd_receipt_2023_0718_730017088912.pdf

cmdlineluser commented 1 year ago

Hi @fangjiyuan - can you perhaps show what should be extracted exactly?

I see one long number in the text:

>>> [ word for word in words if '193' in word['text'] ]
[{'text': '19306498777',
  'x0': 339.75,
  'x1': 389.25,
  'top': 143.61997000000008,
  'doctop': 143.61997000000008,
  'bottom': 152.61997000000008,
  'upright': True,
  'direction': 1}]

But I don't understand the language to know if that is the phone number or not.
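For anyone repeating this check across a whole file, the filter above can be wrapped in a small helper. This is only a sketch: `words_containing` and `search_pdf` are hypothetical names, not part of pdfplumber, and the code assumes each word is a dict with a `'text'` key, as in the output shown above.

```python
def words_containing(words, needle):
    """Return the word dicts whose 'text' value contains `needle`."""
    return [w for w in words if needle in w.get("text", "")]


def search_pdf(path, needle):
    """Yield (page_number, word) for every matching word in the PDF at `path`."""
    # Imported lazily so that words_containing() can be used without pdfplumber.
    import pdfplumber

    with pdfplumber.open(path) as pdf:
        for page in pdf.pages:
            for word in words_containing(page.extract_words(), needle):
                yield page.page_number, word
```

Calling `list(search_pdf("example.pdf", "193"))` (with a placeholder file name) would show on which page, if any, the number turns up.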

fangjiyuan commented 1 year ago

I am sorry about that. I tried saving one of the PDFs, and that may have changed the PDF's format. Could you have a look at this example PDF? What I want to get is '19306498777'.

https __order.crm.hcp.telecom.sd_rest_order_crm_BSS3_prd_receipt_2023_0718_730017088912.pdf @cmdlineluser

cmdlineluser commented 1 year ago

Hm, ok well I don't see a number in this new example:

But the red outlined area seems to be present in .extract_text()?

Screen Shot 2023-08-04 at 17 07 59
>>> print(page.extract_text())
防范打击通讯信息诈骗告知书
为进一步加大防范、治理通讯信息诈骗工作力度,加强开卡、用卡管控,特告知如下:
1.贩卖电话卡是违法行为,任何人不得将本人的电话卡转卖、转借、转租给他人,如将号码用于通信诈骗等违法活动,依照
《中华人民共和国刑事诉讼法》相关规定,公安机关将以帮助信息网络犯罪活动罪严厉处理,并纳入失信黑名单,同时需承担相
应法律责任。
2.用户存在以下通信异常疑似诈骗的,暂停通信服务:
(1)开卡后漫游至电信网络诈骗高危地且通信行为异常的;
(2)经他人投诉有诈骗、骚扰行为,一经核实的;
(3)频繁换机插卡或频繁补换卡;
(4)公安机关提供的涉案或高风险人员开办的号卡。
fangjiyuan commented 1 year ago

I can get this text too. But page 2 of the PDF contains the phone number, inside the red outlined area, and that number cannot be found. 截图_选择区域_20230805091237

cmdlineluser commented 1 year ago

Ah okay. So the problem is on page 2 of the updated PDF.

Screen Shot 2023-08-05 at 12 03 47

Yes, I get the same behaviour.

It appears none of the numbers inside the [] are detected.

>>> page2 = pdf.pages[1]
>>> print(page2.extract_text())
中国电信号码优享业务协议
甲方(用户): (以下简称:甲方)
乙 方:中国电信股份有限公司 分公司(以下简称:乙方)
鉴于甲、乙双方已经签订《中国电信用户入网协议》,甲方基于对乙方移动通信服务的了解和需求,自愿选择使用乙方的号
码优享业务。为维护双方权益,根据相关法律、法规的规定,在平等、自愿、公平、诚实信用的基础上,甲乙双方就号码优享业
务及相关事宜达成以下协议,共同遵照执行。
一、甲方自愿 选择 套餐、 办理号码优享业务,并按照该套餐以及号码优享业务的规则享有相应的权利、
承担相应的义务。
二、就甲方办理的号码优享业务,甲方有权使用乙方提供的优享号码[ ], 并承诺接受以下业务规则:
1. 预存话费:甲方于本协议生效之日 预存[ ]元 至优享号码的通信费账户,该费用仅用于抵扣本协议约定的优享号

That helps with debugging things - thank you.
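A quick way to list every affected spot is to scan the extracted text for brackets that contain nothing but whitespace. This is a rough sketch: `lines_with_empty_brackets` is a hypothetical helper, and it assumes the placeholders use ASCII `[` and `]` as in the output above.

```python
import re

# A pair of square brackets with only whitespace (or nothing) between them.
EMPTY_BRACKETS = re.compile(r"\[\s*\]")


def lines_with_empty_brackets(text):
    """Return (line_number, line) pairs for lines containing empty brackets."""
    return [
        (number, line)
        for number, line in enumerate(text.splitlines(), start=1)
        if EMPTY_BRACKETS.search(line)
    ]
```

Running it on `page2.extract_text()` should flag the two bracketed lines shown above, which gives a simple before/after check when trying different extraction settings.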

jsvine commented 1 year ago

FWIW, it seems that the same text is not copy/paste-able from a standard PDF viewer into a plain text editor. I haven't examined the potential cause closely, but this suggests the issue might not be solvable via pdfplumber.

It's possible (but just a guess) that this might be caused by glyph remappings; cf. https://github.com/jsvine/pdfplumber/discussions/851#discussioncomment-5556104
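One way to probe that guess is to summarise the page's characters per font and see whether any font yields only blank or unmapped characters. The sketch below makes several assumptions: `font_summary` is a hypothetical helper, each char is a dict with `'text'` and `'fontname'` keys (as in `page.chars`), and unmapped glyphs, if they are emitted at all, show up as `(cid:N)` placeholders. If the glyphs are dropped entirely, the summary will not reveal them.

```python
import re

# Placeholder text of the form "(cid:123)", assumed to mark an unmapped glyph.
CID_PLACEHOLDER = re.compile(r"^\(cid:\d+\)$")


def font_summary(chars):
    """Count total, blank, and (cid:N) characters for each font name."""
    summary = {}
    for ch in chars:
        text = ch.get("text", "")
        stats = summary.setdefault(
            ch.get("fontname", "?"), {"total": 0, "blank": 0, "cid": 0}
        )
        stats["total"] += 1
        if CID_PLACEHOLDER.match(text):
            stats["cid"] += 1
        elif not text.strip():
            stats["blank"] += 1
    return summary
```

For example, `font_summary(pdf.pages[1].chars)` on the page discussed above would show whether one font accounts for the blank characters, which would be consistent with (though not proof of) a remapping problem.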