When I use extract_words(), some text is missing

fangjiyuan commented 1 year ago

Describe the bug

When I use extract_words(), some of the text is not returned, for example the phone number.

Have you tried repairing the PDF?

Yes, it did not work.

PDF file

* https __order.crm.hcp.telecom.sd_rest_order_crm_BSS3_prd_receipt_2023_0718_730017088912.pdf

Environment

pdfplumber version: [0.10.2]
Python version: [3.7.3]
OS: [Linux]


cmdlineluser commented 1 year ago

Hi @fangjiyuan - can you perhaps show what should be extracted exactly?

I see one long number in the text:

>>> [ word for word in words if '193' in word['text'] ]
[{'text': '19306498777',
  'x0': 339.75,
  'x1': 389.25,
  'top': 143.61997000000008,
  'doctop': 143.61997000000008,
  'bottom': 152.61997000000008,
  'upright': True,
  'direction': 1}]

But I don't understand the language to know if that is the phone number or not.
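
For anyone reproducing this, here is a minimal sketch of how the `words` list above might be built and filtered. The helper names `words_containing` and `extract_page_words` are made up for illustration, and the sample list is hand-written to mimic the shape of `extract_words()` output; only `pdfplumber.open`, `pdf.pages`, and `extract_words()` are real pdfplumber calls.

```python
def words_containing(words, needle):
    """Return the word dicts whose 'text' contains `needle`."""
    return [w for w in words if needle in w["text"]]


def extract_page_words(pdf_path, page_index=0):
    """Open a PDF and return extract_words() for one page (hypothetical helper)."""
    import pdfplumber  # imported lazily so words_containing() works without it

    with pdfplumber.open(pdf_path) as pdf:
        return pdf.pages[page_index].extract_words()


# Hand-written sample shaped like extract_words() output (not real PDF data,
# apart from the number quoted in this thread).
sample_words = [
    {"text": "19306498777", "x0": 339.75, "x1": 389.25},
    {"text": "receipt", "x0": 20.0, "x1": 60.0},
]
matches = words_containing(sample_words, "193")
```

With a real file, `words_containing(extract_page_words("some.pdf"), "193")` would do the same thing as the one-liner in the session above.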

fangjiyuan commented 1 year ago

I am sorry about that. I saved one of the PDFs, and that may have changed the PDF's format. Can you have a look at this PDF example? What I want to get is '19306498777'.

https __order.crm.hcp.telecom.sd_rest_order_crm_BSS3_prd_receipt_2023_0718_730017088912.pdf @cmdlineluser

cmdlineluser commented 1 year ago

Hm, ok well I don't see a number in this new example:

But the red outlined area seems to be present in .extract_text()?

>>> print(page.extract_text())
防范打击通讯信息诈骗告知书
为进一步加大防范、治理通讯信息诈骗工作力度，加强开卡、用卡管控，特告知如下：
1.贩卖电话卡是违法行为，任何人不得将本人的电话卡转卖、转借、转租给他人，如将号码用于通信诈骗等违法活动，依照
《中华人民共和国刑事诉讼法》相关规定，公安机关将以帮助信息网络犯罪活动罪严厉处理，并纳入失信黑名单，同时需承担相
应法律责任。
2.用户存在以下通信异常疑似诈骗的，暂停通信服务：
（1）开卡后漫游至电信网络诈骗高危地且通信行为异常的；
（2）经他人投诉有诈骗、骚扰行为，一经核实的；
（3）频繁换机插卡或频繁补换卡；
（4）公安机关提供的涉案或高风险人员开办的号卡。

fangjiyuan commented 1 year ago

I can get this text too. But page 2 of the PDF contains a phone number, in the red outlined area, that can't be found.

cmdlineluser commented 1 year ago

Ah okay. So the problem is on page 2 of the updated PDF.

Yes, I get the same behaviour.

It appears none of the numbers inside the [] are detected.

>>> page2 = pdf.pages[1]
>>> print(page2.extract_text())
中国电信号码优享业务协议
甲方（用户）： （以下简称：甲方）
乙 方：中国电信股份有限公司 分公司（以下简称：乙方）
鉴于甲、乙双方已经签订《中国电信用户入网协议》，甲方基于对乙方移动通信服务的了解和需求，自愿选择使用乙方的号
码优享业务。为维护双方权益，根据相关法律、法规的规定，在平等、自愿、公平、诚实信用的基础上，甲乙双方就号码优享业
务及相关事宜达成以下协议，共同遵照执行。
一、甲方自愿 选择 套餐、 办理号码优享业务，并按照该套餐以及号码优享业务的规则享有相应的权利、
承担相应的义务。
二、就甲方办理的号码优享业务，甲方有权使用乙方提供的优享号码[ ]， 并承诺接受以下业务规则：
1. 预存话费：甲方于本协议生效之日 预存[ ]元 至优享号码的通信费账户，该费用仅用于抵扣本协议约定的优享号

That helps with debugging things - thank you.
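
To spot this symptom across many pages or files, one option is to scan the extracted text for empty bracket pairs. This is only a rough sketch: `lines_with_empty_brackets` is a made-up helper, the sample text is hand-written, and it assumes the missing values always sit inside ASCII `[` `]` as in the output above.

```python
import re

# Matches "[]" or "[ ]" (brackets with nothing but whitespace inside).
EMPTY_BRACKETS = re.compile(r"\[\s*\]")


def lines_with_empty_brackets(text):
    """Return (line_number, line) pairs for lines containing empty brackets."""
    return [
        (number, line)
        for number, line in enumerate(text.splitlines(), start=1)
        if EMPTY_BRACKETS.search(line)
    ]


# Hand-written sample, not real extract_text() output.
sample_text = "\n".join(
    [
        "number [ ] assigned",
        "number [19306498777] assigned",
        "prepaid [] amount",
    ]
)
flagged = lines_with_empty_brackets(sample_text)
```

In practice the input would be `page2.extract_text()`; any flagged line points at a value that the extraction did not pick up.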

jsvine commented 1 year ago

FWIW, it seems that the same text is not copy/paste-able from a standard PDF viewer into a plain text editor. I haven't examined the potential cause closely, but this suggests the issue might not be solvable via pdfplumber.

It's possible (but just a guess) that this is caused by glyph remappings; cf. https://github.com/jsvine/pdfplumber/discussions/851#discussioncomment-5556104
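
One way to look into that guess is to inspect the character objects inside the affected region. The sketch below works on a list of dicts shaped like `page.chars` (each with `text`, `fontname`, `x0`, `x1`, `top`, `bottom`). The helper names, the sample data, and the bounding box are hypothetical. It also assumes that pdfminer.six, which pdfplumber builds on, renders glyphs with no Unicode mapping as `(cid:N)`; treat that as an assumption to verify rather than a guarantee.

```python
import re
from collections import Counter

# Assumed placeholder format for glyphs without a Unicode mapping.
CID_PATTERN = re.compile(r"\(cid:\d+\)")


def chars_in_bbox(chars, bbox):
    """Keep chars that lie fully inside bbox = (x0, top, x1, bottom)."""
    x0, top, x1, bottom = bbox
    return [
        c
        for c in chars
        if c["x0"] >= x0 and c["x1"] <= x1 and c["top"] >= top and c["bottom"] <= bottom
    ]


def summarize_chars(chars):
    """Count chars per font and count chars that look unmapped."""
    fonts = Counter(c.get("fontname", "?") for c in chars)
    unmapped = [
        c for c in chars if c["text"] == "" or CID_PATTERN.fullmatch(c["text"])
    ]
    return {"count": len(chars), "fonts": dict(fonts), "unmapped": len(unmapped)}


# Hand-written sample shaped like page.chars (not taken from the PDF).
sample_chars = [
    {"text": "1", "fontname": "FontA", "x0": 10, "x1": 15, "top": 10, "bottom": 20},
    {"text": "(cid:17)", "fontname": "FontB", "x0": 15, "x1": 20, "top": 10, "bottom": 20},
    {"text": "x", "fontname": "FontA", "x0": 200, "x1": 205, "top": 10, "bottom": 20},
]
region = chars_in_bbox(sample_chars, (0, 0, 100, 100))
summary = summarize_chars(region)
```

With the real file, `chars_in_bbox(page2.chars, bbox)` with a box drawn around one of the `[ ]` gaps would show whether any characters exist there at all. An empty result may suggest the numbers are not stored as ordinary text on the page, while `(cid:N)` entries would be more consistent with a mapping problem.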
