Open fangjiyuan opened 1 year ago
Hi @fangjiyuan - can you perhaps show what should be extracted exactly?
I see one long number in the text:
>>> [ word for word in words if '193' in word['text'] ]
[{'text': '19306498777',
'x0': 339.75,
'x1': 389.25,
'top': 143.61997000000008,
'doctop': 143.61997000000008,
'bottom': 152.61997000000008,
'upright': True,
'direction': 1}]
But I don't understand the language to know if that is the phone number or not.
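The filter above matches any word containing '193'. As a slightly stricter sketch, one could look for phone-like digit runs instead. This assumes the target is an 11-digit number starting with 1 (a guess based on the '19306498777' value seen here, not something pdfplumber knows about). `find_phone_like_words` is a hypothetical helper, and it only relies on each word being a dict with a 'text' key, as returned by `page.extract_words()`. The second sample word is made up for illustration.

```python
import re

# Assumption: the number is exactly 11 digits and starts with 1.
# The lookarounds keep us from matching inside a longer digit run.
PHONE_RE = re.compile(r"(?<!\d)1\d{10}(?!\d)")


def find_phone_like_words(words):
    """Return (number, word) pairs for words whose text holds a phone-like number."""
    matches = []
    for word in words:
        for number in PHONE_RE.findall(word["text"]):
            matches.append((number, word))
    return matches


# Hand-written stand-in for page.extract_words() output.
words = [
    {"text": "19306498777", "x0": 339.75, "x1": 389.25},
    {"text": "2023-07-18", "x0": 100.0, "x1": 150.0},  # made-up sample word
]

for number, word in find_phone_like_words(words):
    print(number, word["x0"], word["x1"])
```

With a real file, `words` would come from `page.extract_words()`. If the number is not in the PDF's text layer at all, this returns an empty list just like the original filter.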
Sorry about that. I tried to save one of the PDFs, and that may have changed the PDF's format. Could you have a look at this PDF example? What I want to get is '19306498777'.
https __order.crm.hcp.telecom.sd_rest_order_crm_BSS3_prd_receipt_2023_0718_730017088912.pdf @cmdlineluser
Hm, ok well I don't see a number in this new example:
But the red outlined area seems to be present in .extract_text()?
>>> print(page.extract_text())
防范打击通讯信息诈骗告知书
为进一步加大防范、治理通讯信息诈骗工作力度,加强开卡、用卡管控,特告知如下:
1.贩卖电话卡是违法行为,任何人不得将本人的电话卡转卖、转借、转租给他人,如将号码用于通信诈骗等违法活动,依照
《中华人民共和国刑事诉讼法》相关规定,公安机关将以帮助信息网络犯罪活动罪严厉处理,并纳入失信黑名单,同时需承担相
应法律责任。
2.用户存在以下通信异常疑似诈骗的,暂停通信服务:
(1)开卡后漫游至电信网络诈骗高危地且通信行为异常的;
(2)经他人投诉有诈骗、骚扰行为,一经核实的;
(3)频繁换机插卡或频繁补换卡;
(4)公安机关提供的涉案或高风险人员开办的号卡。
I can get this text too. But the phone number on page 2 of the PDF, in the red outlined area, can't be found.
Ah okay. So the problem is on page 2 of the updated PDF.
Yes, I get the same behaviour.
It appears none of the numbers inside the [] are detected.
>>> page2 = pdf.pages[1]
>>> print(page2.extract_text())
中国电信号码优享业务协议
甲方(用户): (以下简称:甲方)
乙 方:中国电信股份有限公司 分公司(以下简称:乙方)
鉴于甲、乙双方已经签订《中国电信用户入网协议》,甲方基于对乙方移动通信服务的了解和需求,自愿选择使用乙方的号
码优享业务。为维护双方权益,根据相关法律、法规的规定,在平等、自愿、公平、诚实信用的基础上,甲乙双方就号码优享业
务及相关事宜达成以下协议,共同遵照执行。
一、甲方自愿 选择 套餐、 办理号码优享业务,并按照该套餐以及号码优享业务的规则享有相应的权利、
承担相应的义务。
二、就甲方办理的号码优享业务,甲方有权使用乙方提供的优享号码[ ], 并承诺接受以下业务规则:
1. 预存话费:甲方于本协议生效之日 预存[ ]元 至优享号码的通信费账户,该费用仅用于抵扣本协议约定的优享号
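To check for this symptom programmatically, a small sketch could scan the extracted text for `[...]` fields and report which ones came back blank. `bracket_fields` is a hypothetical helper, not part of pdfplumber, and the sample string reuses two fragments of the output above.

```python
import re

# Matches one [...] field with no nested brackets.
BRACKET_RE = re.compile(r"\[([^\[\]]*)\]")


def bracket_fields(text):
    """Return (content, is_empty) for every [...] field found in text."""
    fields = []
    for match in BRACKET_RE.finditer(text):
        content = match.group(1)
        fields.append((content, content.strip() == ""))
    return fields


# Two fragments copied from the page 2 output above.
sample = "优享号码[ ], 并承诺接受以下业务规则:\n预存[ ]元"

for content, is_empty in bracket_fields(sample):
    print(repr(content), "EMPTY" if is_empty else "filled")
```

With a real page, `sample` would be `page2.extract_text()`. Every field reported as empty is a place where the viewer shows a value that the text layer does not provide.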
That helps with debugging things - thank you.
FWIW, it seems that the same text is not copy/paste-able from a standard PDF viewer into a plain text editor. I haven't examined the potential cause closely, but this suggests the issue might not be solvable via pdfplumber.
It's possible (but just a guess) that this might be caused by glyph remappings; cf. https://github.com/jsvine/pdfplumber/discussions/851#discussioncomment-5556104
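One way to probe that guess is to look at the raw character objects inside the problem area rather than at the assembled text. The sketch below is a rough diagnostic, not a fix. `summarize_chars` is a hypothetical helper, the bounding box and sample characters are made up, and it assumes each char is a dict with 'text', 'fontname', 'x0', 'x1', 'top' and 'bottom' keys, as in pdfplumber's `page.chars`. It also assumes unmapped glyphs show up as text like '(cid:17)', which is how pdfminer usually reports them.

```python
from collections import Counter


def summarize_chars(chars, bbox):
    """Summarize the chars fully inside bbox = (x0, top, x1, bottom)."""
    x0, top, x1, bottom = bbox
    inside = [
        c for c in chars
        if c["x0"] >= x0 and c["x1"] <= x1
        and c["top"] >= top and c["bottom"] <= bottom
    ]
    fonts = Counter(c.get("fontname", "?") for c in inside)
    # Blank text or "(cid:N)" text hints at a glyph with no usable Unicode mapping.
    suspicious = [
        c for c in inside
        if not c["text"].strip() or c["text"].startswith("(cid:")
    ]
    return {
        "count": len(inside),
        "fonts": dict(fonts),
        "suspicious": len(suspicious),
    }


# Made-up stand-in for page.chars; font names are illustrative only.
chars = [
    {"text": "号", "fontname": "SimSun", "x0": 300.0, "x1": 310.0, "top": 140.0, "bottom": 150.0},
    {"text": "(cid:17)", "fontname": "ABCDEF+Custom", "x0": 340.0, "x1": 345.0, "top": 143.0, "bottom": 152.0},
    {"text": " ", "fontname": "ABCDEF+Custom", "x0": 345.0, "x1": 350.0, "top": 143.0, "bottom": 152.0},
    {"text": "x", "fontname": "SimSun", "x0": 500.0, "x1": 510.0, "top": 400.0, "bottom": 410.0},
]

report = summarize_chars(chars, bbox=(290, 135, 400, 160))
print(report)
```

With a real page this would be called as `summarize_chars(page2.chars, bbox)`. A count of zero would suggest the digits are not in the text layer at all, while a high "suspicious" count in a separate font would be more consistent with the glyph-remapping guess.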
Describe the bug
When I try to use extract_words(), I can't get some of the text, for example the phone number.
Have you tried repairing the PDF?
It did not work.
PDF file
* https __order.crm.hcp.telecom.sd_rest_order_crm_BSS3_prd_receipt_2023_0718_730017088912.pdf