jsvine / pdfplumber

Plumb a PDF for detailed information about each char, rectangle, line, et cetera — and easily extract text and tables.
MIT License

Distinguish between bold and non-bold Fonts #724

Open lycfight opened 2 years ago

lycfight commented 2 years ago

Although the .chars object provides font names, it can't distinguish between bold and non-bold text in the same font, a distinction that matters for PDF parsing.

jsvine commented 2 years ago

Hi @lycfight, and thanks for your interest in this library. As I understand the PDF specification, there is not a "bold" attribute to text. Rather, the font itself either is or isn't bold, something typically (but not always) indicated in the fontname property. Unfortunately, it's difficult to troubleshoot your particular situation without knowing the PDF you're examining. Are you able to provide that?
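As a rough illustration of the fontname point above, here is a minimal sketch that guesses boldness from the font name alone. The helper names (`strip_subset_prefix`, `looks_bold`) and the keyword list are assumptions made for this example, not part of pdfplumber, and the check can only work when the font name happens to carry a weight hint.

```python
import re

# Assumption: these keywords are a heuristic guess at how bold weights are
# named; the list is neither exhaustive nor authoritative.
BOLD_HINT = re.compile(r"bold|black|heavy", re.IGNORECASE)


def strip_subset_prefix(fontname):
    """Drop a subset tag such as 'BCDEEE+' from an embedded font name."""
    return fontname.split("+", 1)[-1]


def looks_bold(fontname):
    """Return True if the font name (minus any subset tag) hints at a bold weight."""
    return bool(BOLD_HINT.search(strip_subset_prefix(fontname)))
```

With pdfplumber this could be applied as `[c for c in page.chars if looks_bold(c["fontname"])]`. A name such as `Arial-BoldMT` passes the check, while a name with no weight hint (for example `BCDEEE+KaiTi_GB2312`) does not, whatever the text looks like on the page.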

lycfight commented 2 years ago

I am trying to extract important sequences, so I concatenate the chars that share the same key, which is built by joining fontname, size, stroking_color and non_stroking_color. But the bold first sentence ends up mixed with the rest of the text in the non-bold paragraph.

test file: H3_AP202209071578120954_1.pdf

my code:

import json

import pdfplumber

def parse_importance_sen(page):
    # Group characters by font name, size (in tenths of a point), and colors.
    font_dict = dict()
    for char_item in page.chars:
        key = char_item["fontname"] + "_" + str(int(10 * char_item["size"])) + "_" + str(char_item["stroking_color"]) + "_" + str(char_item["non_stroking_color"])
        if key not in font_dict:
            font_dict[key] = []
        font_dict[key].append(char_item["text"])
    # Join each group's characters into a single string.
    for key in font_dict:
        font_dict[key] = "".join(font_dict[key])
    print(json.dumps(font_dict, ensure_ascii=False))

file = './H3_AP202209071578120954_1.pdf'
with pdfplumber.open(file) as pdf:
    parse_importance_sen(pdf.pages[0])

the result:

{
    "ArialMT_90_0_0":"      S0350521120005  yangy08@ghzq.com.cn   1M 3M 12M  -4.45% 2.21% -15.25% 300 -3.94% -1.46% -17.37% ",
    "BCDEEE+KaiTi_GB2312_105_0_0":"国海证券研究所请务必阅读正文后免责条款部分最近一年走势行业相对表现主要观点:)英伟达与对华出口高端芯片受限,国内云计算与人工智能发展或受影响。月日,芯片巨头英伟达发布公告,声称若对中国(含中国香港)和俄罗斯的客户出口两款高端芯片——和,需要新的出口许可。除此之外,另一家芯片巨头也被要求断供用于人工智能和数据中心的顶级计算芯片。(图形处理器)主要应用于图显和计算两大方面,更适用于密集型数据处理。在高端芯片国产替代能力不足的背景下,此类芯片的断供可能会直接影响国内云计算、人工智能产业的发展。另外,英伟达芯片目前已广泛应用于国内各车企自动驾驶域控制器中,若断供范围扩大或将引发市场担忧。)嬴彻科技发布《自动驾驶卡车量产白皮书》,披露从量产走向无人的三阶段技术路线。月日,嬴彻科技举办以“实践出真知”为主题的首届科技日,首次完整披露从量产走向无人的三阶段技术路线,并发布《自动驾驶卡车量产白皮书》。白皮书内容丰富,涵盖需求定义、系统开发、流程与工具、指标体系等多方面。据介绍,嬴彻科技嬴彻全栈自研技术已迈入阶段,并在核心技术上取得重要突破,未来有望在自动驾驶卡车领域占据领先地位。)五菱宏光敞篷版发布,采用抽签方式出售。为满足市场需求,五菱汽车在已大获成功的五菱宏光车型基础上推出了敞篷版车型,主打年轻人市场。新车突出亮点就是采用了无边框车门,以及电动软顶敞篷结构,创新的半自动开关篷结构。销售模式方面,五菱宏光敞篷版采用抽签形式,用户可通过小程序抽取购买资格。另外,敞篷版车型与普通版相比,在动力、续航、安全性与舒适性配置等方面均有所升级。)发布《年汽车与工业领域激光雷达应用报告》:汽车市场将成主要驱动力,禾赛引领中国企业突围。全球知名市场研究与战略咨询公司发布《年汽车与工业领域激光雷达应用报告》,预计未来五年,激光雷达整体市场仍将延续强劲的增长势头,或以",
    "ArialMT_105_0_0":"                  831GPUA100H100AMDGPUOrin 912.0 MINIEVMINIEV Yole2022 ",
    "Arial-BoldMT_140_1_1":" 20220907      ",
    "BCDEEE+KaiTi_GB2312_140_1_1":"年月日中小盘行业周报",
    "BCDEEE+KaiTi_GB2312_90_0_0":"研究所证券分析师:杨阳表现中小盘沪深",
    "Arial-BoldMT_105_0_0":"[Table_Title] 1AMDGPU23MINIEV4Yole2022",
    "Arial-BoldMT_159_0_0":"    ",
    "BCDEEE+KaiTi_GB2312_159_0_0":"英伟达与对华出口高端芯片受限,国内云计算与人工智能发展或受影响投资要点:",
    "BCDFEE+KaiTi_GB2312_159_0_0":"AMDGPU ",
    "Arial-BoldMT_150_0_0":"—— ",
    "BCDEEE+KaiTi_GB2312_150_0_0":"中小盘行业周报",
    "ArialMT_4_0_0":"  ",
    "ArialMT_105_1_1":" ",
    "BCDGEE+Wingdings-Regular_105_0_0":"◼"
}

The bold text on the first page: [screenshot: 截屏2022-09-07 21 44 37]

We can see that the following bold first sentences (underlined in red in the screenshot) are mixed with the remaining non-bold text in the paragraph:

1)英伟达与 AMD 对华出口高端 GPU 芯片受限,国内云计算与人工智能发展或受影响。
2)嬴彻科技发布《自动驾驶卡车量产白皮书》,披露从量产走向无人的三阶段技术路线。
3)五菱宏光 MINIEV 敞篷版发布,采用抽签方式出售。
4)Yole 发布《2022 年汽车与工业领域激光雷达应用报告》:汽车市场将成主要驱动力,禾赛引领中国企业突围。

jsvine commented 1 year ago

Thank you for providing the PDF and the context, @lycfight. It is very helpful. I've spent some time investigating the PDF but unfortunately have not been able to find an answer yet. At first, I thought it might be an issue with the font definitions in the PDF — perhaps two different fonts (bold and not bold) using the same font name. But that doesn't seem to be the case. It also doesn't seem that the PDF is using double-characters or other common tricks PDFs use to imitate bold fonts. So I'll have to do a bit more research until I have a more definitive answer.
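For readers who want to check their own files for the double-character trick mentioned above, here is a minimal sketch. It works on a list of char dicts shaped like pdfplumber's `page.chars` (using the `text`, `x0`, and `top` keys). The function name `find_overprinted_chars` and the default tolerance are assumptions made for this example. The comparison is quadratic in the number of chars, which is acceptable for a single page but not tuned for speed.

```python
def find_overprinted_chars(chars, tolerance=0.5):
    """Return chars that repeat an earlier glyph at nearly the same position.

    Hypothetical helper: `tolerance` is in PDF points and is a guess, not a
    value taken from pdfplumber.
    """
    seen = []
    dupes = []
    for c in chars:
        if not c["text"].strip():
            continue  # ignore whitespace, which often overlaps harmlessly
        for p in seen:
            if (
                c["text"] == p["text"]
                and abs(c["x0"] - p["x0"]) <= tolerance
                and abs(c["top"] - p["top"]) <= tolerance
            ):
                dupes.append(c)
                break
        else:
            # No earlier char matched, so remember this one as a first occurrence.
            seen.append(c)
    return dupes
```

Calling `find_overprinted_chars(page.chars)` on a page that imitates bold by drawing glyphs twice should return the second copies. An empty result is consistent with the observation above that this PDF does not appear to use that trick.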

lycfight commented 1 year ago

Thank you, I'll try some other methods.

qiancheng99 commented 1 year ago

Did you find any method to extract bold text such as subtitles? Thank you!

zhongshanguo commented 4 months ago

In pdfbox, font descriptor information can be extracted, though it has to be handled separately for PDType0Font, PDType1Font, etc. For some fonts the weight attribute "Bold" is available, while for others it is not; some of those can instead be judged by fontname. Taken together, this gives an imperfect but usable answer to whether text is bold. I hope plumber can implement similar functionality, thanks.

For now I use both tools in tandem and combine their results. From pdfbox I take the xy coordinates and the bold and italic attributes, but it cannot conveniently provide color-related attributes. Finally, I match the two outputs by xy coordinates and content, which is cumbersome.
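The coordinate-and-content matching described above could look roughly like the sketch below. The shape of the `styled` records (`text`, `x`, `y`, `bold`, `italic`) is hypothetical and stands in for whatever a second tool exports. The sketch also assumes both tools' coordinates have already been converted to the same origin and units, which would need checking in practice.

```python
def merge_style_flags(chars, styled, tolerance=1.0):
    """Copy 'bold'/'italic' flags from `styled` records onto matching chars.

    A record matches a char when the text is equal and the positions agree
    within `tolerance`. Unmatched chars get None for both flags.
    """
    unused = list(styled)  # each styled record is consumed at most once
    merged = []
    for c in chars:
        match = None
        for s in unused:
            if (
                s["text"] == c["text"]
                and abs(s["x"] - c["x0"]) <= tolerance
                and abs(s["y"] - c["top"]) <= tolerance
            ):
                match = s
                break
        out = dict(c)  # keep the original attributes, e.g. colors
        if match is not None:
            unused.remove(match)
            out["bold"] = match["bold"]
            out["italic"] = match["italic"]
        else:
            out["bold"] = None
            out["italic"] = None
        merged.append(out)
    return merged
```

The merged records keep pdfplumber's color attributes alongside the borrowed style flags, so a grouping key like the one used earlier in the thread could include `bold` as an extra component.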