jsvine / pdfplumber

Plumb a PDF for detailed information about each char, rectangle, line, et cetera — and easily extract text and tables.
MIT License

Help! Parsing a file produces duplicated text #992

Closed PeifengRen closed 6 months ago

PeifengRen commented 9 months ago
Source file

The output: output result

PeifengRen commented 9 months ago

Output result

PeifengRen commented 9 months ago

Parsing PDF files can produce duplicated content. For example, "How are you?" comes out as "How How are are you you??" or as "How are you? How are you?". Have you ever run into this situation? Thank you very much indeed.

wind-chh commented 9 months ago

The PDF file renders the same character twice, instead of using a bold font, to achieve a bold effect, so you get duplicated text.

You may need to rewrite the function at https://github.com/jsvine/pdfplumber/blob/94da66c1b32954d02ef03a5a9b30d0177d27af84/pdfplumber/utils/text.py#L562 to fit your situation.
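To make the cause concrete, here is a minimal sketch of the idea, not pdfplumber's actual implementation. It assumes each character is a dict with `text`, `x0`, and `top` keys (the same key names pdfplumber uses for its char objects). The function name `drop_overlapping_duplicates`, the `tolerance` default, and the sample data are all hypothetical.

```python
def drop_overlapping_duplicates(chars, tolerance=1.0):
    """Keep only the first of any chars that share text and nearly the same position."""
    kept = []
    for char in chars:
        is_duplicate = any(
            char["text"] == seen["text"]
            and abs(char["x0"] - seen["x0"]) <= tolerance
            and abs(char["top"] - seen["top"]) <= tolerance
            for seen in kept
        )
        if not is_duplicate:
            kept.append(char)
    return kept


# Hypothetical data: "Hi" drawn twice, the second pass shifted 0.3 pt to fake bold.
chars = [
    {"text": "H", "x0": 10.0, "top": 50.0},
    {"text": "H", "x0": 10.3, "top": 50.0},
    {"text": "i", "x0": 18.0, "top": 50.0},
    {"text": "i", "x0": 18.3, "top": 50.0},
]

naive = "".join(c["text"] for c in chars)
deduped = "".join(c["text"] for c in drop_overlapping_duplicates(chars))
print(naive)    # HHii
print(deduped)  # Hi
```

Genuinely repeated letters (the two l's in "hello") sit at different x positions, so a position-based check like this leaves them alone.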

jsvine commented 9 months ago

Thanks @wind-chh; to clarify: @PeifengRen, please try running page.dedupe_chars().extract_text().
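For anyone landing here later, the suggestion might look like this in a full script. This is only a sketch: the helper name `extract_deduped_text` and the `page_index` parameter are made up for illustration, and it assumes a pdfplumber version that provides `Page.dedupe_chars()`.

```python
def extract_deduped_text(pdf_path, page_index=0):
    """Return one page's text after removing overlapping duplicate chars."""
    # Imported inside the function so this sketch loads even where
    # pdfplumber is not installed.
    import pdfplumber

    with pdfplumber.open(pdf_path) as pdf:
        page = pdf.pages[page_index]
        # dedupe_chars() returns a copy of the page without the duplicated
        # chars; extract_text() then runs on that copy.
        return page.dedupe_chars().extract_text()


# Usage (hypothetical file name):
#   print(extract_deduped_text("example.pdf"))
```

If some duplicates survive, `dedupe_chars` also accepts a `tolerance` argument that controls how close two chars must be to count as the same one.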

PeifengRen commented 6 months ago

Thanks very much!!!