Closed PeifengRen closed 6 months ago
Parsing PDF files may result in duplicate content. For example, "How are you?" comes out after parsing as "How How are are you you??" or "How are you? How are you?". Has anyone else run into this? Thank you very much.
The PDF file renders the same character twice instead of using a bold font to achieve a bold effect, so you get duplicate text.
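To illustrate the idea, here is a minimal pure-Python sketch of why overprinted characters show up twice and how they can be filtered out. It assumes each character is a dict with `text`, `x0`, and `top` keys; the function name `drop_overprinted_chars` and the sample data are made up for this example and are not the library's actual implementation.

```python
def drop_overprinted_chars(chars, tolerance=1.0):
    """Keep a character only if no already-kept character has the same
    text at (almost) the same position."""
    kept = []
    for ch in chars:
        is_duplicate = any(
            ch["text"] == k["text"]
            and abs(ch["x0"] - k["x0"]) <= tolerance
            and abs(ch["top"] - k["top"]) <= tolerance
            for k in kept
        )
        if not is_duplicate:
            kept.append(ch)
    return kept


# Hypothetical data: "Hi" drawn twice with a 0.3 pt offset to fake bold.
chars = [
    {"text": "H", "x0": 10.0, "top": 50.0},
    {"text": "H", "x0": 10.3, "top": 50.0},
    {"text": "i", "x0": 18.0, "top": 50.0},
    {"text": "i", "x0": 18.3, "top": 50.0},
]

naive = "".join(c["text"] for c in chars)
clean = "".join(c["text"] for c in drop_overprinted_chars(chars))
print(naive, "->", clean)  # HHii -> Hi
```

Genuinely repeated letters (the "ll" in "hello", say) sit at different x positions, so they survive as long as the tolerance is smaller than a character width.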
You may need to rewrite the `
` function to fit your situation.
Thanks @wind-chh. To clarify: @PeifengRen, please try running `page.dedupe_chars().extract_text()`.
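In case it helps, that call could be wrapped up roughly like this. This is only a sketch: the helper name `extract_deduped_text` and the file name in the usage comment are placeholders, and the `tolerance` value may need tuning for a given PDF.

```python
def extract_deduped_text(pdf_path, tolerance=1):
    """Return the text of every page after removing overprinted characters."""
    # Imported inside the function so this snippet can be loaded even
    # where pdfplumber is not installed.
    import pdfplumber

    pages_text = []
    with pdfplumber.open(pdf_path) as pdf:
        for page in pdf.pages:
            # dedupe_chars() returns a copy of the page without the
            # duplicated characters; extract_text() then runs on that copy.
            text = page.dedupe_chars(tolerance=tolerance).extract_text()
            pages_text.append(text or "")
    return "\n".join(pages_text)


# Usage (placeholder path):
# print(extract_deduped_text("example.pdf"))
```

If some duplicates remain, a slightly larger `tolerance` may catch them, at the risk of dropping characters that are legitimately close together.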
Thanks very much!!!
the output: