jsvine / pdfplumber

Plumb a PDF for detailed information about each char, rectangle, line, et cetera — and easily extract text and tables.
MIT License

Always converts the pattern 'ti' to the number 5 when extracting text, e.g. question to ques5on, solution to solu5on. #1043

Closed rucwangw closed 11 months ago

rucwangw commented 11 months ago

Source PDF:

[Image]

The extracted text is as follows:

[Image]

jsvine commented 11 months ago

This seems likely to be an issue with the PDF itself, and how it encodes text. One way to test this yourself is to open the PDF in a standard PDF reader, select some of the text, and paste it into a text editor. Do you see the 5s there as well? If not, could you provide the PDF?
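
The same check can be approximated in code. The following is a minimal sketch, not part of the original thread: the helper names (`find_suspect_words`, `fonts_for_char`, `inspect_pdf`) are hypothetical, and it assumes only pdfplumber's `page.extract_text()`, `page.chars`, and `page.page_number`. It flags words with a digit between letters (such as "ques5on") and counts which fonts produce the character "5". That can help narrow down where the odd characters come from, but it does not by itself prove the cause.

```python
import re

# A single digit sandwiched between letters, e.g. "ques5on" or "solu5on".
SUSPECT = re.compile(r"\b[A-Za-z]+\d[A-Za-z]+\b")


def find_suspect_words(text):
    """Return words in `text` where one digit sits between letters."""
    return SUSPECT.findall(text)


def fonts_for_char(chars, target="5"):
    """Count font names for characters whose extracted text equals `target`.

    `chars` is a list of dicts shaped like pdfplumber's `page.chars`
    (each with at least "text" and "fontname" keys).
    """
    counts = {}
    for char in chars:
        if char.get("text") == target:
            name = char.get("fontname", "?")
            counts[name] = counts.get(name, 0) + 1
    return counts


def inspect_pdf(path):
    """Print suspect words and the fonts behind each "5", page by page."""
    import pdfplumber  # imported lazily so the helpers above work without it

    with pdfplumber.open(path) as pdf:
        for page in pdf.pages:
            text = page.extract_text() or ""
            print(page.page_number,
                  find_suspect_words(text),
                  fonts_for_char(page.chars))
```

Calling `inspect_pdf("example.pdf")` (a placeholder path) prints one line per page. If the suspect words also show up when copying text from a standard PDF reader, the problem most likely lies in the PDF rather than in pdfplumber.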

rucwangw commented 11 months ago

Thanks for your response. It's an issue with the PDF.