jsvine / pdfplumber

Plumb a PDF for detailed information about each char, rectangle, line, et cetera — and easily extract text and tables.
MIT License

It might be a character set problem #766

Closed Gadil-1987 closed 1 year ago

Gadil-1987 commented 1 year ago

Describe the bug

I used pdfplumber to read a PDF file, but the result was not what I expected. I also tried copying the characters from the PDF and pasting them elsewhere, and the result was wrong there too. However, when I opened the PDF in Office Word, the text was correct.

Code to reproduce the problem

image

PDF file

* 600278 东方创业2012年度股东大会的法律意见书_600278_20130423_2.pdf


Expected behavior

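The key part on page 4 should read: (本页无正文,为《金茂凯德律师事务所关于东方国际创业股份有限公司2012 年度股东大会的法律意见书》之签署页). In English, roughly: "This page has no body text; it is the signature page of the legal opinion of 金茂凯德律师事务所 on the 2012 annual general meeting of shareholders of 东方国际创业股份有限公司."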

Environment

jsvine commented 1 year ago

Hi @Gadil-1987, and thank you for filing this issue. Given that the result is the same as copy-pasting directly from the document, this seems unlikely to be something pdfplumber can fix. It's possible that Office Word is doing something special that this library could learn from, but I'm not sure what that would be. Closing for now, but feel free to continue the discussion here.
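
For anyone who wants to dig further, here is a minimal sketch of one way to check whether the wrong characters are tied to a particular embedded font. It only uses `pdfplumber.open`, `page.extract_text()`, and the `fontname` and `text` keys of `page.chars`. The helper names `summarize_fonts` and `inspect_page`, and the path `example.pdf`, are hypothetical and not part of the library. This is a diagnostic aid only: it does not repair the text, and the cause of the problem in this file has not been confirmed.

```python
from collections import Counter, defaultdict


def summarize_fonts(chars):
    """Group extracted characters by font name and count each character.

    `chars` is a list of dicts shaped like pdfplumber's `page.chars`,
    i.e. each dict has at least the keys "fontname" and "text".
    """
    by_font = defaultdict(Counter)
    for ch in chars:
        by_font[ch["fontname"]][ch["text"]] += 1
    return by_font


def inspect_page(path, page_number):
    """Print the extracted text of one page and a per-font character sample.

    `page_number` is 1-based, matching how the page is referred to above.
    """
    import pdfplumber  # imported lazily so summarize_fonts works without it

    with pdfplumber.open(path) as pdf:
        page = pdf.pages[page_number - 1]  # pdf.pages is a 0-based list
        print(page.extract_text())
        for font, counts in summarize_fonts(page.chars).items():
            # Show the 20 most frequent characters extracted for this font.
            sample = "".join(text for text, _ in counts.most_common(20))
            print(f"{font}: {sample}")


# Hypothetical usage (the path is a placeholder, not the attached file):
# inspect_page("example.pdf", 4)
```

If the unexpected characters all come from a single font while the other fonts extract correctly, that would point at how that font's characters are mapped inside the PDF rather than at pdfplumber, which would fit with copy-paste producing the same result.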