jsvine / pdfplumber

Plumb a PDF for detailed information about each char, rectangle, line, et cetera — and easily extract text and tables.
MIT License

dedupe_chars() method get error #842

Closed: 154192 closed this issue 1 year ago

154192 commented 1 year ago

Attachments: image, 300218_2011.pdf

jsvine commented 1 year ago

Thanks for raising this issue, @154192. The immediate issue appears to be that the fontnames for some of the PDF's characters are being read as bytes (e.g., b'RGJSAP+\xcb\xce\xcc\xe5') instead of strings. I'm not yet sure whether this is an issue with the PDF itself or with pdfminer.six, the library pdfplumber uses as its PDF parser. I hope to take a closer look soon.
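To illustrate why mixed bytes/str fontnames can be a problem, here is a minimal sketch in plain Python. It does not call pdfplumber and it is not the fix that shipped. The char dicts, the sort key, and the `normalize_fontname` helper are all hypothetical, and it assumes the failure comes from a step that sorts characters by a key that includes the fontname.

```python
from operator import itemgetter


def normalize_fontname(fontname):
    """Hypothetical helper: return the fontname as a str, decoding bytes if needed."""
    if isinstance(fontname, bytes):
        try:
            return fontname.decode("utf-8")
        except UnicodeDecodeError:
            # latin-1 maps every byte to a code point, so this fallback never raises.
            # The result may not be the "real" font name, but it is a comparable str.
            return fontname.decode("latin-1")
    return fontname


# Made-up char dicts; only the bytes value is taken from the comment above.
chars = [
    {"fontname": "ABCDEF+Helvetica", "size": 10.0, "text": "a"},
    {"fontname": b"RGJSAP+\xcb\xce\xcc\xe5", "size": 10.0, "text": "b"},
]

# Assumed sort key: any key that includes the fontname behaves the same way.
key = itemgetter("fontname", "size", "text")

# Sorting a mix of str and bytes fontnames fails, because Python 3
# does not define an ordering between the two types.
try:
    sorted(chars, key=key)
    mixed_sort_failed = False
except TypeError:
    mixed_sort_failed = True

# After normalizing every fontname to str, the same sort succeeds.
normalized = [{**c, "fontname": normalize_fontname(c["fontname"])} for c in chars]
sorted_chars = sorted(normalized, key=key)
```

Normalizing before sorting is only a workaround sketch; whether the bytes originate in the PDF or in the parser is a separate question.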

jsvine commented 1 year ago

This should now be fixed in v0.9.0, but let me know if it's still not working for you.