jsvine / pdfplumber

Plumb a PDF for detailed information about each char, rectangle, line, et cetera — and easily extract text and tables.
MIT License

The colored word is read into two words #576

Closed Godlikemandyy closed 2 years ago

Godlikemandyy commented 2 years ago

After reading the PDF file, one blue word comes out duplicated, e.g. "任何单位和个人" becomes "任任何何单单位为和和个个人人". What causes this, and how can I fix it?

Thank you

jsvine commented 2 years ago

Hi @Godlikemandyy, and thanks for your interest in this library. Without having access to the original PDF, or the code you used, it is difficult to answer your question. But I would suggest the following:

print(page.filter(lambda obj: not (
  obj["object_type"] == "char"
  and obj["non_stroking_color"] == "..."  # Replace "..." with the value determined in the previous step
)).extract_text())
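The comment in the snippet above refers to a color value that has to be found first. One possible way to find it, offered only as a sketch, is to count how often each `non_stroking_color` value occurs among the page's char objects. The helpers `as_hashable` and `color_counts` and the `sample_chars` data are hypothetical and made up for illustration; with a real PDF you would pass `page.chars` instead of the sample list.

```python
from collections import Counter


def as_hashable(color):
    """Colors may arrive as lists; convert them to tuples so they can be counted."""
    return tuple(color) if isinstance(color, list) else color


def color_counts(chars):
    """Count how many char dicts carry each non_stroking_color value."""
    return Counter(as_hashable(c.get("non_stroking_color")) for c in chars)


# Hypothetical stand-in for `page.chars`; the color values are invented.
sample_chars = [
    {"object_type": "char", "text": "任", "non_stroking_color": (0, 0, 1)},
    {"object_type": "char", "text": "任", "non_stroking_color": (0,)},
    {"object_type": "char", "text": "何", "non_stroking_color": (0, 0, 1)},
    {"object_type": "char", "text": "何", "non_stroking_color": (0,)},
]

# List each color with its count, most frequent first.
for color, count in color_counts(sample_chars).most_common():
    print(color, count)
```

If the duplicated characters turn out to share one color that the rest of the text does not use, that value is a candidate for the `"..."` placeholder. If the duplicates are instead overlapping copies with the same attributes, `page.dedupe_chars()` may be worth trying before `extract_text()`.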

Because this is a specific-PDF troubleshooting question, rather than a bug or feature request, I'm closing this issue. Feel free to continue the discussion here, or through a new troubleshooting Discussion: https://github.com/jsvine/pdfplumber/discussions/categories/get-help-with-specific-pdfs