jsvine / pdfplumber

Plumb a PDF for detailed information about each char, rectangle, line, et cetera — and easily extract text and tables.
MIT License
6.48k stars, 658 forks

When I use the extract_text function, the x_tolerance argument doesn't work for me. #1004

Closed papandadj closed 11 months ago

papandadj commented 12 months ago

Thank you for providing such a great open-source project. However, there are some things I don't quite understand when using it.

Describe the bug

image

In this PDF, two characters are enclosed in red rectangles.

Below is the output of my debugging code.

image

I have a question. The gap between the x1 of 'e' and the x0 of '的' is smaller than x_tolerance (364 - 361 = 3 < 10000). Why is a space printed between the two characters?

Did I misunderstand something somewhere?
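
The comparison described above can be written out as a small sketch. It assumes character records shaped like the dicts in pdfplumber's `page.chars` (with `text`, `x0`, and `x1` keys). Only the x1 of 'e' (361) and the x0 of '的' (364) come from the debugging output; the other coordinates are placeholders, and `gap_exceeds_tolerance` is a hypothetical helper that only models the expectation, not pdfplumber's actual word-splitting logic.

```python
# Hypothetical char records. Only x1 of 'e' and x0 of '的' are taken from
# the post (rounded); the remaining coordinates are illustrative placeholders.
left = {"text": "e", "x0": 355.0, "x1": 361.0}
right = {"text": "的", "x0": 364.0, "x1": 374.0}


def gap_exceeds_tolerance(prev_char, next_char, x_tolerance):
    """Return True if the horizontal gap between two chars is larger than x_tolerance.

    This is a simplified model of the expectation in the question,
    not a copy of pdfplumber's implementation.
    """
    gap = next_char["x0"] - prev_char["x1"]
    return gap > x_tolerance


gap = right["x0"] - left["x1"]
print(gap)                                                     # 3.0
print(gap_exceeds_tolerance(left, right, x_tolerance=10000))   # False
```

Under this model the gap of 3 is far below the tolerance of 10000, so the gap alone would not be expected to produce a space.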