Closed trrk closed 3 years ago
Hi @trrk, appreciate your interest in the library. The extract_text() method adds a newline when the difference between the doctop of one character and the doctop of the next character is greater than y_tolerance (defaults to 3). In bbox1's case, the doctop of the top row is 109.440 and the bottom is 111.567. Since the difference is 2.127, which is less than the default y_tolerance of 3, it does not add a new line. You can reduce the y_tolerance to a lower value like 2 and then extract the text with the newline. Example: extract_text(y_tolerance=2).
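To make the arithmetic above concrete, here is a minimal sketch of the line-splitting rule as described in this reply. It is a simplified model and not the library's actual implementation: the function name split_into_lines is hypothetical, and it looks only at doctop values, ignoring everything else extract_text() does.

```python
def split_into_lines(doctops, y_tolerance=3):
    """Group doctop values into lines (simplified, hypothetical model).

    A new line starts whenever the gap between consecutive doctop
    values is greater than y_tolerance, mirroring the rule described
    in the reply above.
    """
    lines = []
    for doctop in sorted(doctops):
        if lines and doctop - lines[-1][-1] <= y_tolerance:
            # Gap is within tolerance: same line as the previous character.
            lines[-1].append(doctop)
        else:
            # Gap exceeds tolerance (or this is the first value): new line.
            lines.append([doctop])
    return lines


# The two rows reported for bbox1 are 2.127 apart.
rows = [109.440, 111.567]
merged = split_into_lines(rows)                  # default y_tolerance=3
split = split_into_lines(rows, y_tolerance=2)
print(len(merged), len(split))
```

With the default tolerance the two rows fall into one group, and with y_tolerance=2 they are separated, which matches the behavior reported in this issue.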
I see. I like the default y_tolerance value, and I use the library from the interface that allows me to select an area on the screen. I would therefore like to keep y_tolerance at 3. But then I read the documentation again and realized that I might be able to use the filter function to remove the small characters at the edges. Thank you for your answer.
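A rough sketch of the filter idea mentioned above, assuming that characters cut off by the crop end up with a reduced height and that each object is a dict with "object_type", "top", and "bottom" keys. The helper name make_char_height_filter and the min_height threshold are hypothetical, and the commented usage lines assume a filter method that accepts a predicate; check both against the library documentation for your version.

```python
def make_char_height_filter(min_height):
    """Build a predicate that drops characters shorter than min_height.

    Hypothetical helper: min_height is an illustrative threshold that
    you would tune to the font size used in your document.
    """
    def keep(obj):
        # Leave non-character objects (lines, rects, ...) untouched.
        if obj.get("object_type") != "char":
            return True
        # Keep only characters that are tall enough to be complete.
        return (obj["bottom"] - obj["top"]) >= min_height
    return keep


keep_full_chars = make_char_height_filter(min_height=5)

# Assumed usage, with y_tolerance left at its default of 3:
# cropped = page.crop(bbox)
# text = cropped.filter(keep_full_chars).extract_text()
```

The threshold approach keeps y_tolerance unchanged and only removes the sliver of the neighboring line, at the cost of having to pick a height that suits the document's font size.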
What are you trying to do?
Cropped page text extraction
What code are you using to do it?
PDF file
abcde.pdf
Expected behavior
Actual behavior
Screenshots
None
Environment
Additional context
When the crop range contains part of another line, the results are merged into one line. This seems to happen only when the range contains only a small part of another line.
I think it is undesirable to be merged into one line. Is there a good way to avoid this?