Closed sanchez5674 closed 2 years ago
Hi @sanchez5674, and thanks for your interest in this library. Without seeing the specific PDF you're using, it's difficult to diagnose your situation accurately. But my initial hunch is that your PDF uses positioning rather than literal space characters to create the visual impression of spacing.
How is this data extracted?
page.chars contains the character data calculated by pdfminer.six (pdfplumber's core dependency). If there's a literal space character in the PDF, it should be there.
What's the x_tolerance used?
x_tolerance is only used in .extract_text(...) and similar functions. Those functions look at both literal space characters and character positioning to determine where one word ends and the next begins.
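To make the idea of position-based spacing concrete, here is a minimal sketch of the kind of gap check described above. It is not pdfplumber's actual implementation: the function name join_chars_with_gaps and the sample data are made up for illustration, and it assumes all characters sit on a single line and carry the "text", "x0" and "x1" keys found in page.chars.

```python
def join_chars_with_gaps(chars, x_tolerance=3):
    """Rebuild one line of text, adding a space where the horizontal gap is large.

    Simplified illustration only; assumes every char is on the same line.
    """
    ordered = sorted(chars, key=lambda c: c["x0"])
    pieces = []
    prev = None
    for char in ordered:
        if prev is not None:
            gap = char["x0"] - prev["x1"]
            # A gap wider than the tolerance counts as a word boundary,
            # unless a literal space character already marks it.
            if gap > x_tolerance and prev["text"] != " " and char["text"] != " ":
                pieces.append(" ")
        pieces.append(char["text"])
        prev = char
    return "".join(pieces)


# Hypothetical characters: "Hi", then a 5-point gap, then "yo" (no literal space).
sample_chars = [
    {"text": "H", "x0": 10.0, "x1": 16.0},
    {"text": "i", "x0": 16.0, "x1": 19.0},
    {"text": "y", "x0": 24.0, "x1": 30.0},
    {"text": "o", "x0": 30.0, "x1": 36.0},
]

print(join_chars_with_gaps(sample_chars, x_tolerance=1))
```

With a small tolerance the 5-point gap becomes a space; with a tolerance larger than the gap, the four characters run together as one word.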
If you want to take advantage of that part of the code without calling Page.extract_text(...) directly, you can use something like pdfplumber.utils.extract_text(my_chars, x_tolerance=...).
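If the goal is to keep positions and font information while still splitting on gaps, one option is to group the character dictionaries yourself. The sketch below is only an illustration under stated assumptions: group_chars_into_words is a hypothetical helper, not part of pdfplumber, and it assumes the characters belong to one line and carry the "text", "x0", "x1", "top", "bottom", "fontname" and "size" keys. As far as I know, Page.extract_words(...) also accepts x_tolerance and an extra_attrs list, which may cover the same need without custom code.

```python
def group_chars_into_words(chars, x_tolerance=3):
    """Group single-line chars into words, keeping bounding box and font info."""
    groups = []
    current = []
    for char in sorted(chars, key=lambda c: c["x0"]):
        if char["text"].isspace():
            # A literal space always ends the current word.
            if current:
                groups.append(current)
                current = []
            continue
        # A horizontal gap wider than the tolerance also ends the word.
        if current and char["x0"] - current[-1]["x1"] > x_tolerance:
            groups.append(current)
            current = []
        current.append(char)
    if current:
        groups.append(current)

    words = []
    for group in groups:
        words.append({
            "text": "".join(c["text"] for c in group),
            "x0": min(c["x0"] for c in group),
            "x1": max(c["x1"] for c in group),
            "top": min(c["top"] for c in group),
            "bottom": max(c["bottom"] for c in group),
            # Font of the first character; mixed-font words are not handled.
            "fontname": group[0]["fontname"],
            "size": group[0]["size"],
        })
    return words


def make_char(text, x0, x1, fontname):
    """Build a made-up char dict shaped like an entry of page.chars."""
    return {"text": text, "x0": x0, "x1": x1, "top": 100.0,
            "bottom": 112.0, "fontname": fontname, "size": 12.0}


# Hypothetical data: "Hi" in a regular font, a 5-point gap, then "yo" in bold.
demo_chars = [
    make_char("H", 10.0, 16.0, "Helvetica"),
    make_char("i", 16.0, 19.0, "Helvetica"),
    make_char("y", 24.0, 30.0, "Helvetica-Bold"),
    make_char("o", 30.0, 36.0, "Helvetica-Bold"),
]

for word in group_chars_into_words(demo_chars, x_tolerance=1):
    print(word["text"], word["x0"], word["x1"], word["fontname"])
```

Each resulting word keeps its own x0/x1/top/bottom and font, so sentences can be rebuilt by joining the "text" values with spaces while the layout data stays available.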
Closing this issue for now, but feel free to continue the discussion here.
Hi,
There was a related issue with spaces described here:
https://github.com/jsvine/pdfplumber/issues/334
The solution there works as explained and adding x_tolerance=1 did the trick if I only want to extract the text.
My issue is I don't just want to extract the text. I would like to keep the format information (x, y positions, font, etc) so my code looks like this:
Then I recombine the characters to form sentences if needed, but when doing this, there are no spaces in first_page.chars. How is this data extracted? What x_tolerance is used? Is there a way to modify it, as there is in extract_text()?