jsvine / pdfplumber

Plumb a PDF for detailed information about each char, rectangle, line, et cetera — and easily extract text and tables.
MIT License

No spaces found in first_page.chars #649

Closed sanchez5674 closed 2 years ago

sanchez5674 commented 2 years ago

Hi,

There was a related issue with spaces described here:

https://github.com/jsvine/pdfplumber/issues/334

The solution there works as explained: adding x_tolerance=1 did the trick when I only want to extract the text.

import pdfplumber

with pdfplumber.open(filename) as pdf:
  first_page = pdf.pages[0]
  text = first_page.extract_text(x_tolerance=1).split('\n')

My issue is I don't just want to extract the text. I would like to keep the format information (x, y positions, font, etc) so my code looks like this:

import pandas as pd
import pdfplumber
with pdfplumber.open(filename) as pdf:
  first_page = pdf.pages[0]
  # One row per character, with columns such as x0, top, fontname and size
  df_char = pd.DataFrame(first_page.chars)

Then I recombine the characters to form sentences where needed, but when I do this there are no spaces in first_page.chars. How is this data extracted? What x_tolerance is used? Is there a way to modify it, as with extract_text()?

jsvine commented 2 years ago

Hi @sanchez5674, and thanks for your interest in this library. Without seeing the specific PDF you're using, it's difficult to diagnose your situation accurately. But my initial hunch is that your PDF uses positioning rather than literal space characters to create the visual impression of spacing.
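One way to test that hunch is to inspect the character records directly. The sketch below assumes each record is a dict with the text, x0 and x1 keys found in page.chars, and that all records sit on a single line. The helper name summarize_spacing and the sample_chars data are made up for illustration.

```python
def summarize_spacing(chars):
    """Count literal space characters and list horizontal gaps between neighbours."""
    literal_spaces = sum(1 for c in chars if c["text"] == " ")
    # Assumes a single line of text, so ordering by x0 is enough
    ordered = sorted(chars, key=lambda c: c["x0"])
    gaps = [round(b["x0"] - a["x1"], 2) for a, b in zip(ordered, ordered[1:])]
    return literal_spaces, gaps


# Hypothetical records for a line that reads "Hi you" with no stored space
sample_chars = [
    {"text": "H", "x0": 10.0, "x1": 16.0},
    {"text": "i", "x0": 16.0, "x1": 18.0},
    {"text": "y", "x0": 22.0, "x1": 27.0},
    {"text": "o", "x0": 27.0, "x1": 32.0},
    {"text": "u", "x0": 32.0, "x1": 37.0},
]

literal_spaces, gaps = summarize_spacing(sample_chars)
print(literal_spaces)  # 0
print(gaps)            # [0.0, 4.0, 0.0, 0.0]
```

A count of zero together with one noticeably wider gap would suggest the space is visual only. With a real file, first_page.chars could be passed in place of sample_chars after filtering to one line.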

How is this data extracted?

page.chars contains the character data calculated by pdfminer.six (pdfplumber's core dependency). If there's a literal space character in the PDF, it should be there.

What x_tolerance is used?

x_tolerance is only used in .extract_text(...) and similar functions. Those functions look at both literal space characters and character positioning to determine where one word ends and the next begins.
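To illustrate the idea, here is a deliberately simplified sketch of position-based spacing. It is not pdfplumber's actual implementation. The function name join_with_spaces and the sample data are hypothetical, and the records are assumed to carry the text, x0, x1 and top keys.

```python
def join_with_spaces(chars, x_tolerance=3, y_tolerance=3):
    """Rebuild text from char records, adding a space where the gap is wide.

    Simplified illustration only; real extraction handles many more cases.
    """
    ordered = sorted(chars, key=lambda c: (c["top"], c["x0"]))
    pieces = []
    prev = None
    for c in ordered:
        if prev is not None:
            if abs(c["top"] - prev["top"]) > y_tolerance:
                pieces.append("\n")  # vertical jump: start a new line
            elif (
                c["x0"] - prev["x1"] > x_tolerance
                and prev["text"] != " "
                and c["text"] != " "
            ):
                pieces.append(" ")  # horizontal gap wider than the tolerance
        pieces.append(c["text"])
        prev = c
    return "".join(pieces)


# Hypothetical records: "Hi you" on one line, "ok" on the next, no stored spaces
sample = [
    {"text": "H", "x0": 10.0, "x1": 16.0, "top": 100.0},
    {"text": "i", "x0": 16.0, "x1": 18.0, "top": 100.0},
    {"text": "y", "x0": 22.0, "x1": 27.0, "top": 100.0},
    {"text": "o", "x0": 27.0, "x1": 32.0, "top": 100.0},
    {"text": "u", "x0": 32.0, "x1": 37.0, "top": 100.0},
    {"text": "o", "x0": 10.0, "x1": 15.0, "top": 120.0},
    {"text": "k", "x0": 15.0, "x1": 20.0, "top": 120.0},
]

print(repr(join_with_spaces(sample, x_tolerance=1)))  # 'Hi you\nok'
print(repr(join_with_spaces(sample, x_tolerance=5)))  # 'Hiyou\nok'
```

The 4-point gap between "i" and "y" becomes a space only when the tolerance is smaller than the gap, which is why lowering x_tolerance can recover spaces that a larger value misses.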

If you want to take advantage of that part of the code without calling Page.extract_text(...) directly, you can use something like pdfplumber.utils.extract_text(my_chars, x_tolerance=...).
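Since the goal in this thread is to keep position and font data, another option might be to group characters into words while carrying those attributes along. The sketch below assumes a pdfplumber Page object and relies on Page.extract_words(...) with its x_tolerance and extra_attrs parameters. The helper name words_with_format is hypothetical.

```python
def words_with_format(page, x_tolerance=1):
    """Return one dict per word, keeping position and font attributes."""
    words = page.extract_words(
        x_tolerance=x_tolerance,
        # Carry these char attributes onto each word
        extra_attrs=["fontname", "size"],
    )
    keep = ("text", "x0", "x1", "top", "bottom", "fontname", "size")
    return [{key: w[key] for key in keep} for w in words]


# Usage, assuming `filename` points at a PDF:
# import pdfplumber
# with pdfplumber.open(filename) as pdf:
#     rows = words_with_format(pdf.pages[0], x_tolerance=1)
```

The resulting list of dicts can be loaded into a DataFrame in the same way as page.chars, with one row per word instead of one row per character.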

Closing this issue for now, but feel free to continue the discussion here.