jsvine / pdfplumber

Plumb a PDF for detailed information about each char, rectangle, line, et cetera — and easily extract text and tables.
MIT License

Unnecessary spaces added at the middle of word #896

Open alzambranolu13 opened 1 year ago

alzambranolu13 commented 1 year ago

I'm trying to convert medical PDFs into text. The issue is, for example: in the PDF there's the text "Common bile duct margin", but extract_text() returns " C ommon bile d uct m argin".

It seems that pdfplumber takes just C as a word by itself, same as d and m. Is it possible to avoid this?

Thanks, Alejandra

jsvine commented 1 year ago

Hi @alzambranolu13, try adjusting the x_tolerance parameter to a value larger than the default of 3. Does that work for you?

alzambranolu13 commented 1 year ago

Hi @jsvine, I tried changing x_tolerance to many different values, and nothing :(

jsvine commented 1 year ago

Thanks for checking. Can you share the PDF? (It'll be hard to diagnose your issue without it.)

o10baird commented 1 year ago

Hello, I am running into the same issue and can provide an example pdf and location where extra spaces are added.

I am using pdfplumber to extract text from U.S. Army Regulations. The regulations have text that is column justified. This means that there might be a variable amount of whitespace added between characters, although often it just looks like standard spacing too.

An example of this is the text: "Sends a fax copy of DD Form 553 to the Commander, Personnel Control Facility, Deserter Information Point, (ATZK–PMF–DIP), Fort Knox, Kentucky 40121. The fax machine number is DSN 536–3715, Commercial 502–626–3715."

Extracting the text results in the following: "Sends a fax copy of DD Form 553 to the Commander, Personnel Control Facility, Deserter Information Point, ( A T Z K – P M F – D I P ) , F o r t K n o x , K e n t u c k y 4 0 1 2 1 . T h e f a x m a c h i n e n u m b e r i s D S N 5 3 6 – 3 7 1 5 , C o m m e r c i a l 502–626–3715."

I have attached the PDF this text is extracted from. You can find the specific text at the bottom of page 7, paragraph 3-3.f.

AR_630-10.pdf

Any help is appreciated,

Ian

jsvine commented 1 year ago

Hi @o10baird, and thanks for providing the PDF, which is very helpful. What seems to be happening here is that the PDF contains a bunch of explicit whitespace characters (rather than implicit whitespace) to create that spacing. There are a few ways you can see/test this. The most straightforward is just to copy-paste that text from standard PDF viewing software into a plain-text editor. When I do that, I get this:

f. Sends a fax copy of DD Form 553 to the Commander, Personnel Control Facility, Deserter Information Point,
( A T Z K – P M F – D I P ) , F o r t K n o x , K e n t u c k y 4 0 1 2 1 . T h e f a x m a c h i n e n u m b e r i s D S N 502–626–3715.

You can also take a look at the page.chars objects, and see something similar. Likewise, pdfplumber's visual debugging tools can also help diagnose the situation:

import pdfplumber

pdf = pdfplumber.open("AR_630-10.pdf")
page = pdf.pages[14]
im = page.to_image(resolution=150)
im.reset().draw_rects(page.chars)

Zooming in on the text in question:

The good news is that, at least for this particular PDF, it seems you can get the text you want just by filtering out all whitespace characters:

filtered = page.filter(lambda obj: obj.get("text") != " ")
print(filtered.extract_text())

... gets you:

f. Sends a fax copy of DD Form 553 to the Commander, Personnel Control Facility, Deserter Information Point,
(ATZK–PMF–DIP), Fort Knox, Kentucky 40121. The fax machine number is DSN 536–3715, Commercial
502–626–3715.
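
The page.chars check mentioned above can be sketched roughly as follows. This is only an illustration: whitespace_report is a hypothetical helper name, not part of pdfplumber, and it assumes each char object is a dict with a "text" key, as page.chars entries are. It also counts any whitespace character, which is slightly broader than the `!= " "` test in the filter.

```python
from collections import Counter


def whitespace_report(chars):
    """Count explicit whitespace characters among char dicts.

    `chars` is assumed to look like pdfplumber's page.chars:
    a list of dicts that each carry a "text" key.
    """
    counts = Counter(
        "whitespace" if c.get("text", "").isspace() else "other"
        for c in chars
    )
    return {"whitespace": counts["whitespace"], "other": counts["other"]}


# Possible usage against the attached PDF (not executed here):
#   import pdfplumber
#   with pdfplumber.open("AR_630-10.pdf") as pdf:
#       print(whitespace_report(pdf.pages[14].chars))
```

A high share of whitespace entries in a spaced-out passage would suggest the spaces are explicit characters in the PDF rather than gaps inferred by pdfplumber.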

o10baird commented 1 year ago

@jsvine that worked wonders. Thank you for the assistance. I really appreciate it!

dhdaines commented 1 year ago

@alzambranolu13 wrote: "Hi @jsvine, I tried changing x_tolerance to many different values, and nothing :("

You can also try setting y_tolerance to something smaller, or use_text_flow=True. It may seem counterintuitive (it did to me) to change y_tolerance but sometimes there are explicit space characters nearby which end up getting merged into the same line of text by pdfplumber.
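
One way to compare these options side by side is sketched below. compare_settings and CANDIDATES are hypothetical names, and the specific tolerance values are illustrative guesses rather than recommendations; the only thing assumed about `page` is that it has an extract_text() method accepting these keyword arguments, as a pdfplumber page does.

```python
def compare_settings(page, settings):
    """Call page.extract_text() once per kwargs dict and collect the results."""
    results = []
    for kwargs in settings:
        results.append((kwargs, page.extract_text(**kwargs)))
    return results


# Candidate settings to try; the values are illustrative, not tuned.
CANDIDATES = [
    {},                       # library defaults
    {"x_tolerance": 5},       # larger horizontal tolerance
    {"y_tolerance": 1},       # smaller vertical tolerance
    {"use_text_flow": True},  # follow the PDF's own character order
]

# Possible usage (not executed here):
#   import pdfplumber
#   with pdfplumber.open("example.pdf") as pdf:
#       for kwargs, text in compare_settings(pdf.pages[0], CANDIDATES):
#           print(kwargs, repr(text[:80]))
```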

Alternatively, just removing all the explicit space characters and using the character spacing to break words, as mentioned above, is a good option.
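
To make that idea concrete, here is a simplified sketch of breaking words by character spacing once the explicit spaces are dropped. It is not pdfplumber's actual implementation: join_chars_by_gap is a hypothetical helper, it handles a single line only, and it assumes each char is a dict with "text", "x0" and "x1" keys, as in page.chars.

```python
def join_chars_by_gap(chars, x_tolerance=3):
    """Rebuild one line of text while ignoring explicit whitespace chars.

    A space is emitted only where the horizontal gap between two
    consecutive visible glyphs is larger than `x_tolerance`.
    """
    visible = [c for c in chars if not c["text"].isspace()]
    visible.sort(key=lambda c: c["x0"])  # left-to-right order

    pieces = []
    prev = None
    for c in visible:
        if prev is not None and c["x0"] - prev["x1"] > x_tolerance:
            pieces.append(" ")
        pieces.append(c["text"])
        prev = c
    return "".join(pieces)
```

In practice, page.filter(...) followed by extract_text(), as shown earlier in the thread, does this work for you; the sketch only illustrates why dropping the explicit spaces lets x_tolerance decide the word breaks.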