extract_text misses spaces between words

jtjohnston commented 2 years ago

Describe the bug

Extracting text frequently misses spaces between words, so many words end up concatenated. This particular example is a PDF that was likely generated with a LaTeX compiler (e.g. pdflatex).

Code to reproduce the problem

import pdfplumber

with pdfplumber.open("/path/to/file.pdf") as pdf:
    first_page = pdf.pages[0]
    print(first_page.extract_text())

PDF file

An example PDF can be downloaded (as of 2/16/2022) with: wget https://proceedings.neurips.cc/paper/2021/file/000c076c390a4c357313fca29e390ece-Paper.pdf

Expected behavior

The first line of the actual behavior (below) should be:

We provide improved gap-dependent regret bounds for reinforcement learning in

Actual behavior

(sub-sampled output):

Weprovideimprovedgap-dependentregretboundsforreinforcementlearningin
ﬁniteepisodicMarkovdecisionprocesses. Comparedtopriorwork,ourbounds
dependonalternativedeﬁnitionsofgaps. Thesedeﬁnitionsarebasedontheinsight
that,inordertoachieveafavorableregret,analgorithmdoesnotneedtolearnhow
tobehaveoptimallyinstatesthatarenotreachedbyanoptimalpolicy. Weprove
tighterupperregretboundsforoptimisticalgorithmsandaccompanythemwith
newinformation-theoreticlowerboundsforalargeclassofMDPs. Ourresults
showthatoptimisticalgorithmscannotachievetheinformation-theoreticlower
boundsevenindeterministicMDPsunlessthereisauniqueoptimalpolicy.

Environment

pdfplumber version: 0.6.0
Python version: 3.9.7
OS: Linux (Ubuntu 18, running in a WSL2 shell)

xelaos commented 2 years ago

Did you try to modify the x_tolerance parameter like this?

text = page.extract_text(x_tolerance=1)

jtjohnston commented 2 years ago

@xelaos Thanks, that did work for me (at least on this example). I guess the question now is: how do I know when/if I have to use that (e.g. if I'm extracting text automatically from lots of pdfs)? or how often is this needed? Why sometimes and not others? Etc.

jsvine commented 2 years ago

@jtjohnston Typically, you'll need to specify/adjust x_tolerance whenever you have typography that crams letters together very closely or spaces them apart very widely.

PDFs don't themselves have a concept of "words" and many PDFs don't include whitespace characters explicitly but rather depend on letter-spacing to visually represent that whitespace. So this library provides the x_tolerance parameter to let the user specify the minimum distance between letters that should be considered a word separator.
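To make the gap rule concrete, here is a minimal sketch in plain Python. It is not pdfplumber's actual implementation: `join_chars` is a hypothetical helper, and the character dicts are made-up data that only mimic the `text`, `x0`, and `x1` keys of pdfplumber's character objects. The default of 3 is an assumption about the library's default `x_tolerance`.

```python
def join_chars(chars, x_tolerance=3):
    """Join the characters of one text line, inserting a space wherever
    the gap between one character's right edge (x1) and the next
    character's left edge (x0) exceeds x_tolerance."""
    chars = sorted(chars, key=lambda c: c["x0"])
    pieces = []
    prev = None
    for c in chars:
        if prev is not None and c["x0"] - prev["x1"] > x_tolerance:
            pieces.append(" ")
        pieces.append(c["text"])
        prev = c
    return "".join(pieces)


# Made-up characters: "We", then a 2.5pt gap, then "do".
chars = [
    {"text": "W", "x0": 0.0, "x1": 9.0},
    {"text": "e", "x0": 9.0, "x1": 13.0},
    {"text": "d", "x0": 15.5, "x1": 20.5},
    {"text": "o", "x0": 20.5, "x1": 25.5},
]

print(join_chars(chars, x_tolerance=3))  # Wedo
print(join_chars(chars, x_tolerance=1))  # We do
```

With a tolerance of 3 the 2.5pt gap is too small to count as a separator, which reproduces the concatenation reported above; lowering the tolerance to 1 recovers the space.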

This library has generally shied away from "magic" (i.e., auto-tuning parameters). But there are likely some heuristics you could use to auto-guess the appropriate x_tolerance, especially if you have some general expectations about the types of PDFs you'll be processing (e.g., they'll all have a big chunk of text on the first page).
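One such heuristic, sketched under stated assumptions: on a single line of text, the gaps between adjacent characters tend to fall into two clusters (small gaps inside words, larger gaps between words), so the midpoint of the widest jump between sorted gaps is a plausible tolerance. `guess_x_tolerance` is a hypothetical helper, not part of pdfplumber, and the character dicts are made-up data using the same `x0`/`x1` keys as pdfplumber's character objects.

```python
def guess_x_tolerance(chars, default=3.0):
    """Guess a word-separator tolerance for the characters of ONE line.

    Sorts the horizontal gaps between adjacent characters and returns the
    midpoint of the widest jump between consecutive gap sizes. Falls back
    to `default` when there is too little data or all gaps are equal."""
    chars = sorted(chars, key=lambda c: c["x0"])
    # Overlapping (negative) gaps from kerning are clamped to zero.
    gaps = sorted(
        max(b["x0"] - a["x1"], 0.0) for a, b in zip(chars, chars[1:])
    )
    if len(gaps) < 2:
        return default
    jumps = [(hi - lo, lo, hi) for lo, hi in zip(gaps, gaps[1:])]
    jump, lo, hi = max(jumps)
    if jump <= 0:
        return default
    return (lo + hi) / 2


# Made-up line: "We", then a 2.5pt gap, then "do".
line = [
    {"text": "W", "x0": 0.0, "x1": 9.0},
    {"text": "e", "x0": 9.0, "x1": 13.0},
    {"text": "d", "x0": 15.5, "x1": 20.5},
    {"text": "o", "x0": 20.5, "x1": 25.5},
]

print(guess_x_tolerance(line))  # 1.25
```

This only makes sense on a line that contains at least one real word break, and justified text or tables would likely need something more careful, so treat the result as a starting point to check against a few sample pages.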

Closing this issue for now, but feel free to continue the discussion.

Sarke commented 1 year ago

@jsvine I have the same problem. Let me know if I should start a new issue for this, but your above reply is very relevant.

My initial thought is that ideally we would be able to set the tolerance as a fraction of the font size, since both the word spacing and the line spacing usually change proportionally with the font size.
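A rough sketch of what that could look like, in plain Python rather than as a patch to the library. `join_chars_scaled` and the `ratio` value of 0.15 are hypothetical, and the character dicts are made-up data that assume each character carries `x0`, `x1`, and a font `size`.

```python
def join_chars_scaled(chars, ratio=0.15):
    """Join the characters of one line, inserting a space when the gap to
    the previous character exceeds `ratio` times that character's size."""
    chars = sorted(chars, key=lambda c: c["x0"])
    pieces = []
    prev = None
    for c in chars:
        if prev is not None:
            tolerance = ratio * prev["size"]
            if c["x0"] - prev["x1"] > tolerance:
                pieces.append(" ")
        pieces.append(c["text"])
        prev = c
    return "".join(pieces)


# Made-up 10pt text: 2.5pt gap between the words, no gap inside them.
small = [
    {"text": "W", "x0": 0.0, "x1": 5.0, "size": 10},
    {"text": "e", "x0": 5.0, "x1": 10.0, "size": 10},
    {"text": "d", "x0": 12.5, "x1": 17.5, "size": 10},
    {"text": "o", "x0": 17.5, "x1": 22.5, "size": 10},
]

# Made-up 40pt text: 10pt gap between the words, 2pt gaps inside them.
large = [
    {"text": "W", "x0": 0.0, "x1": 20.0, "size": 40},
    {"text": "e", "x0": 22.0, "x1": 42.0, "size": 40},
    {"text": "d", "x0": 52.0, "x1": 72.0, "size": 40},
    {"text": "o", "x0": 74.0, "x1": 94.0, "size": 40},
]

print(join_chars_scaled(small))  # We do
print(join_chars_scaled(large))  # We do
```

The same ratio handles both sizes, whereas a single fixed tolerance would not: a value of 3 misses the 2.5pt gap in the small text, and a value of 1 splits the large text at its 2pt letter gaps.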

jsvine commented 1 year ago

Thanks @Sarke, I think that's a nice idea and have opened a feature request issue here: https://github.com/jsvine/pdfplumber/issues/987

afriedman412 commented 11 months ago

Hey, I'm working on this. Do you have a PDF with crammed letters I can use for testing?
