jsvine / pdfplumber

Plumb a PDF for detailed information about each char, rectangle, line, et cetera — and easily extract text and tables.
MIT License
6.31k stars, 647 forks

Consider punctuation when extracting words #682

Closed by lolipopshock 2 years ago

lolipopshock commented 2 years ago

In this PR, we add an optional argument, split_at_punctuation, to the extract_words function. When set, it forces tokens to break at punctuation characters.

Usage

# pdf_page contains the text: fine-tuned models.
pdf_page.extract_words(split_at_punctuation=True)  # splits at every character in `string.punctuation`
# returns words whose text is ["fine", "-", "tuned", "models", "."]
pdf_page.extract_words(split_at_punctuation='!"&\'()*+,.:;<=>?@[\\]^`{|}~')  # splits only at the listed characters
# returns words whose text is ["fine-tuned", "models", "."]
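
To make the two modes concrete without needing a PDF, here is a minimal sketch of the splitting rule on a plain string. The helper name split_words_at_punctuation is hypothetical and this is not the PR's implementation, which works on positioned PDF characters rather than strings. It only assumes the behavior described above: True means every character in string.punctuation, while a string means just the characters it lists.

```python
import string


def split_words_at_punctuation(text, split_at_punctuation=False):
    """Hypothetical helper: split text on whitespace, optionally breaking at punctuation.

    Illustrates the rule described in this PR on a plain string; pdfplumber
    itself operates on character objects with coordinates.
    """
    if split_at_punctuation is True:
        punctuation = set(string.punctuation)
    else:
        # A falsy value disables splitting; a string lists the split characters.
        punctuation = set(split_at_punctuation or "")

    tokens = []
    current = ""
    for char in text:
        if char.isspace():
            # Whitespace always ends the current token.
            if current:
                tokens.append(current)
            current = ""
        elif char in punctuation:
            # A split character ends the current token and becomes its own token.
            if current:
                tokens.append(current)
            tokens.append(char)
            current = ""
        else:
            current += char
    if current:
        tokens.append(current)
    return tokens


text = "fine-tuned models."
print(split_words_at_punctuation(text))
# ['fine-tuned', 'models.']
print(split_words_at_punctuation(text, split_at_punctuation=True))
# ['fine', '-', 'tuned', 'models', '.']
print(split_words_at_punctuation(text, split_at_punctuation="!\"&'()*+,.:;<=>?@[\\]^`{|}~"))
# ['fine-tuned', 'models', '.']
```

The custom string leaves out the hyphen, which is why "fine-tuned" stays in one piece in the last call.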

More visualization

The red boxes denote the detected tokens. [image]

jsvine commented 2 years ago

Hi @lolipopshock, and thanks for submitting this PR! I really appreciate the clear explanation and examples. The CI pipeline is getting caught on the linting step. If you could reformat your code with psf/black — just run make format from the repository root — and commit/push, that'd be great. In the meantime, I'll take a closer look at the proposed changes.

lolipopshock commented 2 years ago

@jsvine thank you for your prompt response -- just updated the files!

jsvine commented 2 years ago

Thanks again, @lolipopshock. This seems like a worthwhile addition to the library, and I really appreciate you providing such a thorough PR. I'm going to merge into develop. I might also fiddle with the implementation slightly (merging this new logic with the keep_blank_chars logic), but nothing radical.