jsvine / pdfplumber

Plumb a PDF for detailed information about each char, rectangle, line, et cetera — and easily extract text and tables.
MIT License
6.57k stars 659 forks

fix chars deduplication for words with intentionally duplicated chars #504

Closed konradmalik closed 3 years ago

konradmalik commented 3 years ago

There is a problem with the current deduplication algorithm - it removes "intentionally" duplicated letters.

For example, if we run deduplication on the chars ssttiillll, we get stil, which is wrong. We want still.

The solution is simple: instead of returning the first letter of each cluster of grouped characters, return every second character (the 1st, 3rd, and so on). A cluster of len=2 then yields 1 letter, a cluster of len=4 yields 2 letters, and a cluster of len=1 yields 1 letter. A cluster of len=3 yields 2 letters, which is neither good nor bad; in theory, this should not happen when the algorithm is run on a duplicated page.

This PR implements this change.
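The rule described above can be sketched on plain strings. This is only an illustration, not the code in this PR: pdfplumber deduplicates char objects, whereas the sketch assumes a cluster is simply a run of consecutive identical characters. The function names `dedupe_first` and `dedupe_every_second` are hypothetical.

```python
from itertools import groupby


def dedupe_first(chars: str) -> str:
    """Keep only the first character of each cluster (the behaviour described as problematic)."""
    # Assumption: a "cluster" is a run of consecutive identical characters.
    return "".join(key for key, _ in groupby(chars))


def dedupe_every_second(chars: str) -> str:
    """Keep every second character of each cluster, starting with the first."""
    kept = []
    for _, run in groupby(chars):
        cluster = list(run)
        # A cluster of length n contributes ceil(n / 2) characters.
        kept.extend(cluster[::2])
    return "".join(kept)


first_only = dedupe_first("ssttiillll")
every_second = dedupe_every_second("ssttiillll")
```

With the input ssttiillll, `first_only` is stil and `every_second` is still. A cluster of three, such as aaa, yields two characters, matching the len=3 case mentioned above.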

samkit-jain commented 3 years ago

Hi @konradmalik, I appreciate you opening a pull request. Could you please update the PR to follow the contribution guidelines at https://github.com/jsvine/pdfplumber/blob/stable/CONTRIBUTING.md#submitting-pull-requests

konradmalik commented 3 years ago

@samkit-jain thank you for the info. I've made 3 mistakes: