fix chars deduplication for words with intentionally duplicated chars

konradmalik commented 3 years ago

There is a problem with the current deduplication algorithm - it removes "intentionally" duplicated letters.

For example, if we run deduplication on the chars "ssttiillll", we get "stil", which is wrong. We want "still".

The solution is simple - instead of returning the first letter of each cluster of grouped characters, return every second character. This way, a cluster of len=1 gives 1 letter, a cluster of len=2 gives 1 letter, and a cluster of len=4 gives 2 letters. A cluster of len=3 gives 2 letters, which is neither good nor bad. In theory, this should not happen when the algorithm is run on a duplicated page.

This PR implements this change.
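The rule described above can be sketched on a plain string as follows. This is only an illustration, not the PR's diff and not pdfplumber's API: the function names `dedupe_first_of_run` and `dedupe_every_second` are hypothetical, and the sketch assumes clusters are simply runs of identical adjacent characters, whereas the library works on char objects with positions.

```python
from itertools import groupby


def dedupe_first_of_run(text: str) -> str:
    """Keep only the first character of each run of identical characters."""
    return "".join(key for key, _ in groupby(text))


def dedupe_every_second(text: str) -> str:
    """Keep every second character of each run (hypothetical helper)."""
    kept = []
    for _, run in groupby(text):
        cluster = list(run)
        # cluster[::2] keeps indices 0, 2, 4, ..., i.e. ceil(len / 2) characters:
        # len=1 -> 1, len=2 -> 1, len=3 -> 2, len=4 -> 2.
        kept.extend(cluster[::2])
    return "".join(kept)


if __name__ == "__main__":
    print(dedupe_first_of_run("ssttiillll"))  # stil
    print(dedupe_every_second("ssttiillll"))  # still
```

Slicing with a step of 2 makes the cluster-length behaviour explicit, including the ambiguous len=3 case, which keeps 2 characters.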

samkit-jain commented 3 years ago

Hi @konradmalik, I appreciate you opening a pull request. Please update the PR as per the contribution guidelines at https://github.com/jsvine/pdfplumber/blob/stable/CONTRIBUTING.md#submitting-pull-requests

konradmalik commented 3 years ago

@samkit-jain thank you for the info. I've made 3 mistakes:

1. I haven't read the contributing guidelines.
2. I wrongly assumed the error was in that place, when in reality it was an actual problem with my PDF, as proper tests showed me.
3. I created this PR too fast ;)

I'm closing this one.

jsvine / pdfplumber

fix chars deduplication for words with intentionally duplicated chars #504