Closed konradmalik closed 3 years ago
Hi @konradmalik, thanks for opening a pull request. Please update the PR to follow the contribution guidelines at https://github.com/jsvine/pdfplumber/blob/stable/CONTRIBUTING.md#submitting-pull-requests
@samkit-jain thank you for the info. I've made 3 mistakes:
There is a problem with the current deduplication algorithm - it removes "intentionally" duplicated letters.
For example, if we run deduplication on the chars `ssttiillll`, we get `stil`, which is bad. We want `still`.
The solution is simple: instead of returning the first letter of each cluster of grouped characters, return every second character (starting with the first). This way, a cluster of len=2 yields 1 letter, a cluster of len=4 yields 2 letters, and a cluster of len=1 yields 1 letter. A cluster of len=3 yields 2 letters, which is neither good nor bad; in theory, this should not happen when the algorithm is run on a duplicated page.
This PR implements this change.
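To illustrate the idea, here is a minimal sketch on plain strings. It is a simplification, not the code in this PR: it assumes a cluster is just a run of consecutive identical characters, whereas the real algorithm groups char objects by their attributes. The function names `dedupe_first` and `dedupe_every_second` are hypothetical and exist only for this example.

```python
from itertools import groupby


def dedupe_first(text):
    """Old behaviour (simplified): keep only the first char of each cluster."""
    # Assumption: a "cluster" is a run of consecutive identical characters.
    return "".join(key for key, _ in groupby(text))


def dedupe_every_second(text):
    """Proposed behaviour (simplified): keep every second char of each cluster."""
    kept = []
    for _, group in groupby(text):
        cluster = list(group)
        # Indices 0, 2, 4, ...: len 1 -> 1, len 2 -> 1, len 3 -> 2, len 4 -> 2.
        kept.extend(cluster[::2])
    return "".join(kept)


print(dedupe_first("ssttiillll"))         # stil
print(dedupe_every_second("ssttiillll"))  # still
```

Slicing with `[::2]` keeps the first character of every cluster, so clusters of len=1 are left untouched, while doubled clusters are halved.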