Open jnhyperion opened 11 months ago
Thanks for this proposal, @jnhyperion. I think this particular change isn't quite right for the library, as it's quite specific to a particular (and relatively uncommon) edge case. I find that changes like these might fix the handling of some PDFs but risk causing problems for others, since there's such a wide variety of PDFs. But perhaps we can think of a more general feature that would still help for your use case, such as a simple .extract_text(ignore_whitespace=True) parameter or a Page.remove_whitespace(..., only_overlapping=True) method (in a similar spirit to Page.dedupe_chars(...)).
You're right. I added a new method, Page.remove_whitespace.
I found that this issue is caused by blank characters that overlap the non-blank characters following them. The simple solution is to remove these overlapping blank characters.
fix: https://github.com/jsvine/pdfplumber/issues/964
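
To make the idea concrete, here is a minimal sketch of dropping blank characters whose bounding box overlaps the next non-blank character. It is not the code from this pull request: the function names remove_overlapping_whitespace and boxes_overlap and the tolerance parameter are hypothetical, and the sketch only assumes that each character is a dict with "text", "x0", "x1", "top" and "bottom" keys, as in pdfplumber's char objects.

```python
from typing import Any, Dict, List

Char = Dict[str, Any]


def boxes_overlap(a: Char, b: Char, tolerance: float = 0.0) -> bool:
    """Return True if the two bounding boxes intersect by more than `tolerance`
    on both axes. Boxes that merely touch at an edge do not count."""
    x_overlap = min(a["x1"], b["x1"]) - max(a["x0"], b["x0"])
    y_overlap = min(a["bottom"], b["bottom"]) - max(a["top"], b["top"])
    return x_overlap > tolerance and y_overlap > tolerance


def remove_overlapping_whitespace(
    chars: List[Char], tolerance: float = 0.0
) -> List[Char]:
    """Hypothetical helper: drop each blank char that overlaps the next
    non-blank char in reading order; keep every other char unchanged."""
    kept: List[Char] = []
    for i, ch in enumerate(chars):
        if ch["text"].isspace():
            # Find the next non-blank char after this blank one, if any.
            following = next(
                (c for c in chars[i + 1:] if not c["text"].isspace()), None
            )
            if following is not None and boxes_overlap(ch, following, tolerance):
                continue  # blank char sits on top of real text: skip it
        kept.append(ch)
    return kept


def make_char(text: str, x0: float, x1: float) -> Char:
    """Build a char dict on a single text line (illustrative data only)."""
    return {"text": text, "x0": x0, "x1": x1, "top": 0.0, "bottom": 10.0}


sample = [
    make_char("A", 0, 10),
    make_char(" ", 10, 14),   # genuine gap between A and B: kept
    make_char("B", 14, 24),
    make_char(" ", 22, 26),   # overlaps the following "C": removed
    make_char("C", 24, 34),
]
cleaned = remove_overlapping_whitespace(sample)
print("".join(c["text"] for c in cleaned))
```

Only blank characters are ever removed, so text in PDFs without this overlap problem passes through unchanged, which is in the spirit of the only_overlapping=True option suggested above.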