jsvine / pdfplumber

Plumb a PDF for detailed information about each char, rectangle, line, et cetera — and easily extract text and tables.
MIT License

fix issue 964 #965

Open jnhyperion opened 11 months ago

jnhyperion commented 11 months ago

I found that this issue is caused by blank chars that overlap the non-blank chars following them. The simple solution is to remove these overlapping blank chars.

fix: https://github.com/jsvine/pdfplumber/issues/964
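For illustration, the idea described above could be sketched roughly as follows. This is not the code from this PR: the function names `boxes_overlap` and `drop_overlapped_blanks` are hypothetical, and the sketch assumes each char is a plain dict in reading order with the `text`, `x0`, `x1`, `top`, and `bottom` keys that pdfplumber uses for chars.

```python
def boxes_overlap(a, b):
    """Return True if the bounding boxes of two chars share a positive area."""
    x_overlap = min(a["x1"], b["x1"]) - max(a["x0"], b["x0"])
    y_overlap = min(a["bottom"], b["bottom"]) - max(a["top"], b["top"])
    return x_overlap > 0 and y_overlap > 0


def drop_overlapped_blanks(chars):
    """Drop each blank char whose box overlaps the non-blank char right after it.

    Hypothetical helper: `chars` is assumed to be a list of dicts in reading
    order. Blank chars that merely touch their neighbor are kept.
    """
    kept = []
    for i, ch in enumerate(chars):
        nxt = chars[i + 1] if i + 1 < len(chars) else None
        is_overlapped_blank = (
            ch["text"].isspace()
            and nxt is not None
            and not nxt["text"].isspace()
            and boxes_overlap(ch, nxt)
        )
        if not is_overlapped_blank:
            kept.append(ch)
    return kept
```

A blank char that only shares an edge with the next char (zero overlap width) is treated as a genuine space and kept, so ordinary word spacing should be unaffected under these assumptions.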

jsvine commented 10 months ago

Thanks for this proposal, @jnhyperion. I think this particular change isn't quite right for the library, as it's quite specific to a particular (and relatively uncommon) edge case. I find that changes like this might fix the handling of some PDFs but risk causing problems for others, given the wide variety of PDFs out there. But perhaps we can think of a more general feature that would still help for your use case, such as a simple .extract_text(ignore_whitespace=True) parameter or a Page.remove_whitespace(..., only_overlapping=True) method (in a similar spirit to Page.dedupe_chars(...)).
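To make the opt-in idea concrete, here is a minimal standalone sketch of what an `only_overlapping` switch might mean. It is an assumption about the semantics, not an existing pdfplumber API: `remove_whitespace` here is a hypothetical free function over a list of char dicts (keys `text`, `x0`, `x1`, `top`, `bottom`), not a `Page` method.

```python
def _overlaps(a, b):
    """Return True if two char bounding boxes share a positive area."""
    x_overlap = min(a["x1"], b["x1"]) - max(a["x0"], b["x0"])
    y_overlap = min(a["bottom"], b["bottom"]) - max(a["top"], b["top"])
    return x_overlap > 0 and y_overlap > 0


def remove_whitespace(chars, only_overlapping=False):
    """Hypothetical sketch: return chars with whitespace chars removed.

    only_overlapping=False drops every whitespace char.
    only_overlapping=True drops only whitespace chars whose box overlaps
    some non-whitespace char, leaving ordinary spaces in place.
    """
    non_blank = [c for c in chars if not c["text"].isspace()]

    def should_drop(c):
        if not c["text"].isspace():
            return False
        if not only_overlapping:
            return True
        return any(_overlaps(c, other) for other in non_blank)

    return [c for c in chars if not should_drop(c)]
```

Because the default behavior is untouched unless the caller asks for it, a switch like this would only affect PDFs where the user has chosen to apply it, which is the point of making the feature general and opt-in.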

jnhyperion commented 10 months ago

You're right. I added a new method, Page.remove_whitespace.