jsvine / pdfplumber

Plumb a PDF for detailed information about each char, rectangle, line, et cetera — and easily extract text and tables.
MIT License

Fix bug in `dedupe_chars()` in which `._objects` was accessed before assignment #294

Closed samkit-jain closed 3 years ago

samkit-jain commented 4 years ago

This PR fixes #293. The bug went uncaught in https://github.com/jsvine/pdfplumber/commit/04fd56ac405fd753e7f9c826ce103459013f2e71, probably because other methods had already called `.objects` before the call to `.dedupe_chars()`.
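To illustrate the failure mode, here is a minimal sketch. It assumes `.objects` is a property that populates `._objects` on first access, as the description above implies. `LazyPage`, `copy_objects_buggy`, and the two helper functions are hypothetical names for illustration only, not pdfplumber code.

```python
class LazyPage:
    """Hypothetical stand-in for a page whose objects are parsed lazily."""

    def __init__(self, raw_chars):
        # `_objects` is deliberately not set here; it appears on first
        # access to the `objects` property.
        self._raw_chars = raw_chars

    @property
    def objects(self):
        # Populate the cache on first access, then reuse it.
        if not hasattr(self, "_objects"):
            self._objects = {"char": list(self._raw_chars)}
        return self._objects

    def copy_objects_buggy(self):
        # Reads the private cache directly, so it fails if nothing has
        # touched `.objects` yet.
        return dict(self._objects)


def fails_on_fresh_page():
    page = LazyPage("aab")
    try:
        page.copy_objects_buggy()
    except AttributeError:
        return True
    return False


def succeeds_after_objects_access():
    page = LazyPage("aab")
    page.objects  # any earlier access hides the bug
    return page.copy_objects_buggy() == {"char": ["a", "a", "b"]}
```

Under these assumptions, the method only fails when it is the first thing called on a fresh page, which would explain why existing tests did not catch it.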

To resolve the issue, I replaced the use of `._objects` with `.objects`. If `._objects` must be used directly, an alternative is to add a bare `self.objects` statement first, like this:

    def dedupe_chars(self, **kwargs):
        """
        Removes duplicate chars — those sharing the same text, fontname, size,
        and positioning (within `tolerance`) as other characters on the page.
        """
        self.objects  # Statement has no effect but is useful as it would instantiate `._objects` for further use below
        p = FilteredPage(self, True)
        p._objects = dict((kind, objs) for kind, objs in self._objects.items())
        p._objects["char"] = utils.dedupe_chars(self.chars, **kwargs)
        return p

The same `self.objects` call could instead go in the `__init__()` method.
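A rough comparison of the two options, again as a sketch under the assumption that `.objects` lazily fills `._objects`. `SketchPage`, its `eager` flag, and `parse_count` are hypothetical and exist only to make the trade-off visible; they are not part of pdfplumber.

```python
class SketchPage:
    """Hypothetical page class; not pdfplumber's actual implementation."""

    def __init__(self, raw_chars, eager=False):
        self._raw_chars = raw_chars
        self.parse_count = 0  # counts how often the objects are parsed
        if eager:
            # Option 2: touch the property in __init__ so `_objects`
            # always exists afterwards.
            self.objects

    @property
    def objects(self):
        if not hasattr(self, "_objects"):
            self.parse_count += 1
            self._objects = {"char": list(self._raw_chars)}
        return self._objects

    def copy_objects(self):
        # Option 1: go through the public property, which populates the
        # cache on demand.
        return dict(self.objects)


lazy = SketchPage("aab")
eager = SketchPage("aab", eager=True)

# Parsing work done before any method is called on each page.
parsed_before_use = (lazy.parse_count, eager.parse_count)

copied = lazy.copy_objects()
```

In this sketch, going through `.objects` keeps parsing deferred until something needs it, whereas the `__init__()` variant pays the parsing cost when the page is constructed, even if the objects are never used.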

NOTE: I added a new PDF issue-71-duplicate-chars-2.pdf (thanks to @Fleur09 for sharing in #292) as it has repeating characters.

codecov[bot] commented 4 years ago

Codecov Report

Merging #294 into develop will not change coverage. The diff coverage is 100.00%.

@@           Coverage Diff            @@
##           develop     #294   +/-   ##
========================================
  Coverage    97.47%   97.47%           
========================================
  Files           10       10           
  Lines         1190     1190           
========================================
  Hits          1160     1160           
  Misses          30       30           

| Impacted Files | Coverage Δ |
|---|---|
| pdfplumber/page.py | 100.00% <100.00%> (ø) |

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data. Last update 04fd56a...b132d45.

jsvine commented 3 years ago

Great catch, thanks! Merging.