jsvine / pdfplumber

Plumb a PDF for detailed information about each char, rectangle, line, et cetera — and easily extract text and tables.

MIT License

6.57k stars 659 forks source link

Sort tables extracted on a page by their `top` position #338

Closed samkit-jain closed 3 years ago

samkit-jain commented 3 years ago

Fixes #336 h/t @gqh1995 for reporting

When multiple tables are found on a page, sorts them top to bottom. If there is a collision, sorts left to right. Not sure, but I think instead of sorting as I have done here, but perhaps, can update the sorting at https://github.com/jsvine/pdfplumber/blob/de767c3a52bbcacd237a17a051a3975d592023d9/pdfplumber/table.py#L315 instead.

PS: The changelog would need to be updated with the commit hash after this PR gets merged in.

codecov[bot] commented 3 years ago

Codecov Report

Merging #338 (70829c0) into develop (de767c3) will not change coverage. The diff coverage is 100.00%.

@@           Coverage Diff            @@
##           develop     #338   +/-   ##
========================================
  Coverage    98.17%   98.17%           
========================================
  Files           10       10           
  Lines         1205     1205           
========================================
  Hits          1183     1183           
  Misses          22       22

Impacted Files	Coverage Δ
pdfplumber/table.py	`100.00% <100.00%> (ø)`

Continue to review full report at Codecov.

Legend - Click here to learn more Δ = absolute <relative> (impact), ø = not affected, ? = missing data Powered by Codecov. Last update de767c3...a17c5a7. Read the comment docs.

samkit-jain commented 3 years ago

Thanks for the review @jsvine How about we add a new argument table_direction that controls whether the tables are read leftmost (default) or topmost? This would ensure that the change does not break any integrations.

jsvine commented 3 years ago

My initial instinct is not to add this argument. My hunch is that this change will affect very few integrations, and that there is (and will remain) a fairly easy way to customize the table order, by using page.find_tables(...) instead of page.extract_tables(...). E.g.,:

tables = sorted(my_page.find_tables(), key=lambda x: x.bbox)
extracted = [ t.extract() for t in tables ]

That said, I'm open to being persuaded otherwise. Thoughts?

samkit-jain commented 3 years ago

@jsvine I agree. The user has control over the sorting using the .find_tables(...) method which makes adding a new argument redundant.