jsvine / pdfplumber

Plumb a PDF for detailed information about each char, rectangle, line, et cetera — and easily extract text and tables.
MIT License

Discussion - Better Table Extraction on "text" Strategy #238

Closed BenJacobs1 closed 4 years ago

BenJacobs1 commented 4 years ago

Hi pals

I was going through the table-finding procedure (while debugging an extract_table call that didn't return the correct table) with "vertical_strategy": "text", "horizontal_strategy": "text", and wondered whether it would be better to do it by sentences rather than by words.

The core idea is that in tables, the unit of text in which we look for patterns is not a word but a sentence. If we ran the x & y clustering (when looking for edges) on sentences instead of words, we would get much cleaner results and the table would make more sense.

Since we are talking about tables, I believe a simple definition of "sentence" would be a spatial lookup: if there is another word to the left or right (or above or below, depending on the text direction) within a reasonable distance, we concatenate it to the sentence.
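
To make the spatial-lookup idea concrete, here is a minimal sketch for left-to-right text only. It assumes words are dicts with "text", "x0", "x1", "top" and "bottom" keys, like the ones extract_words() returns. The function name group_words_into_phrases and the x_gap / y_tolerance thresholds are made up for illustration and are not part of pdfplumber.

```python
def group_words_into_phrases(words, x_gap=5.0, y_tolerance=3.0):
    """Merge horizontally adjacent words on the same line into phrases.

    Hypothetical helper: `x_gap` and `y_tolerance` are illustrative
    thresholds (in PDF points), not pdfplumber settings.
    """
    # Step 1: bucket words into lines by comparing each word's top
    # to the top of the first word in the current line.
    lines = []
    for word in sorted(words, key=lambda w: w["top"]):
        if lines and abs(word["top"] - lines[-1][0]["top"]) <= y_tolerance:
            lines[-1].append(word)
        else:
            lines.append([word])

    # Step 2: walk each line left to right and merge close neighbours.
    phrases = []
    for line in lines:
        current = None
        for word in sorted(line, key=lambda w: w["x0"]):
            if current is not None and word["x0"] - current["x1"] <= x_gap:
                # Close enough: extend the current phrase and its bounding box.
                current["text"] += " " + word["text"]
                current["x1"] = max(current["x1"], word["x1"])
                current["top"] = min(current["top"], word["top"])
                current["bottom"] = max(current["bottom"], word["bottom"])
            else:
                # Too far away (or first word in the line): start a new phrase.
                current = dict(word)
                phrases.append(current)
    return phrases


words = [
    {"text": "First", "x0": 10, "x1": 30, "top": 100, "bottom": 110},
    {"text": "name", "x0": 33, "x1": 55, "top": 100.5, "bottom": 110.5},
    {"text": "Age", "x0": 120, "x1": 140, "top": 100, "bottom": 110},
    {"text": "Ben", "x0": 10, "x1": 28, "top": 120, "bottom": 130},
    {"text": "42", "x0": 120, "x1": 130, "top": 120, "bottom": 130},
]
phrases = group_words_into_phrases(words)
print([p["text"] for p in phrases])
```

Because the words are sorted once and then scanned linearly, the cost stays around O(n log n) instead of comparing every pair of words. The resulting phrases keep the same bounding-box keys as the input words, so in principle they could be fed to the same edge clustering that currently runs on words.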

There are 2 implementation issues I'm still not sure about:

1. How to implement the spatial lookup without degrading runtime performance.
2. How to tell the text direction, because I believe trying to find a left-to-right sentence in vertical text would be wrong.
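
For the text-direction question, one possible heuristic is a simple majority vote over the words, sketched below. It assumes each word dict may carry the "upright" flag that pdfplumber reports, and falls back to comparing the bounding box's width and height when the flag is missing. The function name guess_direction and the fallback rule are my own assumptions, not something pdfplumber provides.

```python
def guess_direction(words):
    """Guess whether text runs horizontally or vertically.

    Hypothetical heuristic: each word votes "horizontal" if it is flagged
    upright, or (when the flag is missing) if its box is at least as wide
    as it is tall. Ties and empty input default to "horizontal".
    """
    votes = 0
    for w in words:
        if "upright" in w:
            horizontal = bool(w["upright"])
        else:
            width = w["x1"] - w["x0"]
            height = w["bottom"] - w["top"]
            horizontal = width >= height
        votes += 1 if horizontal else -1
    return "horizontal" if votes >= 0 else "vertical"


horizontal_words = [
    {"text": "Total", "x0": 10, "x1": 40, "top": 100, "bottom": 110},
    {"text": "cost", "x0": 43, "x1": 65, "top": 100, "bottom": 110},
]
vertical_words = [
    {"text": "Year", "x0": 10, "x1": 20, "top": 100, "bottom": 140},
    {"text": "Month", "x0": 30, "x1": 40, "top": 100, "bottom": 150},
]
print(guess_direction(horizontal_words), guess_direction(vertical_words))
```

The result could then decide which axis to use for the neighbour lookup: merge along x for horizontal text and along y for vertical text. A page that mixes both directions would probably need the vote per region rather than per page.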