turicas / rows

A common, beautiful interface to tabular data, no matter the format
GNU Lesser General Public License v3.0

PDF Plugin #50

Closed turicas closed 5 years ago

turicas commented 9 years ago

Create an algorithm to automatically extract tables from PDFs (the ones that contain actual text, not scanned images). We could use pdftables, but its code is out of date, it does not work with Python 3, etc.

arnaldorusso commented 7 years ago

Starting to read up on this and on the pdftables lib #pybr12

turicas commented 6 years ago

I'm working on it (without pdftables, since it's not maintained, is hard to install and requires numpy) and the results are promising! :)

claytonaalves commented 6 years ago

Waiting for it...

jeanprado commented 6 years ago

My university's restaurant has a pretty well-defined table, but the plugin doesn't extract the data properly. I've attached 2 examples.

The code I used: https://gist.github.com/jeanprado/ca735caa505aa1b91e2dfe500b8c2da0

pdf-examples-rows.zip

turicas commented 6 years ago

@jeanprado, thanks for the report! I'm not using the rectangles drawn on the page to determine which cell each text object belongs to. That would help in cases like yours, but I'm not sure it covers the general case (the idea is to identify the table even if the PDF contains no rectangles). I may add a setting to enable this behavior, good catch!

However, I've identified a problem in my code related to a specific case: when (for some reason) the code identifies only one column instead of two (as happens for "QUINTA-FEIRA" and "SEXTA-FEIRA" in arquivo.pdf), it adds only one object per cell, not two, so I need to fix it. It happened because the phrases "Pepino com tomate" and "Repolho com tomate" are in the same PDF object, so the algorithm thinks they belong to a single column, not two. You can see this in the following image:

[image: 03]
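To make the failure mode concrete, here is a minimal sketch of one way columns could be detected: grouping text objects whose horizontal extents overlap. This is not the actual code in rows; `TextObject`, `group_columns` and all coordinates are made up for illustration, and only the cell texts come from the example PDF.

```python
from typing import List, NamedTuple


class TextObject(NamedTuple):
    """Hypothetical stand-in for a PDF text object: text plus horizontal extent."""
    text: str
    x0: float
    x1: float


def group_columns(objects: List[TextObject]) -> List[List[TextObject]]:
    """Group objects whose horizontal extents overlap (transitively) into columns."""
    columns: List[List[TextObject]] = []
    right_edge = float("-inf")
    for obj in sorted(objects, key=lambda o: o.x0):
        if columns and obj.x0 <= right_edge:
            # Overlaps the current column: extend it.
            columns[-1].append(obj)
            right_edge = max(right_edge, obj.x1)
        else:
            # Starts to the right of everything seen so far: new column.
            columns.append([obj])
            right_edge = obj.x1
    return columns


# Assumed coordinates, chosen only to illustrate the two situations.
headers = [
    TextObject("QUINTA-FEIRA", 100, 180),
    TextObject("SEXTA-FEIRA", 200, 280),
]
separate = headers + [
    TextObject("Pepino com tomate", 100, 190),
    TextObject("Repolho com tomate", 200, 290),
]
merged = headers + [
    TextObject("Pepino com tomate Repolho com tomate", 100, 290),
]

print(len(group_columns(separate)))  # 2
print(len(group_columns(merged)))    # 1
```

With separate objects the two columns stay apart; a single wide object bridges the gap between them, so everything collapses into one column.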

Since in your case the contents of two cells are merged into a single object, I'm not sure what to do even if I use the rectangles to identify cells, because there's no simple way to "cut" the object's contents between the cells.
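A rough sketch of the rectangle idea, assuming each cell is drawn as a rectangle and each text object has a bounding box. `Box`, `overlap_area`, `cells_touched` and the coordinates are hypothetical, not part of rows. It shows why the merged object is a problem: it overlaps both cells, and nothing in the geometry says where to split its text.

```python
from typing import List, NamedTuple


class Box(NamedTuple):
    """Hypothetical axis-aligned bounding box (cell rectangle or text object)."""
    x0: float
    y0: float
    x1: float
    y1: float


def overlap_area(a: Box, b: Box) -> float:
    """Area of the intersection of two boxes (0 if they don't intersect)."""
    width = min(a.x1, b.x1) - max(a.x0, b.x0)
    height = min(a.y1, b.y1) - max(a.y0, b.y0)
    return max(width, 0.0) * max(height, 0.0)


def cells_touched(text_box: Box, cells: List[Box]) -> List[int]:
    """Indexes of the cell rectangles that a text object's box overlaps."""
    return [i for i, cell in enumerate(cells) if overlap_area(text_box, cell) > 0]


# Two side-by-side cells in the same row (assumed coordinates).
cells = [Box(95, 50, 195, 70), Box(195, 50, 295, 70)]

single = Box(100, 55, 190, 65)  # an object that fits inside the first cell
merged = Box(100, 55, 290, 65)  # an object spanning both cells

print(cells_touched(single, cells))  # [0]
print(cells_touched(merged, cells))  # [0, 1]
```

An object that touches exactly one cell can be assigned unambiguously; one that touches several would need some extra heuristic to divide its text.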

Note: the file you've created in the gist is a plain Python file, so it should use the .py extension, not .ipynb.

turicas commented 5 years ago

It's almost ready to merge, and the implementation is already working (I'm using it in several projects). I added 2 different PDF backends (pdfminer.six and pymupdf) and 3 different extraction algorithms. I still need to tweak the imports, check the tests and merge. Note: after I implemented this, a library called Camelot was created that does exactly this. I need to check how they do it so I can improve the current algorithms.

turicas commented 5 years ago

Done in ac156572d9009061b5c1baf1dabc9be0dfee78b6.