turicas / rows

A common, beautiful interface to tabular data, no matter the format
GNU Lesser General Public License v3.0

Parse complicated two tables per page PDF #340

Open ocefpaf opened 5 years ago

ocefpaf commented 5 years ago

Page 2 of this PDF has a complex layout with two tables on a single page. Right now the best result comes from setting the algorithm to `header-position`, but it seems it still needs to be extended to accommodate the odd table format.

import rows

tables = rows.import_from_pdf(
    "Ibama.pdf",
    page_numbers=[2],
    algorithm="header-position", # `rects-boundaries` does not work and `y-groups` mixes header with entries
    backend="pymupdf",  # `pymupdf` yields the best results
)

# Each row is a namedtuple, so its keys are the field names that were detected.
field_names = tables[0]._asdict().keys()

dict_keys(
    [
        'praiadocarroquebrado',
        'barradesantoantonio',
        'field_2019_09_18',
        'al',
        'field_09203008s_35265532w_2019_10_21',
        'oleada_manchas',
        'barradoriocamaratuba',
        'mataraca',
        'field_2019_09_07',
        'pb',
        'field_06353346s_34575812w_2019_10_04',
        'oleo_naoobservado',
        'name',
        'municipio',
        'data_avist_estado_latitude',
        'longitude',
        'data_revis_status',
        'praiadocabobranco',
        'joaopessoa',
        'field_2019_09_01',
        'pb_2',
        'field_07084334s_34483384w_2019_10_01',
        'oleo_naoobservado_2'
    ]
)
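Reading the output above, several keys look like cell values rather than column names: names starting with a digit seem to get a `field_` prefix and duplicates a `_2` suffix (an inference from this output only, not checked against the rows source). A minimal sketch that flags such keys; `looks_like_data` is a hypothetical helper, not part of rows:

```python
import re

# Field names copied from the output above.
field_names = [
    "praiadocarroquebrado", "barradesantoantonio", "field_2019_09_18", "al",
    "field_09203008s_35265532w_2019_10_21", "oleada_manchas",
    "barradoriocamaratuba", "mataraca", "field_2019_09_07", "pb",
    "field_06353346s_34575812w_2019_10_04", "oleo_naoobservado",
    "name", "municipio", "data_avist_estado_latitude", "longitude",
    "data_revis_status",
    "praiadocabobranco", "joaopessoa", "field_2019_09_01", "pb_2",
    "field_07084334s_34483384w_2019_10_01", "oleo_naoobservado_2",
]

# Assumption: `field_` followed by a digit marks a value that began with a
# digit (dates, coordinates), and a trailing `_<number>` marks a de-duplicated
# repeat. Either one suggests a data cell that was absorbed into the header.
SUSPICIOUS = re.compile(r"^field_\d|_\d+$")


def looks_like_data(name):
    """Return True if a field name looks like a leaked data value."""
    return bool(SUSPICIOUS.search(name))


leaked = [name for name in field_names if looks_like_data(name)]
print(len(leaked), "of", len(field_names), "keys look like data values")
```

This only catches the obvious cases. Text cells such as `joaopessoa` or `al` are not flagged, so it is a quick diagnostic rather than a way to recover the real header.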

I'm kind of jealous of R for the first time, because this operation is a one-liner with tabulizer ;-p

I'll look into extending `header-position`, but if that kind of extension should always be left to the user, feel free to just close this issue.

turicas commented 5 years ago

Note: try with tabula-py: https://github.com/ocefpaf/oilmap
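A minimal sketch of what that attempt could look like with tabula-py's `read_pdf`, assuming the same `Ibama.pdf` and a working Java runtime (tabula-py wraps tabula-java). `extract_page_tables` is a hypothetical helper name, and whether lattice mode or the default guessing suits this page is untested here:

```python
def extract_page_tables(path, page, lattice=False):
    """Return the tables tabula-py finds on one page, as a list of DataFrames."""
    # Imported lazily so the helper can be defined without tabula-py installed.
    import tabula

    return tabula.read_pdf(
        path,
        pages=page,            # 1-based page number, as in the rows example
        multiple_tables=True,  # keep each detected table as its own DataFrame
        lattice=lattice,       # True forces ruling-line based detection
    )
```

Calling `extract_page_tables("Ibama.pdf", page=2)` and inspecting each returned DataFrame's `columns` would show whether the two tables on that page come out separately.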