Page 2 of this PDF has a complex layout with two tables on a single page. Right now the best result comes from setting the algorithm to `header-position`, but it seems it still needs to be extended to accommodate the odd table format.
import rows
tables = rows.import_from_pdf(
"Ibama.pdf",
page_numbers=[2],
algorithm="header-position", # `rects-boundaries` does not work and `y-groups` mixes header with entries
backend="pymupdf", # `pymupdf` yields the best results
)
row = tables[0]._asdict().keys()
dict_keys(
[
'praiadocarroquebrado',
'barradesantoantonio',
'field_2019_09_18',
'al',
'field_09203008s_35265532w_2019_10_21',
'oleada_manchas',
'barradoriocamaratuba',
'mataraca',
'field_2019_09_07',
'pb',
'field_06353346s_34575812w_2019_10_04',
'oleo_naoobservado',
'name',
'municipio',
'data_avist_estado_latitude',
'longitude',
'data_revis_status',
'praiadocabobranco',
'joaopessoa',
'field_2019_09_01',
'pb_2',
'field_07084334s_34483384w_2019_10_01',
'oleo_naoobservado_2'
]
)
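To make the problem in that output easier to see, here is a minimal sketch that separates the field names that look like real column titles from the ones that look like data cells absorbed into the header. It only post-processes the key list printed above and does not call `rows`. The `EXPECTED_HEADER` set, the `split_header_fields` helper and the assumption of six cells per data row are mine, guessed from the output, not something the library provides.

```python
# Field names copied from the dict_keys output above.
field_names = [
    "praiadocarroquebrado", "barradesantoantonio", "field_2019_09_18", "al",
    "field_09203008s_35265532w_2019_10_21", "oleada_manchas",
    "barradoriocamaratuba", "mataraca", "field_2019_09_07", "pb",
    "field_06353346s_34575812w_2019_10_04", "oleo_naoobservado",
    "name", "municipio", "data_avist_estado_latitude", "longitude",
    "data_revis_status",
    "praiadocabobranco", "joaopessoa", "field_2019_09_01", "pb_2",
    "field_07084334s_34483384w_2019_10_01", "oleo_naoobservado_2",
]

# Assumption: these five names are the real (partly merged) column titles.
EXPECTED_HEADER = {
    "name", "municipio", "data_avist_estado_latitude", "longitude",
    "data_revis_status",
}

# Assumption: each absorbed data row contributed six cells to the header.
CELLS_PER_ROW = 6


def split_header_fields(names, expected):
    """Return (header_fields, leaked_fields), preserving the original order."""
    header = [name for name in names if name in expected]
    leaked = [name for name in names if name not in expected]
    return header, leaked


header, leaked = split_header_fields(field_names, EXPECTED_HEADER)

# Group the leaked cells back into the data rows they seem to come from.
leaked_rows = [
    leaked[i:i + CELLS_PER_ROW] for i in range(0, len(leaked), CELLS_PER_ROW)
]

print("header fields:", header)
for leaked_row in leaked_rows:
    print("leaked row:", leaked_row)
```

If the grouping is right, three data rows ended up inside the detected header, two from above the real header line and one from below it, which is consistent with `header-position` getting confused by the two-table layout.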
I'm kind of jealous of R for the first time, because this operation is a one-liner with tabulizer ;-p
I'll look into extending `header-position`, but if that is something that should always be handled on the user side, feel free to just close this issue.