ropensci / tabulapdf

Bindings for Tabula PDF Table Extractor Library
https://docs.ropensci.org/tabulapdf/
Apache License 2.0

How can I get information from checkboxes in tables? #165

Open ddotta opened 4 months ago

ddotta commented 4 months ago

Question

I'm trying to extract data from a PDF document that contains tables with checkboxes (see my reproducible example below).

The extract_tables() function works well and correctly identifies the tables in the document, but every checkbox cell comes back as NA.
Is there any way to identify which boxes are checked? Many thanks for your help! 🙏

Reproducible example

Here's my PDF: test.pdf

And my code:

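library(tabulapdf)

# Path to the PDF that contains the tables with checkboxes
fichier <- "test.pdf"
# Extract every detected table; the result is a list with one tibble per table
tableaux <- extract_tables(fichier, output = "tibble")

bases_de_conjoncture <- tableaux[[1]]
sources <- tableaux[[2]]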

What I get:

# A tibble: 33 × 3
   `CERISE (Espace de Production des données)`                                 ...2       ...3          
   <chr>                                                                       <chr>      <chr>         
 1 Préciser ci-dessous la liste des sources statistiques (cf. liste sur GEDSI) NA         NA            
 2 Rubrique Source                                                             Producteur Chargé d'étude
 3 000_Referentiels                                                            NA         NA            
 4 0010_Balsa_IAA                                                              NA         NA            
 5 0020_Balsa_EA                                                               NA         NA            
 6 0030_Balsa_v2_EA                                                            NA         NA            
 7 0040_Geo                                                                    NA         NA            
 8 0050_BDNU                                                                   NA         NA            
 9 010_Territoires                                                             NA         NA            
10 1010_Enquete_TERUTI                                                         NA         NA            
11 020_Meteorologie                                                            NA         NA            
12 2010_Conj_meteo                                                             NA         NA            
13 030_Structures_exploitations                                                NA         NA            
14 3010_Enquetes_Structures                                                    NA         NA            
15 3020_Recensements                                                           NA         NA            
16 040_Pratiques_agricoles                                                     NA         NA            
17 4000_Pratiques_Culturales                                                   NA         NA            
18 4010_Pratiques_grandes_cultures                                             NA         NA            
19 4040_Pratiques_arboriculture                                                NA         NA            
20 4050_Pratiques_elevage                                                      NA         NA            
21 4060_Conso_energie_EA                                                       NA         NA            
22 4070_Conso_energie_EDT_CUMA                                                 NA         NA            
23 050_Productions_vegetales                                                   NA         NA            
24 5010_Terres_labourables                                                     NA         NA            
25 5030_Conj_Prairies                                                          NA         NA            
26 5040_Conj_viticole                                                          NA         NA            
27 5050_Conj_fruits                                                            NA         NA            
28 5060_Conj_legumes                                                           NA         NA            
29 060_Productions_viandes_oeufs                                               NA         NA            
30 6010_Enquetes_cheptels                                                      NA         NA            
31 6020_Abattage_gros_animaux                                                  NA         NA            
32 6030_Abattage_volailles_lapins                                              NA         NA            
33 6035_Abattages                                                              NA         NA

ddotta commented 4 months ago

I managed to do what I wanted with pdftools::pdf_text() and some extra post-processing.

It would be very useful if this could be implemented directly in extract_tables().

pachadotdev commented 4 months ago

Hi @ddotta, thanks for reporting this. How did you manage to do it?

ddotta commented 4 months ago

@pachadotdev Here's a solution. It's not very optimized, but it does what I want: https://gist.github.com/ddotta/8e828145355bb87e78d83191b747b2e0
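
For readers who cannot open the gist, here is a minimal sketch of the general pdftools::pdf_text() idea mentioned above. It is not the gist's code. It assumes the checkboxes are stored in the PDF's text layer as Unicode glyphs (U+2610 for an empty box, U+2611 or U+2612 for a checked one); boxes drawn as vector shapes or stored as form fields would not be found this way. The helper name parse_checkbox_lines() and the sample lines are hypothetical, made up for illustration, and do not come from test.pdf.

```r
# Hypothetical helper: turn lines of PDF text into a table of checkbox states.
# Assumption: boxes appear as Unicode glyphs in the text layer.
parse_checkbox_lines <- function(lines,
                                 checked = c("\u2612", "\u2611"),
                                 unchecked = "\u2610") {
  # Alternation of all known box glyphs
  pattern <- paste(c(checked, unchecked), collapse = "|")

  # Keep only the lines that contain at least one box glyph
  lines <- lines[grepl(pattern, lines)]
  if (length(lines) == 0) {
    return(data.frame(label = character(0), stringsAsFactors = FALSE))
  }

  # Row label: everything before the first glyph
  labels <- trimws(sub(paste0("(", pattern, ").*$"), "", lines))

  # All glyphs of each line, in reading order
  found <- regmatches(lines, gregexpr(pattern, lines))
  n_boxes <- max(lengths(found))

  # TRUE = checked, FALSE = unchecked, NA = the line has fewer boxes
  states <- do.call(rbind, lapply(found, function(g) {
    c(g %in% checked, rep(NA, n_boxes - length(g)))
  }))
  colnames(states) <- paste0("box_", seq_len(n_boxes))

  data.frame(label = labels, states, stringsAsFactors = FALSE)
}

# Made-up sample lines shaped like pdf_text() output split on newlines
sample_lines <- c(
  "Rubrique Source",
  "0010_Balsa_IAA      \u2612      \u2610",
  "0020_Balsa_EA       \u2610      \u2611"
)
res <- parse_checkbox_lines(sample_lines)
print(res)

# With a real file: pdf_text() returns one string per page
if (requireNamespace("pdftools", quietly = TRUE) && file.exists("test.pdf")) {
  pages <- pdftools::pdf_text("test.pdf")
  pdf_lines <- unlist(strsplit(pages, "\n", fixed = TRUE))
  print(parse_checkbox_lines(pdf_lines))
}
```

The box_1, box_2, ... columns follow the left-to-right order of the glyphs on each line, so they would still have to be matched to the table's real column headers by hand. If the real file yields no rows, the boxes are probably not encoded as these glyphs and a different approach is needed.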