ropensci / pdftools

Text Extraction, Rendering and Converting of PDF Documents

Can pdftools distinguish between radio and checkbox entries on a fillable form? #129

Open ibecav opened 5 months ago

ibecav commented 5 months ago

The package has worked extremely well for processing "traditional" non-fillable forms -- thank you.

In my first attempts at using it with "fillable forms" I can't seem to find a way to distinguish between radio buttons or checkboxes that are selected and those that are not. I'm not sure if I'm missing some nuance, making a complete mistake, or whether the functions don't support it?

An example "blank" original form is at. For the reprex below I am focusing on a small segment of the form on page 1, which I have included as screenshots in its original state and after filling out and saving a few entries.

[Screenshots: example_filled_form_segment, original_blank_segment]

I would like to know if there is a way to distinguish the fact that "Long Term Care" is selected in the filled out form versus not selected in the original?

Thank you in advance. Below is what I hope is a reprex that will help. Since I could not find an easy, safe place to post the example filled-out form, I used dput() to put the resulting data in the reprex. Users can grab the original and save changes to their local filesystem if desired.

library(dplyr)
library(arsenal)
## Not sure if poppler version matters?
library(pdftools)
#> Using poppler version 23.04.0
## Download and save the original form as original.pdf
## Let's use just the first page for the reprex
## Using pdf_data() for the convenience of having a tibble
## Same problem if I use pdf_text
original_pageone <- pdf_data("original.pdf")[[1]]

original_pageone_segment <-
  original_pageone %>% 
  filter(y >= 229, y <= 290)

# no obvious errors but difficult to see the radio button
# "text" in RStudio console

# original_pageone_segment %>% print(n = Inf)

# Fill in the form with some data.  It works and I can see
# traditional text such as "1234" and "5678" I entered on the form
# filled_pageone <- pdf_data("example_filled_form.pdf")[[1]]

# use dput to capture the resulting tibble for the reprex 
# filled_pageone %>% 
#   filter(y >= 229, y <= 290) %>% dput()

filled_pageone_segment <-
  structure(list(width = c(28L, 18L, 41L, 13L, 53L, 19L, 16L, 49L, 
                           8L, 13L, 18L, 8L, 32L, 7L, 22L, 17L, 31L, 3L, 26L, 25L, 31L, 
                           7L, 40L, 17L, 7L, 90L, 17L, 7L, 22L, 32L, 8L, 48L, 17L, 17L, 
                           28L, 8L, 8L, 48L, 17L), 
                 height = c(11L, 11L, 11L, 11L, 11L, 11L, 
                            11L, 11L, 11L, 11L, 11L, 11L, 11L, 9L, 11L, 11L, 11L, 11L, 11L, 
                            11L, 11L, 9L, 11L, 11L, 9L, 11L, 11L, 9L, 11L, 11L, 11L, 11L, 
                            7L, 11L, 11L, 11L, 11L, 11L, 7L), 
                 x = c(31L, 61L, 81L, 125L, 
                       140L, 195L, 217L, 31L, 82L, 92L, 108L, 128L, 138L, 37L, 49L, 
                       73L, 92L, 126L, 131L, 159L, 186L, 37L, 49L, 91L, 37L, 49L, 142L, 
                       37L, 49L, 73L, 395L, 406L, 459L, 275L, 294L, 325L, 335L, 346L, 
                       399L), 
                 y = c(229L, 229L, 229L, 229L, 229L, 229L, 229L, 240L, 
                       240L, 240L, 240L, 240L, 240L, 255L, 254L, 254L, 254L, 254L, 254L, 
                       254L, 254L, 267L, 266L, 266L, 278L, 278L, 278L, 290L, 290L, 290L, 
                       229L, 229L, 230L, 248L, 248L, 248L, 249L, 249L, 250L), 
                 space = c(TRUE, 
                           TRUE, TRUE, TRUE, TRUE, TRUE, FALSE, TRUE, TRUE, TRUE, TRUE, 
                           TRUE, FALSE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, FALSE, 
                           TRUE, TRUE, FALSE, TRUE, TRUE, FALSE, TRUE, TRUE, FALSE, TRUE, 
                           TRUE, FALSE, TRUE, TRUE, TRUE, TRUE, TRUE, FALSE), 
                 text = c("Facility", 
                          "type", "(Complete", "the", "demographic", "form", "that", "corresponds", 
                          "to", "the", "type", "of", "facility):", "●", "Acute", "Care", 
                          "Hospital", "/", "Critical", "Access", "Hospital", "●", "Long-term", 
                          "Care", "●", "Outpatient/Ambulatory", "Care", "●", "Other", 
                          "(specify):", "(if", "applicable):", "1234", "CMS", "Facility", 
                          "ID", "(if", "applicable):", "5678")), 
            class = c("tbl_df", "tbl", "data.frame"), 
            row.names = c(NA, -39L))

## Use arsenal to compare tibbles in detail
summary(comparedf(original_pageone_segment, filled_pageone_segment, by = c("x", "y")))
#> Table: Summary of data.frames
#> version   arg                         ncol   nrow
#> --------  -------------------------  -----  -----
#> x         original_pageone_segment       6     37
#> y         filled_pageone_segment         6     39
#> Table: Summary of overall comparison
#> statistic                                                      value
#> ------------------------------------------------------------  ------
#> Number of by-variables                                             2
#> Number of non-by variables in common                               4
#> Number of variables compared                                       4
#> Number of variables in x but not y                                 0
#> Number of variables in y but not x                                 0
#> Number of variables compared with some values unequal              1
#> Number of variables compared with all values equal                 3
#> Number of observations in common                                  37
#> Number of observations in x but not y                              0
#> Number of observations in y but not x                              2
#> Number of observations with some compared variables unequal        2
#> Number of observations with all compared variables equal          35
#> Number of values unequal                                           2
#> Table: Variables not shared
#>  ------------------------
#>  No variables not shared 
#>  ------------------------
#> Table: Other variables not compared
#>  --------------------------------
#>  No other variables not compared 
#>  --------------------------------
#> Table: Observations not shared
#> version      x     y   observation
#> --------  ----  ----  ------------
#> y          399   250            39
#> y          459   230            33
#> Table: Differences detected by variable
#> var.x    var.y      n   NAs
#> -------  -------  ---  ----
#> width    width      0     0
#> height   height     0     0
#> space    space      2     0
#> text     text       0     0
#> Table: Differences detected
#> var.x   var.y      x     y  values.x   values.y    row.x   row.y
#> ------  ------  ----  ----  ---------  ---------  ------  ------
#> space   space    346   249  FALSE      TRUE           37      38
#> space   space    406   229  FALSE      TRUE           32      32
#> Table: Non-identical attributes
#>  ----------------------------
#>  No non-identical attributes 
#>  ----------------------------

#> R version 4.3.2 (2023-10-31)
#> Platform: aarch64-apple-darwin20 (64-bit)
#> Running under: macOS Sonoma 14.2.1
#> Matrix products: default
#> BLAS:   /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRblas.0.dylib 
#> LAPACK: /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRlapack.dylib;  LAPACK version 3.11.0
#> locale:
#> [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
#> time zone: America/New_York
#> tzcode source: internal
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> other attached packages:
#> [1] pdftools_3.4.0 arsenal_3.6.3  dplyr_1.1.4   
#> loaded via a namespace (and not attached):
#>  [1] vctrs_0.6.5       cli_3.6.2         knitr_1.45        rlang_1.1.3      
#>  [5] xfun_0.42         purrr_1.0.2       styler_1.10.2     generics_0.1.3   
#>  [9] glue_1.7.0        askpass_1.2.0     qpdf_1.3.2        htmltools_0.5.7  
#> [13] fansi_1.0.6       rmarkdown_2.25    R.cache_0.16.0    tibble_3.2.1     
#> [17] evaluate_0.23     fastmap_1.1.1     yaml_2.3.8        lifecycle_1.0.4  
#> [21] compiler_4.3.2    fs_1.6.3          Rcpp_1.0.12       pkgconfig_2.0.3  
#> [25] rstudioapi_0.15.0 R.oo_1.26.0       R.utils_2.12.3    digest_0.6.34    
#> [29] R6_2.5.1          tidyselect_1.2.0  utf8_1.2.4        reprex_2.1.0     
#> [33] pillar_1.9.0      magrittr_2.0.3    R.methodsS3_1.8.2 tools_4.3.2      
#> [37] withr_3.0.0

Created on 2024-03-29 with reprex v2.1.0
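
The comparison above can also be sketched with plain dplyr joins. This is only a minimal illustration, assuming small toy data frames (`original_toy` and `filled_toy` are hypothetical names) whose values are copied from the "Long-term Care" row and the two typed entries in the dput() output. It shows the same thing the comparedf() summary reports: the typed text appears as new rows, while the "●" glyph next to "Long-term Care" is identical in both extractions.

```r
library(dplyr)

# Toy stand-ins for the two page segments (hypothetical names).
# "\u25cf" is the "●" glyph that pdf_data() returns for each radio button.
original_toy <- data.frame(
  x    = c(37L, 49L, 91L),
  y    = c(267L, 266L, 266L),
  text = c("\u25cf", "Long-term", "Care")
)
filled_toy <- data.frame(
  x    = c(37L, 49L, 91L, 459L, 399L),
  y    = c(267L, 266L, 266L, 230L, 250L),
  text = c("\u25cf", "Long-term", "Care", "1234", "5678")
)

# Rows present in the filled form but absent (by position) from the original:
# only the typed field values show up here.
new_rows <- anti_join(filled_toy, original_toy, by = c("x", "y"))
print(new_rows)

# Rows at the same position in both: compare the extracted text side by side.
common <- inner_join(original_toy, filled_toy,
                     by = c("x", "y"), suffix = c(".orig", ".filled"))
same_text <- all(common$text.orig == common$text.filled)
print(same_text)
```

In this sketch the radio button glyph carries no information about whether the option is selected, which is the crux of the question.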