rstudio / pointblank

Data quality assessment and metadata reporting for data frames and database tables
https://rstudio.github.io/pointblank/
Other
858 stars 53 forks source link

`yaml_write()` fails when `active` argument is a function that returns FALSE #355

Open emilyriederer opened 2 years ago

emilyriederer commented 2 years ago

Prework

Description

The yaml_write() function does not appear to work when active is provided an anonymous function the returns FALSE.

I suspect #345 could be related since it covers another edge case where write_yaml() fails a receiving a function versus a value as input. However, I thought this seemed like a distinct enough problem to merit its own issue.

The broader context for this problem is:

Reproducible example

library(pointblank)

# no `active` doesn't interrogate ----
agent1 <-
  create_agent(read_fn = ~data.frame(IND_A = 1, AMT_B = 2)) %>%
  col_vals_not_null(starts_with("ID"), step_id = 1) %>%
  col_vals_not_null(starts_with("IND"), step_id = 2)

res1 <- interrogate(agent1)
#> Error: Only strings can be converted to symbols
# `active` (FALSE function) does interrogate; doesn't write yaml ----
agent2 <-
  create_agent(read_fn = ~data.frame(IND_A = 1, AMT_B = 2)) %>%
  col_vals_not_null(
    starts_with("ID"), step_id = 1,
    active = ~. %>% {length(starts_with("ID", vars = colnames(.))) > 0} ) %>%
  col_vals_not_null(starts_with("IND"), step_id = 2)

res2 <- interrogate(agent2)
#> i Step 1 is not set as active. Skipping.
yaml_write(agent2, filename = "pb-test-2.yml", path = tempdir())
#> Error in if (step_list$column[[1]] == step_list$columns_expr) {: missing value where TRUE/FALSE needed
# `active` (TRUE function) does interrogate; writes yaml ----
agent3 <-
  create_agent(read_fn = ~data.frame(IND_A = 1, AMT_B = 2)) %>%
  col_vals_not_null(
    starts_with("AMT"), step_id = 1,
    active = ~. %>% {length(starts_with("AMT", vars = colnames(.))) > 0} ) %>%
  col_vals_not_null(starts_with("IND"), step_id = 2)

res3 <- interrogate(agent3)

yaml_write(agent3, filename = "pb-test-3.yml", path = tempdir())
#> v The agent YAML file has been written to `C:\Users\emily\AppData\Local\Temp\RtmpKIp4r6/pb-test-3.yml`
readLines(file.path(tempdir(), "pb-test-3.yml"))
#>  [1] "type: agent"                                                   
#>  [2] "read_fn: ~data.frame(IND_A = 1, AMT_B = 2)"                    
#>  [3] "tbl_name: ~"                                                   
#>  [4] "label: '[2021-09-17|08:13:32]'"                                
#>  [5] "lang: en"                                                      
#>  [6] "locale: en"                                                    
#>  [7] "steps:"                                                        
#>  [8] "- col_vals_not_null:"                                          
#>  [9] "    columns: starts_with(\"AMT\")"                             
#> [10] "    active: |-"                                                
#> [11] "      ~. %>% {"                                                
#> [12] "          length(starts_with(\"AMT\", vars = colnames(.))) > 0"
#> [13] "      }"                                                       
#> [14] "- col_vals_not_null:"                                          
#> [15] "    columns: starts_with(\"IND\")"
# `active` (value not function) with cols present does interrogate; writes yaml ----
agent4 <-
  create_agent(read_fn = ~data.frame(IND_A = 1, AMT_B = 2)) %>%
  col_vals_not_null(
    starts_with("AMT"), step_id = 1, active = FALSE) %>%
  col_vals_not_null(starts_with("IND"), step_id = 2)

res4 <- interrogate(agent4)
#> i Step 1 is not set as active. Skipping.
yaml_write(agent4, filename = "pb-test-4.yml", path = tempdir())
#> v The agent YAML file has been written to `C:\Users\emily\AppData\Local\Temp\RtmpKIp4r6/pb-test-4.yml`
readLines(file.path(tempdir(), "pb-test-4.yml"))
#>  [1] "type: agent"                               
#>  [2] "read_fn: ~data.frame(IND_A = 1, AMT_B = 2)"
#>  [3] "tbl_name: ~"                               
#>  [4] "label: '[2021-09-17|08:13:32]'"            
#>  [5] "lang: en"                                  
#>  [6] "locale: en"                                
#>  [7] "steps:"                                    
#>  [8] "- col_vals_not_null:"                      
#>  [9] "    columns: starts_with(\"AMT\")"         
#> [10] "    active: false"                         
#> [11] "- col_vals_not_null:"                      
#> [12] "    columns: starts_with(\"IND\")"

Created on 2021-09-17 by the reprex package (v0.3.0)

Session info ``` r devtools::session_info() ``` #> Error in get(genname, envir = envir) : object 'testthat_print' not found #> - Session info --------------------------------------------------------------- #> setting value #> version R version 4.0.2 (2020-06-22) #> os Windows 10 x64 #> system x86_64, mingw32 #> ui RTerm #> language (EN) #> collate English_United States.1252 #> ctype English_United States.1252 #> tz America/Chicago #> date 2021-09-17 #> #> - Packages ------------------------------------------------------------------- #> package * version date lib #> assertthat 0.2.1 2019-03-21 [1] #> backports 1.1.7 2020-05-13 [1] #> blastula 0.3.2 2020-05-19 [1] #> callr 3.4.3 2020-03-28 [1] #> cli 2.5.0 2021-04-26 [1] #> crayon 1.3.4 2017-09-16 [1] #> DBI 1.1.0 2019-12-15 [1] #> desc 1.2.0 2018-05-01 [1] #> devtools 2.3.1 2020-07-21 [1] #> digest 0.6.27 2020-10-24 [1] #> dplyr 1.0.7 2021-06-18 [1] #> ellipsis 0.3.2 2021-04-29 [1] #> evaluate 0.14 2019-05-28 [1] #> fansi 0.4.1 2020-01-08 [1] #> fs 1.5.0 2020-07-31 [1] #> generics 0.1.0 2020-10-31 [1] #> glue 1.4.2 2020-08-27 [1] #> highr 0.8 2019-03-20 [1] #> htmltools 0.5.1.1 2021-01-22 [1] #> knitr 1.33.8 2021-08-08 [1] #> lifecycle 1.0.0 2021-02-15 [1] #> magrittr 2.0.1 2020-11-17 [1] #> memoise 1.1.0 2017-04-21 [1] #> pillar 1.6.2 2021-07-29 [1] #> pkgbuild 1.1.0 2020-07-13 [1] #> pkgconfig 2.0.3 2019-09-22 [1] #> pkgload 1.1.0 2020-05-29 [1] #> pointblank * 0.8.0.9000 2021-09-17 [1] #> prettyunits 1.1.1 2020-01-24 [1] #> processx 3.4.3 2020-07-05 [1] #> ps 1.3.4 2020-08-11 [1] #> purrr 0.3.4 2020-04-17 [1] #> R6 2.5.0 2020-10-28 [1] #> remotes 2.2.0 2020-07-21 [1] #> rlang 0.4.11 2021-04-30 [1] #> rmarkdown 2.8 2021-05-07 [1] #> rprojroot 1.3-2 2018-01-03 [1] #> rstudioapi 0.13 2020-11-12 [1] #> sessioninfo 1.1.1 2018-11-05 [1] #> stringi 1.4.6 2020-02-17 [1] #> stringr 1.4.0 2019-02-10 [1] #> testthat 2.3.2 2020-03-02 [1] #> tibble 3.1.4 2021-08-25 [1] #> tidyselect 1.1.1 2021-04-30 [1] #> usethis 2.0.1 2021-02-10 [1] #> utf8 1.1.4 2018-05-24 [1] #> vctrs 0.3.8 2021-04-29 [1] #> withr 2.4.2 2021-04-18 [1] #> xfun 0.23 2021-05-15 [1] #> yaml 2.2.1 2020-02-01 [1] #> source #> CRAN (R 4.0.2) #> CRAN (R 4.0.0) #> CRAN (R 4.0.2) #> CRAN (R 4.0.2) #> CRAN (R 4.0.5) #> CRAN (R 4.0.2) #> CRAN (R 4.0.2) #> CRAN (R 4.0.3) #> CRAN (R 4.0.2) #> CRAN (R 4.0.3) #> CRAN (R 4.0.5) #> CRAN (R 4.0.5) #> CRAN (R 4.0.2) #> CRAN (R 4.0.2) #> CRAN (R 4.0.2) #> CRAN (R 4.0.3) #> CRAN (R 4.0.5) #> CRAN (R 4.0.2) #> CRAN (R 4.0.5) #> Github (yihui/knitr@55a2df9) #> CRAN (R 4.0.5) #> CRAN (R 4.0.3) #> CRAN (R 4.0.2) #> CRAN (R 4.0.5) #> CRAN (R 4.0.2) #> CRAN (R 4.0.2) #> CRAN (R 4.0.2) #> Github (rich-iannone/pointblank@40212d4) #> CRAN (R 4.0.2) #> CRAN (R 4.0.2) #> CRAN (R 4.0.2) #> CRAN (R 4.0.2) #> CRAN (R 4.0.5) #> CRAN (R 4.0.2) #> CRAN (R 4.0.5) #> CRAN (R 4.0.5) #> CRAN (R 4.0.2) #> CRAN (R 4.0.3) #> CRAN (R 4.0.2) #> CRAN (R 4.0.0) #> CRAN (R 4.0.2) #> CRAN (R 4.0.2) #> CRAN (R 4.0.5) #> CRAN (R 4.0.5) #> CRAN (R 4.0.5) #> CRAN (R 4.0.2) #> CRAN (R 4.0.5) #> CRAN (R 4.0.5) #> CRAN (R 4.0.5) #> CRAN (R 4.0.2) #> #> [1] C:/Users/emily/Documents/R/win-library/4.0 #> [2] C:/Program Files/R/R-4.0.2/library

title: reprex_reprex.R author: emily date: ‘2021-09-17’


Expected result

I would expect agent2 to behave the same was as agent3 when passed to write_yaml()

Session info

End the reproducible example with a call to sessionInfo() in the same session (e.g. reprex(si = TRUE)) and include the output.

rich-iannone commented 2 years ago

Thanks for discovering this bug! I will get to fixing it soon.

rich-iannone commented 2 years ago

Okay! Made a fix which I think is not too bad (skips a step when tidyselect expressions don't match any columns). Documentation currently doesn't state this is what happens (it definitely something to follow up on, if this is even the right approach). There may be unintended side effects but basic testing was added.

In fact, I'm going to re-open this so (1) you could weigh in on the changes and (2) there's more opportunities to find bugs, fix 'em, and add more tests.

emilyriederer commented 2 years ago

Wow - thanks so much, @rich-iannone ! That was so fast.

Overall, this definitely works great for my purposes. All four cases I list above work and produce reasonable interrogation output and YAML files. 🎉

The only thing I notice is cosmetic. For res1 and res2 where there is no match, the check shows up as follows in the interrogation output:

image

That's completely reasonable and not an issue, but I can imagine it might be mildly confusing to a non-developer to see a check with no columns in the output. I wonder if it might be helpful either the have the tidyselect conditions print where the column names should go and/or to suppress checks with no matching columns from the tabular results? 🤔 Again, that is a tiny aesthetic point and really not material either way.

Thanks again!

rich-iannone commented 2 years ago

Emily, this is great feedback (and of the sort of I was looking for)! I definitely want the reporting to be broadly understandable so I think that I can at least add some text stating that no columns were found (prior to other, better refinements).

You mentioned suppressing checks with no matching columns... does that mean excluding the step entirely from the resulting data and hence also the report (deleting instead of skipping)? There could be an option for this behavior available for that (probably in interrogate()?).

There are probably lots of little inconsistencies and possible design improvements for the reporting. Feel free to bring those to the fore by making one or several issues (I do want to improve this aspect of the package). Your feedback is super valuable and will be addressed!

emilyriederer commented 2 years ago

I'm honestly torn if I like the idea of checking it completely or not (arguing against my previous statement). On further reflection I can imagine two cases where I would definitely want the check to show even if not used:

So perhaps simple context as you said is better without suppression!

yjunechoe commented 5 months ago

Hi - just wanted to check in on this since I see it tagged for the next version milestone!

If I understand correctly, the initial issue about yaml writing seems to have been resolved. As for the remaining concern about the informativeness of the columns part of the report, the dev version's tidyselect overhaul now gives us the following reports (with some columns omitted for space) for the res1 and res2 examples. Note that the reprex has been slightly edited to keep the code up to date and avoid deprecation warnings.

agent1 <-
  create_agent(tbl = data.frame(IND_A = 1, AMT_B = 2)) %>%
  col_vals_not_null(starts_with("ID"), step_id = 1) %>%
  col_vals_not_null(starts_with("IND"), step_id = 2)
res1 <- interrogate(agent1)
res1

image

agent2 <-
  create_agent(tbl = data.frame(IND_A = 1, AMT_B = 2)) %>%
  col_vals_not_null(
    starts_with("ID"), step_id = 1,
    active = ~ . %>% has_columns(starts_with("ID"))
  ) %>%
  col_vals_not_null(starts_with("IND"), step_id = 2)
res2 <- interrogate(agent2)
res2

image

To summarize the relevant changes in the current dev version: 1) When tidyselect helpers are used to select columns but no columns are found, the agent report will show the original tidyselect expression (in this case, starts_with("ID"), though it's unfortunately cut off - I may revisit the font size). (#499) 2) The has_columns() helper now also supports tidyselect, which lets res2 express something like "validate that columns starting with ID do not have any missing values, if such columns exist". Even if this makes a step inactive, the attempted tidyselect expression is still shown in the report to provide some context for the skipping. (#500)

Please let me know how these changes feel for the examples! If there's any other concerns about the report, we can move it to a separate issue.

rich-iannone commented 5 months ago

@yjunechoe I don't find this all that bad! At least it doesn't fail in dev tidyselect and I suppose hovering over the starts_with(... text provides the rest of the string (though it's unformatted, it is at least informative). I think this is okay for the examples; the unfortunate aspect of that is that if the example output involves a screenshot of the table, those ones have to be regenerated.

BTW I've moved some of the items from the upcoming release milestone (mostly to do with overhaul work for scan_data() into the release after that). We could even punt on what's in the 2 items left in the upcoming release (I'm kinda thinking yeah on that one).

yjunechoe commented 5 months ago

Sounds good to me! I also don't have strong feelings on the remaining items (they seem isolated enough to get fixed in later patch versions). I only selfishly want to get a new release published but I don't mean to rush :)

Let me know if you need anything from my end!

rich-iannone commented 5 months ago

I also want to get the release out! I’ll push the remaining items to the next milestone and get this going.