insightsengineering / teal

Exploratory Web Apps for Analyzing Clinical Trial Data
https://insightsengineering.github.io/teal/
Other
166 stars 32 forks source link

Remove `library` calls from `get_rcode_header` in favour of manual specification #950

Open gogonzo opened 8 months ago

gogonzo commented 8 months ago

In the preprocessing app developer can add library(pkg) calls in the eval_code/within call. Thanks to this code in qenv/teal_data will contain library calls, so will be more reproducible.

data <- teal_data() |> within({
  library(ggplot2)
  ggplot(iris, aes(x = Sepal.Length, y = Petal.Length)) + geom_point()
})

However, when sending data to the module another list of libraries is appended. We need to decide about who should be responsible for listing libraries used in a code (app developer) or (teal).

We need to find robust way to do manage this.

chlebowa commented 6 months ago

Questions:

  1. Do we want to add library calls for all packages that are attached at the end of the module or all packages that are loaded? I would go with the former.
  2. Perhaps we should still see what namespaces are loaded and raise a warning/error if they are not installed in the environment where the analysis is being reproduced?
  3. teal is most likely to be called before teal_data or teal_data_module is created and therefore before there is anything to do things within it. I take it library(teal) should be added on top of the @code slot?
m7pr commented 6 months ago

Do we want to add library calls for all packages that are attached at the end of the module or all packages that are loaded? I would go with the former.

I think for all attached. Even though there is :: in the code of the module, this does not guarantee packages are installed on the end user who asked for the reproducible code. We can also extend the list of returned libraires by calls that install packages for you, but I think we are considering having renv and .lock files?

Perhaps we should still see what namespaces are loaded and raise a warning/error if they are not installed in the environment where the analysis is being reproduced?

Isn't such warning gonna be presented on an app start? If someone created a code on one environment, but then releases it somewhere else, (s)he will see that he is not able to run the app, or one of the modules will fail?

teal is most likely to be called before teal_data or teal_data_module is created and therefore before there is anything to do things within it. I take it library(teal) should be added on top of the @code slot?

Is teal needed to reproduce the analysis? If we have a regular analysis, I think dplyr would be suffiecient. If we have any module that is attached, then it will be returned (like in i. - first point) and module will be dependend on teal, hence teal also will be loaded in the moment you run the code returned for the reproducibility?

chlebowa commented 6 months ago

Do we want to add library calls for all packages that are attached at the end of the module or all packages that are loaded? I would go with the former.

I think for all attached. Even though there is :: in the code of the module, this does not guarantee packages are installed on the end user who asked for the reproducible code. We can also extend the list of returned libraires by calls that install packages for you, but I think we are considering having renv and .lock files?

I don't think auto-installing is a good idea.


Perhaps we should still see what namespaces are loaded and raise a warning/error if they are not installed in the environment where the analysis is being reproduced?

Isn't such warning gonna be presented on an app start? If someone created a code on one environment, but then releases it somewhere else, (s)he will see that he is not able to run the app, or one of the modules will fail?

The reproducible code will not be run in an app, it will be taken from the report and run wherever.


teal is most likely to be called before teal_data or teal_data_module is created and therefore before there is anything to do things within it. I take it library(teal) should be added on top of the @code slot?

Is teal needed to reproduce the analysis? If we have a regular analysis, I think dplyr would be suffiecient. If we have any module that is attached, then it will be returned (like in i. - first point) and module will be dependend on teal, hence teal also will be loaded in the moment you run the code returned for the reproducibility?

Good point.

m7pr commented 6 months ago

Ok, so it's either a decision if we just leave a warning that some packages are missing, or we allow them to be installed, but in both cases we do put a code that checks if needed packages are installed.

I'm fine setting up a note, on which packages are missing and pasting a message to be copy-pasted on how to install them. How does that sound?

chlebowa commented 6 months ago

I think it's reasonable.

m7pr commented 6 months ago

On the main topic I don't see a reason why we do not allow to specify libraries in teal_data() |> within() block, as sometimes data processing requires extra packages. And I don't think we should require app developer or any body to understand which libraries are used in teal internals. To sum up, I think we should allow to include libraries in teal_data() |> within() block and also return libraries attached during teal app modules execution.

chlebowa commented 6 months ago

We do allow that, in fact we recommend it. We modified ALL examples to show that just this month.

m7pr commented 6 months ago

Alrighty, so what's left in here to decide (and potentially implement)?

chlebowa commented 6 months ago
  • āœ… we allow a user to specify libraries in within on the preprocessing
  • āœ… teal appends libraries with get_rcode_libraries

If these both hold true, there will be duplicates. get_rcode_libraries should be modified or abandoned altogether.

I am for putting the responsibility for library calls on the app dev. The possibility of modules modifying the search path should be addressed.

I still believe we should check for loaded namespaces and warn if they are not installed. Does anyone have any thoughts on this?

averissimo commented 6 months ago

Currently there is little possibility for duplicate library(...) calls due to https://github.com/insightsengineering/teal.data/issues/220, but once that is resolved I'm on the fence on the solution.

On one hand I'd like to leave this responsibility with the app developer. It's just a matter of moving the library around. However, we would need to change the current inheritance to prevent inheritance of parent.env(.GlobalEnv).

Right?

On the other hand, we can also remove duplicates library calls that exist on get_datasets_code and get_rcode_libraries (removing from the first as that'll be the subsquent calls).

This would require some regex magic, but it would be doable.

I still believe we should check for loaded namespaces and warn if they are not installed. Does anyone have any thoughts on this?

Agree and on it!! Otherwise it may end being frustrating to the user.

chlebowa commented 6 months ago

The duplicates are caused by get_rcode_libraries. The function will be removed and attaching packages will be up to the app dev.

The problem is get_rcode_libraries also adds library calls for loaded packages and that has to be handled differently.


EDIT: I the app dev decides to attach a package more than once, that's on them.

we can also remove duplicates library calls that exist on get_datasets_code and get_rcode_libraries (removing from the first as that'll be the subsquent calls)

I would say the opposite: keep the former, remove the latter. The purpose is to recreate the search path and packages that are on the path are not attached again.

averissimo commented 6 months ago

So the end goal here is to stop using get_rcode_libraries alltogether? That would remove those pesky {teal}-family includes that won't be relevant for reproducibility

I would say the opposite: keep the former, remove the latter. The purpose is to recreate the search path and packages that are on the path are not attached again.

Sure thing, I was considering keeping first occurence while trying to group all libraries together, but filtering out from there will be fine

chlebowa commented 6 months ago

So the end goal here is to stop using get_rcode_libraries alltogether? That would remove those pesky {teal}-family includes that won't be relevant for reproducibility

I believe so. As far as I know (this was before my time), get_rcode_libraries was introduced to create a semblance of a reproducible environment because - unlike with within - library calls were not explicitly included in preprocessing code.

averissimo commented 6 months ago

TODO: create a snippet that is added before library calls that verifies if all libraries are installed on the system that execute the code. if some are missing, create a string to be copy-pasted to install missing libraries.

This very much doable, but we will end up with {teal}.* and such in the "required" packages to have installed.

By "and such" it will have a bunch of weird dependencies not related with reproducible code (such as fontawesome)

image

AFAIK it's not possible to get the namespaces attached/loaded to a given environment. One way would be to track differences in loadedNamespaces in eval_code / within calls, but as shown below (šŸŸ ) it has caveats.

Q: Do you know of any way of doing this?

Toy example of tracking I think this is a bad idea, just showing here for the caveat marked with the orange circle in the comments (šŸŸ ) ``` r library(teal.data) #> Loading required package: teal.code qenv2 <- function() { q_env <- new.env(parent = emptyenv()) lockEnvironment(q_env, bindings = TRUE) methods::new("qenv", env = q_env) } pkgs <- \() vapply(utils::sessionInfo()$otherPkgs, base::`[[`, character(1L), "Package", USE.NAMES = FALSE) slot_update <- \(q_slot) { q_slot$namespace_slot <- union(q_slot$namespace_slot, setdiff(loadedNamespaces(), q_slot$previous_loaded)) q_slot$previous_loaded <- loadedNamespaces() |> sort() q_slot$attached_slot <- union(q_slot$attached_slot, setdiff(pkgs(), q_slot$previous_attached)) q_slot$previous_attached <- pkgs() |> sort() q_slot } q_slot <- structure( list( previous_loaded = loadedNamespaces() |> sort(), previous_attached = pkgs() |> sort(), attached_slot = character(0L), namespace_slot = character(0L) )#, class = c("q_slot", "list") ) # šŸŸ  Initial q_slot has `magrittr`, `rlang` and others that won't be detected in the differences # even if needed q_slot #> $previous_loaded #> [1] "backports" "base" "checkmate" "cli" "compiler" #> [6] "datasets" "digest" "evaluate" "fastmap" "fs" #> [11] "glue" "graphics" "grDevices" "htmltools" "knitr" #> [16] "lifecycle" "magrittr" "methods" "purrr" "R.cache" #> [21] "R.methodsS3" "R.oo" "R.utils" "reprex" "rlang" #> [26] "rmarkdown" "rstudioapi" "stats" "styler" "teal.code" #> [31] "teal.data" "tools" "utils" "vctrs" "withr" #> [36] "xfun" "yaml" #> #> $previous_attached #> [1] "teal.code" "teal.data" #> #> $attached_slot #> character(0) #> #> $namespace_slot #> character(0) q1 <- qenv() |> within({ aa <- 1 + 2 }) q_slot <- slot_update(q_slot) q_slot #> $previous_loaded #> [1] "backports" "base" "checkmate" "cli" "compiler" #> [6] "datasets" "digest" "evaluate" "fastmap" "fs" #> [11] "glue" "graphics" "grDevices" "htmltools" "knitr" #> [16] "lifecycle" "magrittr" "methods" "purrr" "R.cache" #> [21] "R.methodsS3" "R.oo" "R.utils" "reprex" "rlang" #> [26] "rmarkdown" "rstudioapi" "stats" "styler" "teal.code" #> [31] "teal.data" "tools" "utils" "vctrs" "withr" #> [36] "xfun" "yaml" #> #> $previous_attached #> [1] "teal.code" "teal.data" #> #> $attached_slot #> character(0) #> #> $namespace_slot #> character(0) q2 <- q1 |> within({ library(ggplot2) p <- ggplot(data = data.frame(x = 1:10, y = 1:10)) + geom_point(aes(x = x, y = y)) }) q_slot <- slot_update(q_slot) q_slot #> $previous_loaded #> [1] "backports" "base" "checkmate" "cli" "colorspace" #> [6] "compiler" "datasets" "digest" "dplyr" "evaluate" #> [11] "fansi" "fastmap" "fs" "generics" "ggplot2" #> [16] "glue" "graphics" "grDevices" "grid" "gtable" #> [21] "htmltools" "knitr" "lifecycle" "magrittr" "methods" #> [26] "munsell" "pillar" "pkgconfig" "purrr" "R.cache" #> [31] "R.methodsS3" "R.oo" "R.utils" "R6" "reprex" #> [36] "rlang" "rmarkdown" "rstudioapi" "scales" "stats" #> [41] "styler" "teal.code" "teal.data" "tibble" "tidyselect" #> [46] "tools" "utf8" "utils" "vctrs" "withr" #> [51] "xfun" "yaml" #> #> $previous_attached #> [1] "ggplot2" "teal.code" "teal.data" #> #> $attached_slot #> [1] "ggplot2" #> #> $namespace_slot #> [1] "gtable" "dplyr" "tidyselect" "scales" "ggplot2" #> [6] "R6" "generics" "tibble" "munsell" "pillar" #> [11] "utf8" "grid" "fansi" "colorspace" "pkgconfig" q3 <- q2 |> within({ f <- ggrepel::geom_label_repel }) q_slot <- slot_update(q_slot) q_slot #> $previous_loaded #> [1] "backports" "base" "checkmate" "cli" "colorspace" #> [6] "compiler" "datasets" "digest" "dplyr" "evaluate" #> [11] "fansi" "fastmap" "fs" "generics" "ggplot2" #> [16] "ggrepel" "glue" "graphics" "grDevices" "grid" #> [21] "gtable" "htmltools" "knitr" "lifecycle" "magrittr" #> [26] "methods" "munsell" "pillar" "pkgconfig" "purrr" #> [31] "R.cache" "R.methodsS3" "R.oo" "R.utils" "R6" #> [36] "Rcpp" "reprex" "rlang" "rmarkdown" "rstudioapi" #> [41] "scales" "stats" "styler" "teal.code" "teal.data" #> [46] "tibble" "tidyselect" "tools" "utf8" "utils" #> [51] "vctrs" "withr" "xfun" "yaml" #> #> $previous_attached #> [1] "ggplot2" "teal.code" "teal.data" #> #> $attached_slot #> [1] "ggplot2" #> #> $namespace_slot #> [1] "gtable" "dplyr" "tidyselect" "scales" "ggplot2" #> [6] "R6" "generics" "tibble" "munsell" "pillar" #> [11] "utf8" "grid" "fansi" "colorspace" "pkgconfig" #> [16] "Rcpp" "ggrepel" ``` Created on 2023-12-19 with [reprex v2.0.2](https://reprex.tidyverse.org)
chlebowa commented 6 months ago

TODO: create a snippet that is added before library calls that verifies if all libraries are installed on the system that execute the code. if some are missing, create a string to be copy-pasted to install missing libraries.

This subject was broached in a different issue. We do not want to automate installation. The usual Error in library(pkg) : there is no package called ā€˜pkgā€™ is quite enough IMO.

averissimo commented 6 months ago

Nonono, There's no installation automation, if you look closely at the screenshot it's installED.packages, so the snippets checks if they are installed. and outputs an error message

At the end is the snippet that I was using. We could use a library call for each one alternatively.

Calling library for each item in loadedNamespaces() will generate at least 27 lines (with {ggplot2} it grows to 42 and so forth)

Growth of loadedNamespaces ```r Filter( function(.x) .x %in% installed.packages(priority = NA_character_)[, "Package"], c(loadedNamespaces()) ) #> character(0) suppressMessages(library(teal)) Filter( function(.x) .x %in% installed.packages(priority = NA_character_)[, "Package"], c(loadedNamespaces()) ) #> [1] "backports" "digest" "later" "R6" #> [5] "httpuv" "fastmap" "teal.transform" "magrittr" #> [9] "teal.data" "glue" "shiny" "htmltools" #> [13] "logger" "lifecycle" "teal" "teal.code" #> [17] "promises" "cli" "xtable" "teal.logger" #> [21] "checkmate" "mime" "ellipsis" "yaml" #> [25] "Rcpp" "teal.slice" "rlang" suppressMessages(library(ggplot2)) Filter( function(.x) .x %in% installed.packages(priority = NA_character_)[, "Package"], c(loadedNamespaces()) ) #> [1] "gtable" "teal.slice" "dplyr" "promises" #> [5] "tidyselect" "Rcpp" "later" "scales" #> [9] "yaml" "fastmap" "mime" "teal.code" #> [13] "ggplot2" "R6" "generics" "backports" #> [17] "teal" "checkmate" "tibble" "logger" #> [21] "munsell" "shiny" "pillar" "rlang" #> [25] "utf8" "httpuv" "teal.data" "cli" #> [29] "withr" "magrittr" "digest" "xtable" #> [33] "lifecycle" "teal.logger" "vctrs" "glue" #> [37] "fansi" "colorspace" "teal.transform" "pkgconfig" #> [41] "ellipsis" "htmltools" ```

Snippet generation

# ...
missing_code <- substitute(
  missing <- Filter(function(.x) ! .x %in% installed.packages()[, "Package"], pkgs),
  list(pkgs = sort(
    Filter(
      function(.x) .x %in% installed.packages(priority = NA_character_)[, "Package"],
      loadedNamespaces()
    )
  ))
)

warning_code <- quote(
  if (length(missing))
    stop(paste(
        "Some of the libraries needed to reproduce the results are not installed:",
        paste0("  Please use: install.packages(c(", paste0("'", missing, "'", collapse = ", "), "))"),
        sep = "\n"
    ))
)
# ...

Error message (I'm mocking that pkg1 and pkg2 are loaded)

Error in eval(warning_code) : 
  Some of the libraries needed to reproduce the results are not installed:
  Please use: install.packages(c('pkg1', 'pkg2'))
m7pr commented 4 months ago

I know the issue has been abandoned for some time, but maybe work on this one can also bring a solution to this one https://github.com/insightsengineering/teal/issues/593