insightsengineering / teal.data

Data model for teal applications
https://insightsengineering.github.io/teal.data/
Other
8 stars 7 forks source link

Consider migration to `dm` #8

Open gogonzo opened 2 years ago

gogonzo commented 2 years ago

Blocked by https://github.com/insightsengineering/teal/issues/628 - we need more insights about how chevron and teal fits together.


Some initial research

  1. Easy initialization of the relational data
    
    library(dm)

ADSL <- synthetic_cdisc_data("latest")$adsl ADTTE <- synthetic_cdisc_data("latest")$adtte ADRS <- synthetic_cdisc_data("latest")$adrs IRIS <- iris IRIS$id <- seq_len(nrow(IRIS))

dm <- dm(ADSL, ADTTE, ADRS, IRIS) |> dm_add_pk(ADSL, teal::get_cdisc_keys("ADSL")) |> dm_add_pk(ADTTE, teal::get_cdisc_keys("ADTTE")) |> dm_add_pk(ADRS, teal::get_cdisc_keys("ADRS")) |> dm_add_pk(IRIS, "id") |> dm_add_fk(table = ADTTE, columns = teal::get_cdisc_keys("ADSL"), ref_table = ADSL) |> dm_add_fk(table = ADRS, columns = teal::get_cdisc_keys("ADSL"), ref_table = ADSL)


2. Get keys 

```r
# get primary and foreign keys
dm |> dm_get_all_pks()
dm |> dm_get_all_fks()
  1. Check metadata dm throws an error if the primary keys are duplicated

    # check if keys are unique
    dm |> dm_examine_constraints()
    dm(ADSL, ADTTE, ADRS, IRIS) |>
    dm_add_pk(ADSL, teal::get_cdisc_keys("ADSL")) |>
    dm_add_pk(ADTTE, teal::get_cdisc_keys("ADSL")) |>
    dm_examine_constraints() 
  2. dm holds data in list, so the data is copied by reference

    # same objects
    identical(dm[["ADSL"]], ADSL) 
  3. Filters can be added to metadata with the dm_filter

    # apply filter
    dm |>
    dm_filter(ADSL, SEX == "M") |>
    dm_filter(ADRS, PARAMCD %in% c("BESRSPI", "INVET")) |>
    dm_filter(ADRS, AVISIT %in% c("BASELINE", "SCREENING", "CYCLE2", "DAY1"))
# filter and select
dm |>
  dm_filter(ADSL, SEX == "M") |>
  dm_filter(ADRS, PARAMCD %in% c("BESRSPI", "INVET")) |>
  dm_apply_filters() |>
  dm_select(ADSL, USUBJID, STUDYID, AGE) |>
  dm_flatten_to_tbl(ADRS) |>
  nrow()
  1. Joining is performed automatically by dm
    # join tables
    joined <- dm |>
    dm_filter(ADSL, SEX == "M") |>
    dm_filter(ADRS, PARAMCD %in% c("BESRSPI", "INVET")) |>
    dm_apply_filters() |>
    dm_select(ADSL, USUBJID, STUDYID, AGE) |>
    dm_select(ADRS, USUBJID, STUDYID, PARAMCD, AVISIT, AGE, SEX, AVAL) |>
    dm_join_to_tbl(ADSL, ADRS, join = left_join)
    1. dm object is mutable
library(dm)
library(scda)

ADSL <- synthetic_cdisc_data("latest")$adsl
ADTTE <- synthetic_cdisc_data("latest")$adtte
ADRS <- synthetic_cdisc_data("latest")$adrs
IRIS <- iris
IRIS$id <- seq_len(nrow(IRIS))

dm <- dm(ADSL, ADTTE, ADRS, IRIS) 
dm2 <- dm |>
  dm_add_pk(ADSL, teal.data::get_cdisc_keys("ADSL")) |>
  dm_add_pk(ADTTE, teal.data::get_cdisc_keys("ADTTE")) |>
  dm_add_pk(ADRS, teal.data::get_cdisc_keys("ADRS")) |>
  dm_add_pk(IRIS, "id") |>
  dm_add_fk(table = ADTTE, columns = teal.data::get_cdisc_keys("ADSL"), ref_table = ADSL) |>
  dm_add_fk(table = ADRS, columns = teal.data::get_cdisc_keys("ADSL"), ref_table = ADSL)

lobstr::obj_addr(dm)
lobstr::obj_addr(dm2)
  1. dm operations preserve attributes
    
    dm_filtered <- dm |>
    dm_filter(ADSL, SEX == "M") |>
    dm_filter(ADRS, PARAMCD %in% c("BESRSPI", "INVET")) |>
    dm_apply_filters() |>
    dm_select(ADSL, USUBJID, STUDYID, AGE) |>
    dm_select(ADRS, USUBJID, STUDYID, PARAMCD, AVISIT, AGE, SEX, AVAL) |>
    dm_select_tbl(ADSL, ADRS) |>
    dm_flatten_to_tbl(ADRS, ADSL)

attr(joined, "label") formatters::var_labels(joined)


 Issues:
 - `dm` renames duplicated the columns automatically so we might have a problem with difference between input column names and output column names.  `input$select_from_adsl` and `input$select_from_adrs` won't be found in `colnames(data)`
 - ~won't support same dataset selectors by default. Following will not possible~ 
     Not needed since we are going to fix data_merge https://github.com/insightsengineering/NEST-roadmap/issues/36
    ```r
    ANL1 <- ADRS %>% filter(PARAMCD == "PARAM1") %>% select(<keys without PARAMCD>, PARAM1 = AVAL) 
    ANL2 <- ADRS %>% filter(PARAMCD == "PARAM2") %>% select(<keys without, PARAMCD>, PARAM2 = AVAL)
    ANL <- merge(ANL1, ANL2, by = c(<keys without PARAMCD>))
Maybe we should reconsider this problematic option and just to do this: 
```r
ANL <- ADRS %>% filter(PARAMCD %in% c("PARAM1", "PARAM2")_) %>% select(<keys>, AVAL)
```
kpagacz commented 2 years ago

Instead of replacing what we currently have with dm let's think about how we can support dm in the filter panel. I recommend first looking into how to make filter panel work with dm and then go to the data models. It goes to what I think should be the API of a teal module (so raw data passed independently of the filter state with a possibility of the filtering state being applied to the data inside the module).

gogonzo commented 2 years ago

@kpagacz Do you think about having FilteredData.dm? Could you elaborate more how would you expect this to look like?

gogonzo commented 2 years ago

Having dm in the filter panel could be a obstacle to have independent datasets reactivity. dm is one object so every change in the object (filter in any dataset) will trigger reactivity of the whole object, as we can't separate each dataset from the reactivity map. See the example - change the filter in IRIS and ADSL-related output will wait for computation. Currently in the filter panel only child/parents datasets reactively-depended, other are unconnected.

library(shiny)
library(dm)
library(scda)

ADSL <- synthetic_cdisc_data("latest")$adsl
ADTTE <- synthetic_cdisc_data("latest")$adtte
ADRS <- synthetic_cdisc_data("latest")$adrs
IRIS <- iris
IRIS$id <- seq_len(nrow(IRIS))

dm <- dm(ADSL, ADTTE, ADRS, IRIS) |>
  dm_add_pk(ADSL, teal.data::get_cdisc_keys("ADSL")) |>
  dm_add_pk(ADTTE, teal.data::get_cdisc_keys("ADTTE")) |>
  dm_add_pk(ADRS, teal.data::get_cdisc_keys("ADRS")) |>
  dm_add_pk(IRIS, "id") |>
  dm_add_fk(table = ADTTE, columns = teal.data::get_cdisc_keys("ADSL"), ref_table = ADSL) |>
  dm_add_fk(table = ADRS, columns = teal.data::get_cdisc_keys("ADSL"), ref_table = ADSL)

shinyApp(
  ui = fluidPage(
    sidebarPanel(
      selectInput("Species", "Select species", choices = unique(iris$Species), selected = "Setosa"),
      verbatimTextOutput("ADSL_rows"),
      verbatimTextOutput("iris_rows")
    )
  ),
  server = function(input, output, session) {
    filtered_data <- eventReactive(input$Species, {
      dm_filter(dm, IRIS, Species %in% input$Species)
    })

    filtered_adsl <- reactive(filtered_data()[["ADSL"]])
    filtered_iris <- reactive(filtered_data()[["IRIS"]])

    output$ADSL_rows <- renderText({
      print("ADSL triggered")
      Sys.sleep(2)
      nrow(filtered_adsl())
    })

    output$iris_rows <- renderText({
      print("IRIS triggered")
      Sys.sleep(2)
      nrow(filtered_iris())
    })
  }
)
gogonzo commented 2 years ago

I have an opinion that dm can fix some of our problems but will introduce other:

  1. FilterPanel - We can't build any reactivity between datasets which will trigger re-computation of all reactives regardless of datasets they are using. Specifically, in our case it means that each module within app will be affected by the change in any dataset. We might consider though usage of the same datasets in all modules, which is not a good idea as users requested filter-panel display being module specific. UPDATE: Above is not necessary true - bindCache can actually handle if particular datasets changed.

  2. If we allow "choices_selected" from multiple datasets, then dm wouldn't help us so much with the merge - when one selects duplicated variable names from different datasets then dm renames them automatically, making them hard to track in the merged dataset (unless our merge_datasets module will provide this information about renamed variables)