Consider migration to `dm`

gogonzo commented 2 years ago

Blocked by https://github.com/insightsengineering/teal/issues/628 - we need more insights about how chevron and teal fits together.

Research CDISCData, CDISCFilteredData to store dm object to manage relationships between data.frames object.
Research how MAE can be stored
Can dm be primary data object passed to the teal_module?

Some initial research

Easy initialization of the relational data
```
library(dm)
```

ADSL <- synthetic_cdisc_data("latest")$adsl ADTTE <- synthetic_cdisc_data("latest")$adtte ADRS <- synthetic_cdisc_data("latest")$adrs IRIS <- iris IRIS$id <- seq_len(nrow(IRIS))

dm <- dm(ADSL, ADTTE, ADRS, IRIS) |> dm_add_pk(ADSL, teal::get_cdisc_keys("ADSL")) |> dm_add_pk(ADTTE, teal::get_cdisc_keys("ADTTE")) |> dm_add_pk(ADRS, teal::get_cdisc_keys("ADRS")) |> dm_add_pk(IRIS, "id") |> dm_add_fk(table = ADTTE, columns = teal::get_cdisc_keys("ADSL"), ref_table = ADSL) |> dm_add_fk(table = ADRS, columns = teal::get_cdisc_keys("ADSL"), ref_table = ADSL)


2. Get keys 

```r
# get primary and foreign keys
dm |> dm_get_all_pks()
dm |> dm_get_all_fks()

Check metadata dm throws an error if the primary keys are duplicated

# check if keys are unique
dm |> dm_examine_constraints()
dm(ADSL, ADTTE, ADRS, IRIS) |>
dm_add_pk(ADSL, teal::get_cdisc_keys("ADSL")) |>
dm_add_pk(ADTTE, teal::get_cdisc_keys("ADSL")) |>
dm_examine_constraints()

dm holds data in list, so the data is copied by reference
```
# same objects
identical(dm[["ADSL"]], ADSL) 
```

Filters can be added to metadata with the dm_filter

# apply filter
dm |>
dm_filter(ADSL, SEX == "M") |>
dm_filter(ADRS, PARAMCD %in% c("BESRSPI", "INVET")) |>
dm_filter(ADRS, AVISIT %in% c("BASELINE", "SCREENING", "CYCLE2", "DAY1"))

# filter and select
dm |>
  dm_filter(ADSL, SEX == "M") |>
  dm_filter(ADRS, PARAMCD %in% c("BESRSPI", "INVET")) |>
  dm_apply_filters() |>
  dm_select(ADSL, USUBJID, STUDYID, AGE) |>
  dm_flatten_to_tbl(ADRS) |>
  nrow()

Joining is performed automatically by dm

# join tables
joined <- dm |>
dm_filter(ADSL, SEX == "M") |>
dm_filter(ADRS, PARAMCD %in% c("BESRSPI", "INVET")) |>
dm_apply_filters() |>
dm_select(ADSL, USUBJID, STUDYID, AGE) |>
dm_select(ADRS, USUBJID, STUDYID, PARAMCD, AVISIT, AGE, SEX, AVAL) |>
dm_join_to_tbl(ADSL, ADRS, join = left_join)

dm object is mutable

library(dm)
library(scda)

ADSL <- synthetic_cdisc_data("latest")$adsl
ADTTE <- synthetic_cdisc_data("latest")$adtte
ADRS <- synthetic_cdisc_data("latest")$adrs
IRIS <- iris
IRIS$id <- seq_len(nrow(IRIS))

dm <- dm(ADSL, ADTTE, ADRS, IRIS) 
dm2 <- dm |>
  dm_add_pk(ADSL, teal.data::get_cdisc_keys("ADSL")) |>
  dm_add_pk(ADTTE, teal.data::get_cdisc_keys("ADTTE")) |>
  dm_add_pk(ADRS, teal.data::get_cdisc_keys("ADRS")) |>
  dm_add_pk(IRIS, "id") |>
  dm_add_fk(table = ADTTE, columns = teal.data::get_cdisc_keys("ADSL"), ref_table = ADSL) |>
  dm_add_fk(table = ADRS, columns = teal.data::get_cdisc_keys("ADSL"), ref_table = ADSL)

lobstr::obj_addr(dm)
lobstr::obj_addr(dm2)

dm operations preserve attributes


dm_filtered <- dm |>
dm_filter(ADSL, SEX == "M") |>
dm_filter(ADRS, PARAMCD %in% c("BESRSPI", "INVET")) |>
dm_apply_filters() |>
dm_select(ADSL, USUBJID, STUDYID, AGE) |>
dm_select(ADRS, USUBJID, STUDYID, PARAMCD, AVISIT, AGE, SEX, AVAL) |>
dm_select_tbl(ADSL, ADRS) |>
dm_flatten_to_tbl(ADRS, ADSL)

attr(joined, "label") formatters::var_labels(joined)


 Issues:
 - `dm` renames duplicated the columns automatically so we might have a problem with difference between input column names and output column names.  `input$select_from_adsl` and `input$select_from_adrs` won't be found in `colnames(data)`
 - ~won't support same dataset selectors by default. Following will not possible~ 
     Not needed since we are going to fix data_merge https://github.com/insightsengineering/NEST-roadmap/issues/36
    ```r
    ANL1 <- ADRS %>% filter(PARAMCD == "PARAM1") %>% select(<keys without PARAMCD>, PARAM1 = AVAL) 
    ANL2 <- ADRS %>% filter(PARAMCD == "PARAM2") %>% select(<keys without, PARAMCD>, PARAM2 = AVAL)
    ANL <- merge(ANL1, ANL2, by = c(<keys without PARAMCD>))

Maybe we should reconsider this problematic option and just to do this: 
```r
ANL <- ADRS %>% filter(PARAMCD %in% c("PARAM1", "PARAM2")_) %>% select(<keys>, AVAL)
```

kpagacz commented 2 years ago

Instead of replacing what we currently have with dm let's think about how we can support dm in the filter panel. I recommend first looking into how to make filter panel work with dm and then go to the data models. It goes to what I think should be the API of a teal module (so raw data passed independently of the filter state with a possibility of the filtering state being applied to the data inside the module).

gogonzo commented 2 years ago

@kpagacz Do you think about having FilteredData.dm? Could you elaborate more how would you expect this to look like?

gogonzo commented 2 years ago

Having dm in the filter panel could be a obstacle to have independent datasets reactivity. dm is one object so every change in the object (filter in any dataset) will trigger reactivity of the whole object, as we can't separate each dataset from the reactivity map. See the example - change the filter in IRIS and ADSL-related output will wait for computation. Currently in the filter panel only child/parents datasets reactively-depended, other are unconnected.

library(shiny)
library(dm)
library(scda)

ADSL <- synthetic_cdisc_data("latest")$adsl
ADTTE <- synthetic_cdisc_data("latest")$adtte
ADRS <- synthetic_cdisc_data("latest")$adrs
IRIS <- iris
IRIS$id <- seq_len(nrow(IRIS))

dm <- dm(ADSL, ADTTE, ADRS, IRIS) |>
  dm_add_pk(ADSL, teal.data::get_cdisc_keys("ADSL")) |>
  dm_add_pk(ADTTE, teal.data::get_cdisc_keys("ADTTE")) |>
  dm_add_pk(ADRS, teal.data::get_cdisc_keys("ADRS")) |>
  dm_add_pk(IRIS, "id") |>
  dm_add_fk(table = ADTTE, columns = teal.data::get_cdisc_keys("ADSL"), ref_table = ADSL) |>
  dm_add_fk(table = ADRS, columns = teal.data::get_cdisc_keys("ADSL"), ref_table = ADSL)

shinyApp(
  ui = fluidPage(
    sidebarPanel(
      selectInput("Species", "Select species", choices = unique(iris$Species), selected = "Setosa"),
      verbatimTextOutput("ADSL_rows"),
      verbatimTextOutput("iris_rows")
    )
  ),
  server = function(input, output, session) {
    filtered_data <- eventReactive(input$Species, {
      dm_filter(dm, IRIS, Species %in% input$Species)
    })

    filtered_adsl <- reactive(filtered_data()[["ADSL"]])
    filtered_iris <- reactive(filtered_data()[["IRIS"]])

    output$ADSL_rows <- renderText({
      print("ADSL triggered")
      Sys.sleep(2)
      nrow(filtered_adsl())
    })

    output$iris_rows <- renderText({
      print("IRIS triggered")
      Sys.sleep(2)
      nrow(filtered_iris())
    })
  }
)

gogonzo commented 2 years ago

I have an opinion that dm can fix some of our problems but will introduce other:

FilterPanel - We can't build any reactivity between datasets which will trigger re-computation of all reactives regardless of datasets they are using. Specifically, in our case it means that each module within app will be affected by the change in any dataset. We might consider though usage of the same datasets in all modules, which is not a good idea as users requested filter-panel display being module specific. UPDATE: Above is not necessary true - bindCache can actually handle if particular datasets changed.
If we allow "choices_selected" from multiple datasets, then dm wouldn't help us so much with the merge - when one selects duplicated variable names from different datasets then dm renames them automatically, making them hard to track in the merged dataset (unless our merge_datasets module will provide this information about renamed variables)

insightsengineering / teal.data

Consider migration to `dm` #8

Some initial research