insightsengineering / teal.data

Data model for teal applications
https://insightsengineering.github.io/teal.data/
Other
9 stars 8 forks source link

Simplify data constructor for join_keys #162

Closed gogonzo closed 1 year ago

gogonzo commented 1 year ago

Feature description

The main reason of this issue is that single cdisc_dataset will be replaced by the data.frame. It means that cdisc_data call might look like this:

cdisc_data(
  adsl = adsl,
  adtte = adtte,
  join_keys = ???
)

We need to find an easy way to specify join_keys. Previously setting join_keys was dependent on each cdisc_dataset() which contained a default list of keys for given dataset name. Right now join_keys can't no longer depend on a partial information from each dataset, which makes the code painful in some cases. Consider scenario

keys_list <- teal.data:::default_cdisc_keys
jk <- teal.data::join_keys(
  teal.data::join_key("ADSL", "ADSL", keys = get_cdisc_keys("ADSL")),
  teal.data::join_key("ADAE", "ADAE", keys = get_cdisc_keys("ADAE")),
  teal.data::join_key("ADCM", "ADCM", keys = get_cdisc_keys("ADCM")),
  teal.data::join_key("ADEX", "ADEX", keys = get_cdisc_keys("ADEX")),
  teal.data::join_key("ADRS", "ADRS", keys = get_cdisc_keys("ADRS")),
  teal.data::join_key("ADLB", "ADLB", keys = get_cdisc_keys("ADLB")),
  teal.data::join_key("ADTR", "ADTR", keys = c("STUDYID", "USUBJID", "PARAMCD", "AVISIT")),
  teal.data::join_key("ADTRWF", "ADTRWF", keys = c("STUDYID", "USUBJID", "PARAMCD", "AVISIT")),
  teal.data::join_key("ADRSSWIM", "ADRSSWIM", keys = get_cdisc_keys("ADRS")),
  teal.data::join_key("ADAE", keys_list[["ADAE"]]$parent, keys = keys_list[["ADAE"]]$foreign),
  teal.data::join_key("ADCM", keys_list[["ADCM"]]$parent, keys = keys_list[["ADCM"]]$foreign),
  teal.data::join_key("ADEX", keys_list[["ADEX"]]$parent, keys = keys_list[["ADEX"]]$foreign),
  teal.data::join_key("ADRS", keys_list[["ADRS"]]$parent, keys = keys_list[["ADRS"]]$foreign),
  teal.data::join_key("ADLB", keys_list[["ADLB"]]$parent, keys = keys_list[["ADLB"]]$foreign),
  teal.data::join_key("ADTR", "ADSL", keys = c("STUDYID", "USUBJID", "PARAMCD", "AVISIT")),
  teal.data::join_key("ADTRWF", "ADSL", keys = c("STUDYID", "USUBJID", "PARAMCD", "AVISIT")),
  teal.data::join_key("ADRSSWIM", keys_list[["ADRS"]]$parent, keys = keys_list[["ADRS"]]$foreign)
)

Proposed solution

  1. Instead of differentiating cdisc_data and teal_data function, (cdisc) keys should be generated by separate function. Functions definitions should look like this:
teal_data <- function(..., join_keys = join_keys()) {
  # make teal_data code
}

cdisc_data <- function(..., join_keys = cdisc_join_keys(...)) {
  teal_data(..., join_keys = join_keys)
}
  1. Don't force users to specify dataset_2 when they specify "primary" join_key.

Look at join_key docs and on the example below. When specifying primary key, developer needs to specify join_key(dataset_1 = "data", dataset_2 = "data", keys = "id")

df_parent <- data.frame(id = 1:10, a = letters[1:10])
df_child <- expand.grid(parent_id = 1:10, param = 1:3)
df_child <- transform(df_child, b = rnorm(nrow(df_child)))

# currently
join_keys(
  join_key("df_parent", "df_parent", keys = "id"),
  join_key("df_child", "df_child", keys = c("parent_id", "param")),
  join_key("df_child", "df_parent", keys = c("parent_id" = "id"))
)

# desired
join_keys(
  join_key("df_parent", keys = "id"),
  join_key("df_child", keys = c("parent_id", "param")),
  join_key("df_child", "df_parent", keys = c("parent_id" = "id"))
)

# desired alternative
join_keys(
  primary_key("df_parent", keys = "id"),
  primary_key("df_child", keys = c("parent_id", "param")),
  join_key("df_child", "df_parent", keys = c("parent_id" = "id"))
)

Please fix a vignette about join keys.

averissimo commented 1 year ago

Created PR #179 for this issue :relaxed:

I wasn't here for the creation/refining discussion so please let me know if I missed anything.