insightsengineering / teal.data

Data model for teal applications
https://insightsengineering.github.io/teal.data/
Other
8 stars 8 forks source link

[Bug]: Merging two sibling ADaM datasets using default joining keys results in a cartesian product. #317

Open chlebowa opened 5 months ago

chlebowa commented 5 months ago

What happened?

Inference of joining keys by parent seems insufficient.

Consider a case where two child datasets are joined. ADLB and ADRS have the same primary keys, "STUDYID", "USUBJID", "PARAMCD", "AVISIT".

> library(teal.data)
> library(dplyr)
> 
> ADLB <- rADLB
> ADRS <- rADRS
>
> jk <- default_cdisc_join_keys["ADLB", "ADRS"]
>
> full_join(ADLB, ADRS) |> dim()
Joining with `by = join_by(STUDYID, USUBJID, SUBJID, SITEID, AGE, AGEU, SEX, RACE, ETHNIC, COUNTRY, DTHFL, INVID, INVNAM, ARM, ARMCD, ACTARM, ACTARMCD, TRT01P, TRT01A, TRT02P, TRT02A, REGION1, STRATA1,  
STRATA2, BMRKR1, BMRKR2, ITTFL, SAFFL, BMEASIFL, BEP01FL, AEWITHFL, RANDDT, TRTSDTM, TRTEDTM, TRT01SDTM, TRT01EDTM, TRT02SDTM, TRT02EDTM, AP01SDTM, AP01EDTM, AP02SDTM, AP02EDTM, EOSSTT, EOTSTT, EOSDT,   
EOSDY, DCSREAS, DTHDT, DTHCAUS, DTHCAT, LDDTHELD, LDDTHGR1, LSTALVDT, DTHADY, ADTHAUT, ASEQ, PARAM, PARAMCD, AVAL, ADTM, ADY, AVISIT, AVISITN)`
[1] 11600   104
> full_join(ADLB, ADRS, by = jk) |> dim()
[1] 67200   165
Warning message:
In full_join(ADLB, ADRS, by = jk) :
  Detected an unexpected many-to-many relationship between `x` and `y`.
ℹ Row 1 of `x` matches multiple rows in `y`.
ℹ Row 1 of `y` matches multiple rows in `x`.
ℹ If a many-to-many relationship is expected, set `relationship = "many-to-many"` to silence this warning.
>

Joining by default, i.e. using intersect(names(x), names(y)) correctly uses all primary keys as joining keys. Extracting a join_key_set from default_cdisc_join_keys results in a cartesian product.

sessionInfo()

> packageVersion("teal.data")
[1] '0.6.0'

Relevant log output

No response

Code of Conduct

Contribution Guidelines

Security Policy

averissimo commented 5 months ago

This is a byproduct of the incomplete specification of the CDISC datasets file and the implicit relationship inference.

The implicit relationship inference currently works:

image

This creates problem when the 2 datasets have additional shared primary keys, resulting in "unnecessary" duplicate rows when merging.

Possible solution

We could evolve the mechanism to:

POC Patch This would require just a few modifications. ```diff diff --git a/R/join_keys-utils.R b/R/join_keys-utils.R index 7ace0e74..6626755a 100644 --- a/R/join_keys-utils.R +++ b/R/join_keys-utils.R @@ -136,12 +136,14 @@ update_keys_given_parents <- function(x) { common_ix_1 <- unname(keys_d1_parent) %in% unname(keys_d2_parent) common_ix_2 <- unname(keys_d2_parent) %in% unname(keys_d1_parent) + pk_intersect <- intersect(jk[d1, d1], jk[d2, d2]) + # No common keys between datasets - leave empty if (all(!common_ix_1)) next fk <- structure( - names(keys_d2_parent)[common_ix_2], - names = names(keys_d1_parent)[common_ix_1] + union(pk_intersect, names(keys_d2_parent)[common_ix_2]), + names = union(pk_intersect, names(keys_d1_parent)[common_ix_1]) ) jk[[d1]][[d2]] <- fk # mutate join key } ```

Assumption on this:

This fix/enhancement would change the behavior of joining keys and we would need to communicate the change to our users. It's not a breaking change though.

Alternative solution (manual)

Expand the cdisc_datasets.yaml file to include joining keys for additional relationships between datasets that are different from parent.

I would advise against this as it would greatly expand on that file and its complexity