pharmaverse / sdtm.oak

An EDC and Data Standard agnostic SDTM data transformation engine that automates the transformation of raw clinical data in ODM format to SDTM based on standard mapping algorithms
https://pharmaverse.github.io/sdtm.oak/
Apache License 2.0

Feature Request: raw_filter and tgt_filter parameters #54

Closed rammprasad closed 1 month ago

rammprasad commented 2 months ago

Feature Idea

Introduce functionality to filter the raw and target datasets while performing a mapping.

Example If conditions

  1. Involving raw_dat

If [AESOS.AESO] == 1 and [AESOS.AESOSP] is null then hardcode OE.OEORRES = 'Y'

AESOS is the raw dataset and AESO, AESOSP are variables in the raw dataset. OE is the target domain and OEORRES is the target variable.

hardcode_no_ct(
  raw_dat = AESOS,
  raw_var = AESO,
  tgt_var = OEORRES,
  tgt_val = "Y",
  tgt_dat = OE_INTER,
  raw_filter = (AESO == 1 & is.na(AESOSP)),
  tgt_filter = NULL,
  id_vars = oak_id_vars()
)

If [AESOS.AESO] == 1 and [AESOS.AESOSP] is null then hardcode OE.OETESTCD = 'IOISYMPO'

hardcode_ct(
  raw_dat = AESOS,
  raw_var = AETERM,
  tgt_var = OETESTCD,
  tgt_val = 'IOISYMPO',
  ct_spec = study_ct,
  ct_clst = "C123456",
  tgt_dat = NULL,
  raw_filter = (AESO == 1 & is.na(AESOSP)),
  tgt_filter = NULL,
  id_vars = oak_id_vars()
)

  2. Involving tgt_dat

If VS.VSTESTCD = 'TEMP', assign the value collected in VTLS1.TEMPLOC to VS.VSLOC.

VTLS1 is the raw dataset name and TEMPLOC is a variable in the raw dataset. VS is the target domain and VSLOC is derived.

assign_ct(
  raw_dat = VTLS1,
  raw_var = "TEMPLOC",
  tgt_var = "VSLOC",
  ct_spec = study_ct,
  ct_clst = "C12123431",
  tgt_dat = vs_inter,
  raw_filter = NULL,
  tgt_filter = (VSTESTCD == "TEMP"),
  id_vars = oak_id_vars()
)

  3. Involving raw_dat and tgt_dat but separate conditions

If [AECOV19.SPECTYP] is not null, and FA.FATESTCD = 'STATUS' and FA.FAOBJ = 'Severe Acute Resp Syndrome Coronavirus 2', then assign the value collected in SPCNM to FA.FASPEC.

In this example, AECOV19 is the raw dataset name and SPECTYP is a variable in the raw dataset. The condition also involves the target domain FA. FAOBJ and FATESTCD are previously derived SDTM variables, and FASPEC is the SDTM variable that is currently being derived.

assign_ct(
  raw_dat = AECOV19,
  raw_var = "SPCNM",
  tgt_var = "FASPEC",
  ct_spec = study_ct,
  ct_clst = "C1212121",
  tgt_dat = fa_inter,
  raw_filter = (!is.na(SPECTYP)),
  tgt_filter = (FATESTCD == "STATUS" & FAOBJ == "Severe Acute Resp Syndrome Coronavirus 2"),
  id_vars = oak_id_vars()
)

  4. Involving raw_dat and tgt_dat in the same condition

We may not be able to support this.

MH.MHLOC when MH.MHTERM = [GCAHX.NCITERM] or [GCAHX.NCITERMO]

Relevant Input

No response

Relevant Output

No response

Reproducible Example/Pseudo Code

No response

ramiromagno commented 2 months ago

As per discussion with @rammprasad I will implement instead a separate function to mark records in tibbles for filtering. The new function will be: condition_by().

rammprasad commented 2 months ago

> As per discussion with @rammprasad I will implement instead a separate function to mark records in tibbles for filtering. The new function will be: condition_by().

Shall we name the function add_cond() or add_condition()?
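
To make the record-marking idea concrete, here is a minimal base R sketch. It is not the sdtm.oak implementation: `add_cond_sketch()`, the `"cond"` attribute, and the toy `aesos` data are assumptions made up for illustration. It only shows one way a helper could evaluate a condition against a dataset and carry the resulting logical mask along with the data.

```r
# Hypothetical helper (not the sdtm.oak API): evaluate `cond` using the
# columns of `dat` and store the resulting logical mask as an attribute.
add_cond_sketch <- function(dat, cond) {
  mask <- eval(substitute(cond), envir = dat, enclos = parent.frame())
  stopifnot(is.logical(mask), length(mask) == nrow(dat))
  attr(dat, "cond") <- mask
  dat
}

# Toy raw dataset, invented for this example.
aesos <- data.frame(
  AESO   = c(1, 1, 2),
  AESOSP = c(NA, "rash", NA)
)

# Mark records where AESO is 1 and AESOSP is missing.
marked <- add_cond_sketch(aesos, AESO == 1 & is.na(AESOSP))
attr(marked, "cond")
```

The condition uses the element-wise `&` and `is.na()`. `&&` expects length-one operands (in R 4.3.0 and later it raises an error for longer ones), and `is.null()` applied to an existing column is always `FALSE`, so neither is suitable for row-wise conditions.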

rammprasad commented 2 months ago

The example will look like below.

Example If conditions

Involving raw_dat

If [AESOS.AESO] == 1 and [AESOS.AESOSP] is null then hardcode OE.OEORRES = 'Y'

AESOS is the raw dataset and AESO, AESOSP are variables in the raw dataset. OE is the target domain and OEORRES is the target variable.

hardcode_no_ct(
  raw_dat = add_cond(AESOS, AESO == 1 & is.na(AESOSP)),
  raw_var = AESO,
  tgt_var = OEORRES,
  tgt_val = "Y",
  tgt_dat = OE_INTER,
  id_vars = oak_id_vars()
)

If [AESOS.AESO] == 1 and [AESOS.AESOSP] is null then hardcode OE.OETESTCD = 'IOISYMPO'

hardcode_ct(
  raw_dat = add_cond(AESOS, AESO == 1 & is.na(AESOSP)),
  raw_var = AETERM,
  tgt_var = OETESTCD,
  tgt_val = 'IOISYMPO',
  ct_spec = study_ct,
  ct_clst = "C123456",
  id_vars = oak_id_vars()
)

Involving tgt_dat

If VS.VSTESTCD = 'TEMP', assign the value collected in VTLS1.TEMPLOC to VS.VSLOC.

VTLS1 is the raw dataset name and TEMPLOC is a variable in the raw dataset. VS is the target domain and VSLOC is derived.

#when using in-pipe
|>
assign_ct(
  raw_dat = VTLS1,
  raw_var = "TEMPLOC",
  tgt_var = "VSLOC",
  ct_spec = study_ct,
  ct_clst = "C12123431",
  tgt_dat = add_cond(.data, VSTESTCD == "TEMP"),
  id_vars = oak_id_vars()
)
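
The effect of a target-side condition can be sketched in base R as follows. This is an illustration under assumptions, not how sdtm.oak does it: the toy data, the single `oak_id` key, and the omission of the controlled terminology step are all simplifications invented for the example.

```r
# Toy target and raw datasets, invented for this example.
vs_inter <- data.frame(
  oak_id   = 1:3,
  VSTESTCD = c("TEMP", "PULSE", "TEMP")
)
vtls1 <- data.frame(
  oak_id  = 1:3,
  TEMPLOC = c("ORAL", "ORAL", "AXILLA")
)

# Target-side condition: only TEMP records receive a location.
keep <- vs_inter$VSTESTCD == "TEMP"

# Start with an empty target variable, then fill only the marked rows,
# looking up the raw value by key.
vs_inter$VSLOC <- NA_character_
vs_inter$VSLOC[keep] <-
  vtls1$TEMPLOC[match(vs_inter$oak_id[keep], vtls1$oak_id)]

vs_inter
```

Rows that do not satisfy the condition keep `NA` in VSLOC, so the target dataset retains all of its records.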

Involving raw_dat and tgt_dat but separate conditions

If [AECOV19.SPECTYP] is not null, and FA.FATESTCD = 'STATUS' and FA.FAOBJ = 'Severe Acute Resp Syndrome Coronavirus 2', then assign the value collected in SPCNM to FA.FASPEC.

In this example, AECOV19 is the raw dataset name, and SPECTYP is a variable in the raw dataset. The condition also involved the target domain FA. FAOBJ and FATESTCD are previously derived SDTM variables, and FASPEC is the SDTM variable that is currently derived.

#when using in-pipe
|>
assign_ct(
  raw_dat = add_cond(AECOV19, !is.na(SPECTYP)),
  raw_var = "SPCNM",
  tgt_var = "FASPEC",
  ct_spec = study_ct,
  ct_clst = "C1212121",
  tgt_dat = add_cond(.data, FATESTCD == "STATUS" & FAOBJ == "Severe Acute Resp Syndrome Coronavirus 2"),
  id_vars = oak_id_vars()
)

Involving raw_dat and tgt_dat in the same condition

We may not be able to support this. Take a look and let me know @ramiromagno

MH.MHLOC when MH.MHTERM = [GCAHX.NCITERM] or [GCAHX.NCITERMO]

Map the collected value in the GCAHX raw_dat locat raw_variable to MH.MHLOC when this condition is met: MH.MHTERM = [GCAHX.NCITERM] or [GCAHX.NCITERMO]

#when using in-pipe
|>
assign_ct(
  raw_dat = GCAHX,
  raw_var = "SPCNM",
  tgt_var = "FASPEC",
  ct_spec = study_ct,
  ct_clst = "C1212121",
  tgt_dat = add_cond(.data, MHTERM %in% GCAHX$NCITERM | MHTERM %in% GCAHX$NCITERMO),
  id_vars = oak_id_vars()
)

ramiromagno commented 2 months ago

To help understand that use case involving variables of raw_dat and tgt_dat in the same condition, could you share how you currently do it with roak's if_then_else() interface?

ramiromagno commented 2 months ago

What should happen if the condition results in NA?

rammprasad commented 1 month ago

> To help understand that use case involving variables of raw_dat and tgt_dat in the same condition, could you share how you currently do it with roak's if_then_else() interface?

{roak} handles this very differently: it is driven by metadata. The main branch has an example. Please refer to the mapping of CMMODIFY with the annotation text "If different to CM.CMTRT, then CM.CMMODIFY", which means the mapping happens only if the value in the collected column CMMODIFY is different from CMTRT. This is carried out using the spec parameters condition_left, condition_right, and condition_operator; {roak} reads them and evaluates the logical condition. It is a bit confusing because, at the moment, the variable is named CMMODIFY both in the raw dataset and in the target domain CM. I will change it in the raw dataset.

A mock of automation of this in {roak} will look like

 # Derive qualifier CMMODIFY  Annotation text = If different to CM.CMTRT then CM.CMMODIFY
  if_then_else(
    raw_dat = cm_raw,
    raw_var = CMMODIFY,
    condition_left_raw_dataset = cm_raw,
    condition_left_raw_variable = CMMODIFY,
    condition_operator = "different_to",
    condition_right_sdtm_variable_domain = CM,
    condition_right_sdtm_variable = CMTRT,
    sub_algorithm = assign_no_ct,
    tgt_var = CMDOSETXT,
    id_vars = oak_id_vars()
  ) |>

Can we do something like this in {sdtm.oak}, where filtering needs to happen based on a condition involving both raw_dat and tgt_dat?

 # Derive qualifier CMMODIFY  Annotation text  If collected value in CMMODIFY in cm_raw is different to CM.CMTRT then
  # assign the collected value to CMMODIFY in the CM domain (CM.CMMODIFY)
  assign_no_ct(
    raw_dat = cm_raw,
    raw_var = "CMMODIFY",
    add_cond = (cm_raw$CMMODIFY != .data$CMTRT),
    tgt_var = "CMMODIFY",
    id_vars = oak_id_vars()
  )
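
One way to think about a condition that spans both datasets is to align the raw values to the target rows by key first, and only then compare. The base R sketch below illustrates that idea under assumptions: the toy data and the single `oak_id` key are invented, and this is not a statement of how {sdtm.oak} or {roak} resolve such conditions.

```r
# Toy raw and target datasets, invented for this example.
cm_raw <- data.frame(
  oak_id   = 1:3,
  CMMODIFY = c("ASPIRIN", "TYLENOL", NA)
)
cm <- data.frame(
  oak_id = c(3L, 1L, 2L),
  CMTRT  = c("IBUPROFEN", "ASPIRIN", "PARACETAMOL")
)

# Align raw values to the target row order using the key.
raw_val <- cm_raw$CMMODIFY[match(cm$oak_id, cm_raw$oak_id)]

# "Different to CM.CMTRT": a missing raw value counts as not matching.
differs <- !is.na(raw_val) & raw_val != cm$CMTRT

# Assign the collected value only where the condition holds.
cm$CMMODIFY <- ifelse(differs, raw_val, NA_character_)
cm
```

Because the comparison happens after alignment, the two datasets do not need to share row order, only a common key.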

rammprasad commented 1 month ago

> What should happen if the condition results in NA?

If no records match the criteria, we create the tgt_var as an empty column.
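
For context on the NA question, the base R sketch below shows how a comparison against a missing value yields `NA` rather than `FALSE`, and one possible way to treat such records as non-matching. The variable names and the choice to coerce `NA` to `FALSE` are assumptions for illustration, not a decision recorded in this thread.

```r
# Toy raw values, invented for this example.
aeso <- c(1, NA, 2)

# Comparisons involving NA propagate NA.
cond <- aeso == 1

# One option: treat an NA condition as "record does not match".
cond_strict <- cond & !is.na(cond)

# Hardcode the target value only for matching records; others stay empty.
oeorres <- ifelse(cond_strict, "Y", NA_character_)

# If no record matches, the target variable is an all-NA column.
none_match <- ifelse(rep(FALSE, 3), "Y", NA_character_)
```

Whether `NA` should count as non-matching, or be handled in some other way, is the design choice the question raises.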

rammprasad commented 1 month ago

Preferred option to handle complex if condition.

when using in-pipe

|>
assign_ct(
  raw_dat = GCAHX,
  raw_var = "SPCNM",
  tgt_var = "FASPEC",
  ct_spec = study_ct,
  ct_clst = "C1212121",
  tgt_dat = add_cond(.data$MHTERM %in% GCAHX$NCITERM | .data$MHTERM %in% GCAHX$NCITERMO),
  id_vars = oak_id_vars()
)