pharmaverse / sdtm.oak

An EDC and Data Standard agnostic SDTM data transformation engine that automates the transformation of raw clinical data in ODM format to SDTM based on standard mapping algorithms
https://pharmaverse.github.io/sdtm.oak/
Apache License 2.0

Develop a function to generate --SEQ. #15

Closed houtel closed 4 months ago

houtel commented 11 months ago

Feature Idea

Purpose: Generate --SEQ from two parameters: the set of columns that define the natural key for a domain, and an initial value for each USUBJID (default 1).

Functionality: The feature should generate --SEQ for a given domain using the following algorithm. Note that this algorithm assumes that any split domains are combined into a single data frame before the --SEQ column is generated.

  1. Sort the domain by the provided natural key. If the keys do not identify distinct rows, produce a warning.
  2. Set --SEQ to the initial value (provided by parameter) for the first record having a given USUBJID. The column name for --SEQ is determined by concatenating the value of the DOMAIN column with "SEQ" (e.g. VSSEQ when DOMAIN = "VS").
  3. Increment --SEQ by 1 for each successive record for the given USUBJID.
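The steps above can be sketched in base R as follows. This is only an illustration of the proposed algorithm, not the package's implementation; the argument names follow the pseudo code in this issue, and the tar_var argument is dropped here on the assumption that the column name is always derived from DOMAIN.

```r
# Hypothetical sketch of the proposed algorithm (base R only).
generate_seq <- function(tar_dat, key_vars, init_val = 1L) {
  stopifnot(is.data.frame(tar_dat), all(key_vars %in% names(tar_dat)))

  # Step 1: sort by the natural key; warn if the key is not unique.
  ord <- do.call(order, unname(as.list(tar_dat[key_vars])))
  out <- tar_dat[ord, , drop = FALSE]
  if (anyDuplicated(out[key_vars]) > 0L) {
    warning("Key variables do not identify distinct rows.")
  }

  # Step 2: the column name is the DOMAIN value followed by "SEQ".
  seq_var <- paste0(unique(out$DOMAIN), "SEQ")
  stopifnot(length(seq_var) == 1L)

  # Steps 2-3: start at init_val and increment by 1 within each USUBJID.
  within_subject <- ave(seq_len(nrow(out)), out$USUBJID, FUN = seq_along)
  out[[seq_var]] <- within_subject + as.integer(init_val) - 1L
  rownames(out) <- NULL
  out
}

# Usage with a small made-up VS data set.
vs <- data.frame(
  DOMAIN = "VS",
  USUBJID = c("ABC123-376", "ABC123-375", "ABC123-375"),
  VSTESTCD = c("TEMP", "TEMP", "DIABP"),
  stringsAsFactors = FALSE
)
generate_seq(vs, key_vars = c("USUBJID", "VSTESTCD"))
```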

Relevant Input

Data frame containing all domain columns except for the --SEQ column and a vector containing the natural key columns for the domain.

Relevant Output

Data frame containing all domain columns including the --SEQ column.

Reproducible Example/Pseudo Code

  generate_seq(tar_dat, tar_var = "xxSEQ", key_vars = c("USUBJID", "XXCAT", "XXSCAT", "XXTERM"), init_val = 1)

Example:

  generate_seq(ds, "dsseq", key_vars = c("USUBJID", "DSCAT", "DSSCAT", "DSTERM"))

NOTE: CRAN package {sdtmval} contains assign_SEQ function which could be utilized for this purpose unless we want to minimize the dependency.

venkatamaguluri commented 11 months ago

Comments from Venkata Maguluri (Pfizer)

  1. The starting value is not necessarily 1 every time, so we give the end user control over it.
  2. Assume the data has been sorted by the key variables before this function is called.
  3. Ensure the sort order of the data does not change during SEQ assignment.

edgar-manukyan commented 11 months ago

In roak we warn the user if the keys do not identify distinct rows. Shall we build such functionality here as well? We could also help users identify the duplicate rows.
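A minimal sketch of how such duplicate rows could be surfaced, assuming the keys are passed as a character vector of column names. The helper name find_key_duplicates is hypothetical and is not taken from roak.

```r
# Hypothetical helper: return every row whose key combination occurs more than once.
find_key_duplicates <- function(dat, key_vars) {
  stopifnot(is.data.frame(dat), all(key_vars %in% names(dat)))
  keys <- dat[key_vars]
  # Flag all members of a duplicated key group, not only the later repeats.
  is_dup <- duplicated(keys) | duplicated(keys, fromLast = TRUE)
  dat[is_dup, , drop = FALSE]
}

# Usage: warn only when something was found.
vs <- data.frame(
  USUBJID = c("ABC123-375", "ABC123-375", "ABC123-376"),
  VSTESTCD = c("TEMP", "TEMP", "TEMP"),
  stringsAsFactors = FALSE
)
dups <- find_key_duplicates(vs, c("USUBJID", "VSTESTCD"))
if (nrow(dups) > 0L) {
  warning(nrow(dups), " rows are not uniquely identified by the keys.")
}
```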

edgar-manukyan commented 11 months ago

We can try to determine whether the data is already sorted.
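One possible check, sketched in base R: compute the ordering permutation for the key columns and compare it with the current row order. The function name is_sorted_by is an assumption made for illustration.

```r
# Hypothetical helper: TRUE when `dat` is already ordered by `key_vars`.
is_sorted_by <- function(dat, key_vars) {
  stopifnot(is.data.frame(dat), all(key_vars %in% names(dat)))
  ord <- do.call(order, unname(as.list(dat[key_vars])))
  # order() leaves ties in their original order, so sorted data yields 1..n.
  identical(ord, seq_len(nrow(dat)))
}

# Usage with made-up data.
vs <- data.frame(
  USUBJID = c("ABC123-375", "ABC123-375", "ABC123-376"),
  VSTESTCD = c("DIABP", "TEMP", "DIABP"),
  stringsAsFactors = FALSE
)
is_sorted_by(vs, c("USUBJID", "VSTESTCD"))
```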

edgar-manukyan commented 11 months ago

As a separate step, in roak we convert all empty string values "" into NA_character_ in all raw datasets before any of the functions start manipulating the data.
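That conversion could look roughly like this in base R. The helper name blank_to_na is hypothetical, and the sketch only touches character columns.

```r
# Hypothetical helper: replace "" with NA_character_ in all character columns.
blank_to_na <- function(dat) {
  stopifnot(is.data.frame(dat))
  for (nm in names(dat)) {
    col <- dat[[nm]]
    if (is.character(col)) {
      col[!is.na(col) & col == ""] <- NA_character_
      dat[[nm]] <- col
    }
  }
  dat
}

# Usage with a made-up raw data set.
raw <- data.frame(
  VSTESTCD = c("TEMP", ""),
  VSORRES = c(36.6, 37.1),
  stringsAsFactors = FALSE
)
blank_to_na(raw)
```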

ynsec37 commented 10 months ago

Dear developer,

Could the function work the tidyeval way, so that quotation marks are not needed for data frame variables, e.g. generate_seq(ds_temp, key_vars = c(USUBJID, DSCAT, DSSCAT, DSTERM))? Also, there is a similar function in admiral: admiral::derive_var_obs_number.
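For illustration, unquoted column names can be captured without any dependency by using base R's substitute(). This is a stand-in for proper tidy evaluation, not the tidyeval machinery itself, and the helper name key_names_from is hypothetical.

```r
# Hypothetical helper: turn c(USUBJID, DSCAT) into c("USUBJID", "DSCAT").
key_names_from <- function(dat, key_vars) {
  # substitute() captures the unevaluated expression; all.vars() extracts
  # the variable names from it and skips the call to c().
  key_names <- all.vars(substitute(key_vars))
  missing_keys <- setdiff(key_names, names(dat))
  if (length(missing_keys) > 0L) {
    stop("Key variables not found: ", paste(missing_keys, collapse = ", "))
  }
  key_names
}

# Usage with a made-up DS data set.
ds_temp <- data.frame(
  USUBJID = "ABC123-375",
  DSCAT = "DISPOSITION EVENT",
  DSTERM = "COMPLETED",
  stringsAsFactors = FALSE
)
key_names_from(ds_temp, c(USUBJID, DSCAT, DSTERM))
```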

ramiromagno commented 6 months ago

Hi Adam (@galachad):

I see that this issue is assigned to you, but it hasn't seen any activity for a long while. Are you still working on this?

rammprasad commented 5 months ago

Reference function in {roak} - https://github.com/pharmaverse/roak_pilot/blob/main/R/oak_derive_seq.R

ramiromagno commented 4 months ago

Hi @rammprasad and @edgar-manukyan:

Could you provide a few input and output data set examples for me to test my implementation?

edgar-manukyan commented 4 months ago

@ramiromagno, please find the attached domain_key_variables.csv, which determines how the domain needs to be sorted (ideally into unique rows) before the --SEQ variable is derived.

Test 1

  ds_in <- tibble::tribble(
    ~STUDYID, ~DOMAIN,      ~USUBJID,                  ~VSSPID, ~VSTESTCD,             ~VSDTC, ~VSTPTNUM,
    "ABC123",    "VS",  "ABC123-375", "/F:VTLS1-D:9795532-R:2",   "DIABP", "2020-09-01T13:31",        NA,
    "ABC123",    "VS",  "ABC123-375", "/F:VTLS1-D:9795532-R:2",    "TEMP", "2020-09-01T13:31",        NA,
    "ABC123",    "VS",  "ABC123-375", "/F:VTLS2-D:9795533-R:2",   "DIABP", "2020-09-28T11:00",         2,
    "ABC123",    "VS",  "ABC123-375", "/F:VTLS2-D:9795533-R:2",    "TEMP", "2020-09-28T11:00",         2,
    "ABC123",    "VS",  "ABC123-376", "/F:VTLS1-D:9795591-R:1",   "DIABP",       "2020-09-20",        NA,
    "ABC123",    "VS",  "ABC123-376", "/F:VTLS1-D:9795591-R:1",    "TEMP",       "2020-09-20",        NA
  )
  result <- oak_derive_seq(ds_in)

  expect_equal(result$VSSEQ,
               c(1L, 2L, 3L, 4L, 1L, 2L))

Test 2

  ds_in <- tibble::tribble(
    ~STUDYID, ~DOMAIN,      ~USUBJID,                  ~VSSPID,
    "ABC123",    "ZZ",  "ABC123-375", "/F:VTLS1-D:9795532-R:2",
  )

  expect_error(
    oak_derive_seq(ds_in),
    paste(
      "ZZ domain keys must be in the domain_key_variables.csv",
      "Please update the file and use oak_load_study_config().",
      sep = "\n"
    )
  )

Test 3

  ds_in <- tibble::tribble(
    ~STUDYID,      ~RSUBJID,    ~SCTESTCD, ~DOMAIN,     ~SREL,           ~SCCAT,
    "ABC123",  "ABC123-210",   "LVSBJIND",  "APSC",  "FRIEND", "CAREGIVERSTUDY",
    "ABC123",  "ABC123-210",   "EDULEVEL",  "APSC",  "FRIEND", "CAREGIVERSTUDY",
    "ABC123",  "ABC123-210",     "TMSPPT",  "APSC",  "FRIEND", "CAREGIVERSTUDY",
    "ABC123",  "ABC123-211",    "CAREDUR",  "APSC", "SIBLING", "CAREGIVERSTUDY",
    "ABC123",  "ABC123-211",   "LVSBJIND",  "APSC", "SIBLING", "CAREGIVERSTUDY",
    "ABC123",  "ABC123-212",    "JOBCLAS",  "APSC",  "SPOUSE", "CAREGIVERSTUDY"
  )

  result <- oak_derive_seq(ds_in)

  expect_equal(result$SCSEQ,
               c(1L, 2L, 3L, 1L, 2L, 1L))

ramiromagno commented 4 months ago

Thanks @edgar-manukyan!

In test 3, ds_in does not contain all key variables. According to the file domain_key_variables.csv, the variables USUBJID, SCSPID, SCTESTCD and VISITNUM should also be there, shouldn't they? How could the function oak_derive_seq() work in that case?

edgar-manukyan commented 4 months ago

> Thanks @edgar-manukyan!
>
> In test 3, ds_in does not contain all key variables. According to the file domain_key_variables.csv, the variables USUBJID, SCSPID, SCTESTCD and VISITNUM should also be there, shouldn't they? How could the function oak_derive_seq() work in that case?

Awesome observation, @ramiromagno. This test covers a so-called associated persons domain; see how roak handles it at https://github.com/pharmaverse/roak_pilot/blob/main/R/oak_derive_seq.R#L42
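Test 3 expects SCSEQ for DOMAIN = "APSC", which suggests that the "AP" prefix is dropped when the --SEQ column is named. A small sketch of that naming rule follows; the helper seq_var_name is hypothetical, and the rule is inferred from Test 3 rather than copied from the roak source.

```r
# Hypothetical helper: derive the --SEQ column name from a DOMAIN value.
seq_var_name <- function(domain) {
  stopifnot(is.character(domain), length(domain) == 1L)
  # Assumption: a four-letter code starting with "AP" is an associated
  # persons domain whose variables use the last two letters as prefix.
  is_ap <- nchar(domain) == 4L && startsWith(domain, "AP")
  prefix <- if (is_ap) substr(domain, 3L, 4L) else domain
  paste0(prefix, "SEQ")
}

# Usage with the domains from Tests 1 and 3.
seq_var_name("VS")
seq_var_name("APSC")
```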

ramiromagno commented 4 months ago

I see, sorry for the oversight!

BTW: Just one more question: is the domain_key_variables.csv comprehensive?

ramiromagno commented 4 months ago

I'm sorry if I am overlooking something here again, but if the domain is APSC, shouldn't the column APID be there in ds_in?

edgar-manukyan commented 4 months ago

> I'm sorry if I am overlooking something here again, but if the domain is APSC, shouldn't the column APID be there in ds_in?

Interestingly, roak just ignores them, and you should ask Ram about this :) https://github.com/pharmaverse/roak_pilot/blob/main/R/oak_derive_seq.R#L80

ramiromagno commented 4 months ago

I see. Could it be that not all keys are mandatory? There might be a few that are optional, and in that case it could be fine to sort only with what is available...? @rammprasad help please! :)
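One way to read the "sort only with what is available" idea, pending confirmation of which keys are mandatory, is sketched below. The helper name available_keys is hypothetical; it keeps the configured key order and reports the keys it had to skip.

```r
# Hypothetical helper: keep only the configured keys that exist in the data.
available_keys <- function(dat, key_vars) {
  stopifnot(is.data.frame(dat), is.character(key_vars))
  is_present <- key_vars %in% names(dat)
  if (!all(is_present)) {
    message("Skipping missing key variables: ",
            paste(key_vars[!is_present], collapse = ", "))
  }
  # Subsetting key_vars preserves the order given in the configuration.
  key_vars[is_present]
}

# Usage: the key set quoted for SC against the columns present in Test 3.
ds_in <- data.frame(
  STUDYID = "ABC123",
  RSUBJID = "ABC123-210",
  SCTESTCD = "LVSBJIND",
  DOMAIN = "APSC",
  stringsAsFactors = FALSE
)
available_keys(ds_in, c("USUBJID", "SCSPID", "SCTESTCD", "VISITNUM"))
```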

edgar-manukyan commented 4 months ago

> I see, sorry for the oversight!
>
> BTW: Just one more question: is the domain_key_variables.csv comprehensive?

No worries, you are picking up SDTM concepts so quickly. After three years, I still feel dizzy about it. The attached file was used for the tests. This one, domain_key_variables (2).csv, is more comprehensive, though, as Ram said, it is dynamic and study teams will change it based on their setup. That's the reason why they call it a configuration file.

ramiromagno commented 4 months ago

Thank you @edgar-manukyan, that really helps! You're the best. I thought that set of variables used for sorting were the actual keys that define a record in a specific SDTM domain data set. Isn't this set in stone in the standard?

armenic commented 4 months ago

> Thank you @edgar-manukyan, that really helps! You're the best. I thought that set of variables used for sorting were the actual keys that define a record in a specific SDTM domain data set. Isn't this set in stone in the standard?

They are supposed to be keys that uniquely identify the rows, and we even warn users if we notice that they don't.

ramiromagno commented 4 months ago

Thanks @edgar-manukyan. I've updated the PR according to your feedback so far. But we will have to wait for @rammprasad's feedback on these other corner cases.