Closed houtel closed 4 months ago
Comments from Venkata Maguluri (Pfizer):
In roak we warn the user if the keys do not identify distinct rows. Shall we build such functionality here as well? We could help users identify the duplicate rows.
We could also try to determine whether the data is already sorted.
As a separate step, in roak we convert all empty string values "" into NA_character_ in all raw datasets before any of the functions start manipulating the data.
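The three checks mentioned in this comment could look roughly like the base R sketch below. The helper names `find_duplicate_keys()`, `is_sorted_by()` and `blank_to_na()` are hypothetical and are not taken from roak; this only illustrates the idea.

```r
# Hypothetical helper: return the rows whose key combination occurs more than once.
find_duplicate_keys <- function(data, key_vars) {
  keys <- data[, key_vars, drop = FALSE]
  # duplicated() flags second and later occurrences; scanning from the end as
  # well also flags the first occurrence of each duplicated key.
  is_dup <- duplicated(keys) | duplicated(keys, fromLast = TRUE)
  if (any(is_dup)) {
    warning(sum(is_dup), " rows are not uniquely identified by: ",
            paste(key_vars, collapse = ", "), call. = FALSE)
  }
  data[is_dup, , drop = FALSE]
}

# Hypothetical helper: TRUE if the rows are already ordered by the key columns.
is_sorted_by <- function(data, key_vars) {
  keys <- unname(as.list(data[, key_vars, drop = FALSE]))
  # order() is stable, so already-sorted data yields the permutation 1..n.
  identical(do.call(order, c(keys, list(method = "radix"))), seq_len(nrow(data)))
}

# Hypothetical helper: turn empty strings into NA_character_ in character columns.
blank_to_na <- function(data) {
  for (col in names(data)[vapply(data, is.character, logical(1))]) {
    x <- data[[col]]
    x[!is.na(x) & x == ""] <- NA_character_
    data[[col]] <- x
  }
  data
}

ds <- data.frame(
  USUBJID = c("01", "01", "02"),
  DSTERM  = c("COMPLETED", "COMPLETED", ""),
  stringsAsFactors = FALSE
)

dups   <- suppressWarnings(find_duplicate_keys(ds, c("USUBJID", "DSTERM")))
sorted <- is_sorted_by(ds, c("USUBJID", "DSTERM"))
clean  <- blank_to_na(ds)
```

Returning the offending rows, rather than only warning, is one way to help users locate the duplicates.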
Dear developer,
Could the function work the tidyeval way, so that quotation marks are not needed for data frame variables? For example: generate_seq(ds_temp, key_vars = c(USUBJID, DSCAT, DSSCAT, DSTERM)).
Also, there is a similar function in admiral: admiral::derive_var_obs_number.
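To illustrate what unquoted key variables could look like, here is a minimal sketch that uses base R non-standard evaluation rather than rlang. The function name `arrange_by_keys()` is hypothetical, and a real tidyeval implementation would more likely rely on rlang or tidyselect.

```r
# Hypothetical sketch: accept unquoted column names, e.g. c(USUBJID, DSTERM).
arrange_by_keys <- function(data, key_vars) {
  # substitute() captures the expression unevaluated; all.vars() extracts the
  # symbol names (the call to c() itself is not included).
  keys <- all.vars(substitute(key_vars))
  missing_keys <- setdiff(keys, names(data))
  if (length(missing_keys) > 0) {
    stop("Key variables not found in data: ",
         paste(missing_keys, collapse = ", "), call. = FALSE)
  }
  cols <- unname(as.list(data[, keys, drop = FALSE]))
  out <- data[do.call(order, c(cols, list(method = "radix"))), , drop = FALSE]
  rownames(out) <- NULL
  out
}

ds_temp <- data.frame(
  USUBJID = c("02", "01", "01"),
  DSTERM  = c("A", "B", "A"),
  stringsAsFactors = FALSE
)

sorted_ds <- arrange_by_keys(ds_temp, key_vars = c(USUBJID, DSTERM))
```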
Hi Adam (@galachad):
I see that this issue is assigned to you, but it hasn't seen any activity for a long while. Are you still working on this?
Reference function in {roak} - https://github.com/pharmaverse/roak_pilot/blob/main/R/oak_derive_seq.R
Hi @rammprasad and @edgar-manukyan:
Could you provide a few input and output data set examples for me to test my implementation?
@ramiromagno, please find the domain_key_variables.csv, which determines how the domain needs to be sorted (ideally into unique rows) before the --SEQ variable is derived.
ds_in <- tibble::tribble(
~STUDYID, ~DOMAIN, ~USUBJID, ~VSSPID, ~VSTESTCD, ~VSDTC, ~VSTPTNUM,
"ABC123", "VS", "ABC123-375", "/F:VTLS1-D:9795532-R:2", "DIABP", "2020-09-01T13:31", NA,
"ABC123", "VS", "ABC123-375", "/F:VTLS1-D:9795532-R:2", "TEMP", "2020-09-01T13:31", NA,
"ABC123", "VS", "ABC123-375", "/F:VTLS2-D:9795533-R:2", "DIABP", "2020-09-28T11:00", 2,
"ABC123", "VS", "ABC123-375", "/F:VTLS2-D:9795533-R:2", "TEMP", "2020-09-28T11:00", 2,
"ABC123", "VS", "ABC123-376", "/F:VTLS1-D:9795591-R:1", "DIABP", "2020-09-20", NA,
"ABC123", "VS", "ABC123-376", "/F:VTLS1-D:9795591-R:1", "TEMP", "2020-09-20", NA
)
result <- oak_derive_seq(ds_in)
expect_equal(result$VSSEQ,
c(1L, 2L, 3L, 4L, 1L, 2L))
ds_in <- tibble::tribble(
~STUDYID, ~DOMAIN, ~USUBJID, ~VSSPID,
"ABC123", "ZZ", "ABC123-375", "/F:VTLS1-D:9795532-R:2",
)
expect_error(
oak_derive_seq(ds_in),
paste(
"ZZ domain keys must be in the domain_key_variables.csv",
"Please update the file and use oak_load_study_config().",
sep = "\n"
)
)
ds_in <- tibble::tribble(
~STUDYID, ~RSUBJID, ~SCTESTCD, ~DOMAIN, ~SREL, ~SCCAT,
"ABC123", "ABC123-210", "LVSBJIND", "APSC", "FRIEND", "CAREGIVERSTUDY",
"ABC123", "ABC123-210", "EDULEVEL", "APSC", "FRIEND", "CAREGIVERSTUDY",
"ABC123", "ABC123-210", "TMSPPT", "APSC", "FRIEND", "CAREGIVERSTUDY",
"ABC123", "ABC123-211", "CAREDUR", "APSC", "SIBLING", "CAREGIVERSTUDY",
"ABC123", "ABC123-211", "LVSBJIND", "APSC", "SIBLING", "CAREGIVERSTUDY",
"ABC123", "ABC123-212", "JOBCLAS", "APSC", "SPOUSE", "CAREGIVERSTUDY"
)
result <- oak_derive_seq(ds_in)
expect_equal(result$SCSEQ,
c(1L, 2L, 3L, 1L, 2L, 1L))
Thanks @edgar-manukyan!
In test 3, ds_in does not contain all key variables. According to the file domain_key_variables.csv, these variables should also be there: USUBJID, SCSPID, SCTESTCD and VISITNUM, shouldn't they? How could the function oak_derive_seq() work in that case?
Awesome observation @ramiromagno. This is testing a so-called associated persons domain, and I see this in roak: https://github.com/pharmaverse/roak_pilot/blob/main/R/oak_derive_seq.R#L42
I see, sorry for the oversight!
BTW: Just one more question: is the domain_key_variables.csv comprehensive?
I'm sorry if I am overlooking something here again, but if the domain is APSC, shouldn't the column APID be there in ds_in?
Interestingly roak just ignores them and you should ask Ram about this :) https://github.com/pharmaverse/roak_pilot/blob/main/R/oak_derive_seq.R#L80
I see. Could it be that not all keys are mandatory? There might be a few that are optional, and in that case it could be fine to sort only by what is available...? @rammprasad help please! :)
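Purely to make the "sort only by what is available" idea concrete, here is a small sketch. The key list below is illustrative and is not read from domain_key_variables.csv, and whether silently skipping absent keys is the intended behaviour is exactly the open question for @rammprasad.

```r
# Illustrative key list for one domain (an assumption, not the real config).
configured_keys <- c("USUBJID", "SCSPID", "SCTESTCD", "VISITNUM")

ds_in <- data.frame(
  USUBJID  = c("ABC123-211", "ABC123-210"),
  SCTESTCD = c("CAREDUR", "LVSBJIND"),
  stringsAsFactors = FALSE
)

# Keep only the configured keys that exist as columns, preserving config order.
available_keys <- intersect(configured_keys, names(ds_in))
ignored_keys   <- setdiff(configured_keys, names(ds_in))

cols <- unname(as.list(ds_in[, available_keys, drop = FALSE]))
ds_sorted <- ds_in[do.call(order, c(cols, list(method = "radix"))), , drop = FALSE]
```

Reporting `ignored_keys` to the user, instead of dropping them silently, would at least make the fallback visible.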
I see, sorry for the oversight!
BTW: Just one more question: is the domain_key_variables.csv comprehensive?
No worries, you are picking up SDTM concepts so quickly. After three years, I still feel dizzy about it. The attached file was used for the tests. This one domain_key_variables (2).csv is more comprehensive, though as Ram said it is dynamic and study teams will change it based on their setup. That's the reason why they call it a configuration file.
Thank you @edgar-manukyan, that really helps! You're the best. I thought the set of variables used for sorting was the actual set of keys that defines a record in a specific SDTM domain data set. Isn't this set in stone in the standard?
They are supposed to be keys that uniquely identify the rows, and we even warn users if we notice that they don't.
Thanks @edgar-manukyan. I've updated the PR according to your feedback so far. But we will have to wait for @rammprasad's feedback on these other corner cases.
Feature Idea
Purpose
Generate --SEQ when the set of columns that define the natural key for a domain and an initial value for each USUBJID (defaulting to 1) are provided as parameters.
Functionality
The feature should generate --SEQ for a given domain using the following algorithm. Note that this algorithm assumes that any split domains are combined into a single data frame prior to generating the --SEQ column.
1) Sort the domain by the provided natural key. If the keys do not identify distinct rows, produce a warning.
2) Set --SEQ to the initial value (provided by parameter) for the first record having a given USUBJID. The column name for --SEQ is determined by concatenating the value of the DOMAIN column with "SEQ" (e.g. VSSEQ when DOMAIN = "VS").
3) Increment --SEQ by 1 for each successive record for the given USUBJID.
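The three steps above can be sketched in base R as follows. This is a minimal illustration under stated assumptions, not the package implementation: it assumes a single DOMAIN value per data frame, derives the column name from DOMAIN instead of taking a tar_var argument, and sorts in C-locale order.

```r
generate_seq <- function(data, key_vars, init_val = 1L) {
  domain <- unique(data$DOMAIN)
  stopifnot(length(domain) == 1L)          # assumption: one domain per data frame
  seq_var <- paste0(domain, "SEQ")         # e.g. "VSSEQ" when DOMAIN = "VS"

  keys <- data[, key_vars, drop = FALSE]

  # Step 1: warn if the natural key does not identify distinct rows, then sort.
  if (anyDuplicated(keys) > 0) {
    warning("Key variables do not identify distinct rows: ",
            paste(key_vars, collapse = ", "), call. = FALSE)
  }
  ord <- do.call(order, c(unname(as.list(keys)), list(method = "radix")))
  data <- data[ord, , drop = FALSE]
  rownames(data) <- NULL

  # Steps 2 and 3: number records 1, 2, ... within each USUBJID in sorted
  # order, then shift so that the first record gets init_val.
  within_subject <- ave(seq_len(nrow(data)), data$USUBJID, FUN = seq_along)
  data[[seq_var]] <- within_subject + as.integer(init_val) - 1L
  data
}

ds <- data.frame(
  DOMAIN   = "VS",
  USUBJID  = c("ABC123-376", "ABC123-375", "ABC123-375"),
  VSTESTCD = c("DIABP", "TEMP", "DIABP"),
  stringsAsFactors = FALSE
)

out <- generate_seq(ds, key_vars = c("USUBJID", "VSTESTCD"))
out10 <- generate_seq(ds, key_vars = c("USUBJID", "VSTESTCD"), init_val = 10)
```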
Relevant Input
Data frame containing all domain columns except for the --SEQ column and a vector containing the natural key columns for the domain.
Relevant Output
Data frame containing all domain columns including the --SEQ column.
Reproducible Example/Pseudo Code
generate_seq(tar_dat, tar_var = "xxSEQ", key_vars = c("USUBJID", "XXCAT", "XXSCAT", "XXTERM"), init_val = 1)
Example: generate_seq(ds, "dsseq", key_vars = c("USUBJID", "DSCAT", "DSSCAT", "DSTERM"))
NOTE: The CRAN package {sdtmval} contains the assign_SEQ function, which could be used for this purpose unless we want to minimize dependencies.