pharmaverse / admiral

ADaM in R Asset Library
https://pharmaverse.github.io/admiral
Apache License 2.0
215 stars 60 forks source link

Feature Request: Can we add a setting to convert all missing character variable to "" instead of NA #2466

Closed WangLaoK closed 3 weeks ago

WangLaoK commented 1 month ago

Feature Idea

By default, admiral package will derive NA for missing character variables. e.g.: if paramcd/param not mapped for a derive_vars_merged_lookup() function.

However, character variables should be "" instead of NA there. could we add below code or add some global setting to ensure all characters variable use "" instead of NA. na2blank <- function(x){x <- ifelse(is.character(x) & is.na(x),"",x)}

Relevant Input

No response

Relevant Output

No response

Reproducible Example/Pseudo Code

No response

bundfussr commented 1 month ago

Hi @WangLaoK ,

Such a function is already available: convert_na_to_blanks()

Please note that all admiral functions assume that a missing character value (in CDISC sense) is represented as NA. I.e., all input dataset must adhere to that. Otherwise, the results may be wrong.

WangLaoK commented 1 month ago

Hi @bundfussr ,

I agree with you we treat blanks string and NA as different types in R environment, .

But I think if we convert the data to XPT(For submission), then they will be treated as "" for all. ( XPT don't have NA for object type).

Then this may cause a lot of messy in define.xml rule/derivation and not good for reviewer to see two or more conditions for a missing rule.

For example, when I want to do some post process to handle both NA and "" records from BASEC and derive another variable DTYPE case_when( is.na(BASEC) | BASEC == "" ~ "NONBASE", (If we have a setting and put all NA to blank, then only one condition here) T ~ "BASE")

BR, Hans

bms63 commented 1 month ago

Hi @WangLaoK thanks for the Feature Request.

I'm wondering why you would allow NAs and blanks into a variable that represent the same thing? You should just pick one and I really recommend using NAs.

For submission - I would keep the derivations in the define language agnostic so an FDA Reviewer can use the tools in SAS or R to follow along with your derivation.

bundfussr commented 1 month ago

For example, when I want to do some post process to handle both NA and "" records from BASEC and derive another variable DTYPE case_when( is.na(BASEC) | BASEC == "" ~ "NONBASE", (If we have a setting and put all NA to blank, then only one condition here) T ~ "BASE")

@WangLaoK , if you need to read in a xpt file, I would call convert_blanks_to_na directly after that. Then you can use a single condition (is.na()) and you are in line with the admiral functions.

WangLaoK commented 1 month ago

Hi @bundfussr and @bms63 thank you for suggestions. But I still prefer to have a setting to avoid different type of missing value when we create ADaM variables. I found that in source code derive_vars_merged() have a parameter may fit my request. I try to call this missing_value but it doesn't work. it seems it is not written in the function, could you please help to add this missing_value to other function? image

bundfussr commented 1 month ago

Hi @bundfussr and @bms63 thank you for suggestions. But I still prefer to have a setting to avoid different type of missing value when we create ADaM variables. I found that in source code derive_vars_merged() have a parameter may fit my request.

Hi @WangLaoK , some of the admiral functions like derive_vars_merged() or derive_var_merge_exist_flag() have a missing_value or missing_values argument. However, the intention of these arguments is different from yours. These arguments allow to specify a value for subjects which are not included in the additional dataset (dataset_add). By default all new variables are set to NA for subjects which are not included in the additional dataset. If these subjects should be included in a summary, NA isn't a good value. The missing_values argument allows to define a non-NA value like "Missing" or "Unknown" for them. In principle you could use the argument to set the value to "" but I would disadvise that for the reasons mentioned before.

I don't like to add a global setting for specifying whether NA or "" should be used as missing value because then we would need to update all functions and templates such that they either take the global setting into account or consider both NA and "" as missing value. This would be a lot of work and would make the code and templates less readable. It would also be an additional burden for the users as they would need to do the same in their code.

Just considering NA as a missing value seems much easier and less error-prone to me.

pangchaoran commented 4 weeks ago

Hi @WangLaoK ,

Such a function is already available: convert_na_to_blanks()

Please note that all admiral functions assume that a missing character value (in CDISC sense) is represented as NA. I.e., all input dataset must adhere to that. Otherwise, the results may be wrong.

@bundfussr , in the final R data (e.g. adsl.rds to be used in TFLs), do you also suggest to present missing as NA, which is more compatible with admiral or other pharmaverse packages? In this case, in the review guidance submitted to HA, we may also need to suggest HA to use convert_blanks_to_na after read_xpt.

bundfussr commented 4 weeks ago

Hi @WangLaoK , Such a function is already available: convert_na_to_blanks() Please note that all admiral functions assume that a missing character value (in CDISC sense) is represented as NA. I.e., all input dataset must adhere to that. Otherwise, the results may be wrong.

@bundfussr , in the final R data (e.g. adsl.rds to be used in TFLs), do you also suggest to present missing as NA, which is more compatible with admiral or other pharmaverse packages? In this case, in the review guidance submitted to HA, we may also need to suggest HA to use convert_blanks_to_na after read_xpt.

Yes, I think it makes sense to use the same approach in all programs (SDTMs, ADaMs, and TLFs).

I'm not sure if we need to add anything to the review guidance because it depends on which tools are used by the HA. If they have their own tools which represent a missing value as "", they shouldn't use convert_blanks_to_na(). If they use pharmaverse tools like admiral, they should use it. As this is already described in the documentation (Handling of Missing Values), I wouldn't add it to the review guidance.

WangLaoK commented 3 weeks ago

Hi @bundfussr and @bms63 thank you for suggestions. But I still prefer to have a setting to avoid different type of missing value when we create ADaM variables. I found that in source code derive_vars_merged() have a parameter may fit my request.

Hi @WangLaoK , some of the admiral functions like derive_vars_merged() or derive_var_merge_exist_flag() have a missing_value or missing_values argument. However, the intention of these arguments is different from yours. These arguments allow to specify a value for subjects which are not included in the additional dataset (dataset_add). By default all new variables are set to NA for subjects which are not included in the additional dataset. If these subjects should be included in a summary, NA isn't a good value. The missing_values argument allows to define a non-NA value like "Missing" or "Unknown" for them. In principle you could use the argument to set the value to "" but I would disadvise that for the reasons mentioned before.

I don't like to add a global setting for specifying whether NA or "" should be used as missing value because then we would need to update all functions and templates such that they either take the global setting into account or consider both NA and "" as missing value. This would be a lot of work and would make the code and templates less readable. It would also be an additional burden for the users as they would need to do the same in their code.

Just considering NA as a missing value seems much easier and less error-prone to me.

Hi @bundfussr , thank you so much for your time. I will follow the guidance when I use admiral packages. please close the issue.