lwheinsberg / dbGaPCheckup

Easy checks for data integrity and proper formatting of the dbGaP subject phenotype data set and data dictionary.
https://lwheinsberg.github.io/dbGaPCheckup/index.html

`missing_value_check` #10

Closed DanielEWeeks closed 1 year ago

DanielEWeeks commented 1 year ago

In the 8/14 version of the 2017-2019 Samoan data, missing_value_check flags some variables even though they properly have NA=N/A in their first VALUES column.

I think it fails because the value_meaning_table function maps NA=N/A to a VALUE of "NA" (a character string) rather than to NA (the R missing value code), while this line sets up a check for the R missing value code:

    codes <- c(NA, unique(na.omit(non.NA.missing.codes)))

Oh, but codes does need the R NA missing value code, because it is used to find the columns in the data that contain at least one NA via this line:

    m.cols <- DS.data %>% select_if(~any(. %in% code)) %>%
        names()

But when the code is NA, the line right after the m.cols line is this:

    DD.cols <- tb %>% filter(.data$VALUE == code)

When the code is NA, this does not find any such columns because in the tb, the NA is instead the character string "NA".
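
The behavior described above can be reproduced with a small sketch. The `tb` below is a made-up stand-in for the output of value_meaning_table (its column names and rows are assumptions for illustration, not taken from the Samoan data). In R, comparing a string to `NA` with `==` yields `NA` rather than `TRUE` or `FALSE`, and `filter()` keeps only rows where the condition is `TRUE`.

```r
library(dplyr)

# Hypothetical stand-in for the value/meaning table; not real study data
tb <- data.frame(
  VARNAME = c("AGE", "SEX", "SEX"),
  VALUE   = c("NA", "1", "2"),
  MEANING = c("N/A", "Male", "Female"),
  stringsAsFactors = FALSE
)

code <- NA

# "NA" == NA evaluates to NA, so filter() drops every row
n_by_code <- tb %>% filter(.data$VALUE == code) %>% nrow()

# Comparing against the character string "NA" finds the NA=N/A mapping
n_by_string <- tb %>% filter(.data$VALUE == "NA") %>% nrow()

print(c(by_code = n_by_code, by_string = n_by_string))
```

With this toy table, the first filter returns no rows and the second returns the single AGE row.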

If I instead do

    DD.cols <- tb %>% filter(.data$VALUE == "NA")

then I do find all the columns that have a NA=N/A VALUES=MEANING mapping.

So maybe the solution here is to have two parallel codes:

lwheinsberg commented 1 year ago

This issue has been resolved with the following change to the `DD.cols <- tb %>% filter(.data$VALUE == code)` line mentioned above:

    # Find columns in the data dictionary that specify a value for the given code.
    # Change to resolve issue #10: the search is conditional upon code being NA or non-NA.
    if (is.na(code)) {
      DD.cols <- tb %>%
        filter(.data$VALUE == "NA")
    } else {
      DD.cols <- tb %>%
        filter(.data$VALUE == code)
    }
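
For comparison, the same conditional can be written by first translating the code into the string key that the table stores. This is only a sketch of an equivalent formulation, not the package's implementation: `dd_rows_for_code` and the toy `tb` are hypothetical names and data introduced here for illustration.

```r
library(dplyr)

# Hypothetical helper (not part of dbGaPCheckup): dictionary rows for one code
dd_rows_for_code <- function(tb, code) {
  # Per this issue, the table stores the R missing value as the string "NA"
  key <- if (is.na(code)) "NA" else as.character(code)
  tb %>% filter(.data$VALUE == key)
}

# Made-up value/meaning table for illustration only
tb <- data.frame(
  VARNAME = c("AGE", "BMI", "SEX"),
  VALUE   = c("NA", "-9999", "1"),
  MEANING = c("N/A", "Missing", "Male"),
  stringsAsFactors = FALSE
)

na_rows  <- dd_rows_for_code(tb, NA)     # rows with the NA=N/A mapping
num_rows <- dd_rows_for_code(tb, -9999)  # rows with a non-NA missing code

print(na_rows$VARNAME)
print(num_rows$VARNAME)
```

One caveat with either form: a dictionary that uses the literal text "NA" as an ordinary value would be indistinguishable from the missing value mapping.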

Incidentally, while investigating this error, I also discovered that the column names in ExampleS were not correct (X.SUBJECT_ID vs. SUBJECT_ID, etc.). I have now corrected this.