lwheinsberg / dbGaPCheckup

Easy checks for data integrity and proper formatting of the dbGaP subject phenotype data set and data dictionary.
https://lwheinsberg.github.io/dbGaPCheckup/index.html

minmax_check: documentation #12

Closed DanielEWeeks closed 1 year ago

DanielEWeeks commented 1 year ago

The minmax_check documentation states that it returns:

'Information (A list of variables that exceed the listed MIN or MAX values).'

but actually it returns a list of unique values that lie outside of the interval defined by the MIN and MAX values.

It might also be nice to return a sorted list instead of an unsorted one.

Reword for clarity, perhaps something like this (if we decided to sort the list):

'Information (A sorted list of unique values that lie outside of the range defined by the listed MIN and MAX values).'

But that wording probably isn't quite right either: what if only the MIN is specified and the MAX is left unspecified?

So maybe it should be:

'Information (A sorted list of unique values that are either less than the MIN value or greater than the MAX value).'
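To make the proposed wording concrete, here is a minimal sketch of what "a sorted list of unique values that are either less than the MIN value or greater than the MAX value" could look like in R. The helper name out_of_range_values and the example data are hypothetical and are not taken from the dbGaPCheckup source.

```r
# Hypothetical helper for illustration only; not part of dbGaPCheckup.
# Returns the sorted unique values of x that fall below min_val or above max_val.
out_of_range_values <- function(x, min_val, max_val) {
  # which() keeps only positions where the comparison is TRUE, dropping NA results
  flagged <- x[which(x < min_val | x > max_val)]
  sort(unique(flagged))
}

# Made-up example data: ages with a few impossible values and one missing value
ages <- c(34, -1, 120, 57, -1, 200, NA)
result <- out_of_range_values(ages, min_val = 0, max_val = 110)
print(result)
```

Note that the duplicate -1 is reported once and the NA in the data is not flagged, which matches the "unique values" wording.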

Looks like the code might be assuming that both MIN and MAX are defined:

            range_dictionary <- c(DD.dict$MIN[row], DD.dict$MAX[row])

but I don't think both are required to be defined. So I guess the code is robust to one or both of MIN and MAX being NA:

> 5 < NA | 5 > 3
[1] TRUE

DanielEWeeks commented 1 year ago

Also there is a spelling mistake in the title of this help page:

"Mimimum and Maximum Values Check"

should instead be

"Minimum and Maximum Values Check"

DanielEWeeks commented 1 year ago

Yes, because the 'which' function is used here:

        flagged <- dataset_na[which(dataset_na[, ind] < 
          range_dictionary[1] | dataset_na[, ind] > 
          range_dictionary[2]), , drop = FALSE]

the code is robust to having one or both of the MIN and MAX values be missing.
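A small standalone sketch of why 'which' matters here, using made-up values rather than the package's data: a comparison against NA yields NA, and 'which' returns only the positions that are TRUE, so NA results are silently dropped.

```r
# Made-up data for illustration; range_dictionary mimics c(MIN, MAX) with MIN missing
values <- c(1, 5, 12, NA, 20)
range_dictionary <- c(NA, 10)

# Comparisons against NA give NA; NA | TRUE is TRUE, NA | FALSE is NA
cond <- values < range_dictionary[1] | values > range_dictionary[2]
print(cond)

# which() returns only the TRUE positions, so NA comparisons are ignored
flagged_max_only <- values[which(cond)]
print(flagged_max_only)

# With both MIN and MAX missing, every comparison is NA and nothing is flagged
range_none <- c(NA, NA)
flagged_none <- values[which(values < range_none[1] | values > range_none[2])]
print(flagged_none)

# Without which(), logical indexing with NA would introduce NA entries instead
flagged_without_which <- values[cond]
print(flagged_without_which)
```

Indexing directly with the logical vector, as in the last line, would return NA entries for every NA comparison, which is what the 'which' call avoids.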

lwheinsberg commented 1 year ago

Resolutions:

(1) Corrected spelling error in page title ("Mimimum and Maximum Values Check" changed to "Minimum and Maximum Values Check"). (2) Added sorting to information return list. (3) Made the return description more informative: 'Information (A sorted list of unique values that are either less than the MIN value or greater than the MAX value).'

Item (2) was trickier than I expected. Incidentally, I discovered a small bug that made the function behave differently when non.NA.missing.codes were specified.

    if (length(na.omit(non.NA.missing.codes)) == 0) {
      dataset_na <- DS.data
    } else {
      dataset_na <- data.frame(replace_with_na_all(DS.data, conditionFormula))
    }

Under the if branch, dataset_na was a data frame, while under the else branch it was a tibble. I corrected this by forcing the temporary dataset_na to be a data frame when missing value codes are present as well, so that the sort function added after this code works properly.

DanielEWeeks commented 1 year ago

But a tibble is a data frame, so I would expect the sort function to work the same on a tibble as it does on a data frame. So I'm not sure I understand what the problem was.

Here's an example that shows that the gapminder tibble is also a data frame:

> class(gapminder)
[1] "tbl_df"     "tbl"        "data.frame"