lwheinsberg / dbGaPCheckup

Easy checks for data integrity and proper formatting of the dbGaP subject phenotype data set and data dictionary.
https://lwheinsberg.github.io/dbGaPCheckup/index.html

`integer_check` bug when `SUBJECT_ID` is character #5

Closed DanielEWeeks closed 1 year ago

DanielEWeeks commented 1 year ago

The example data frames A through O contain only numeric columns, but P contains a character `SUBJECT_ID`, which causes `integer_check` to fail. Presumably it would fail whenever one or more of the checked columns is non-numeric, because `isTRUE(all(data == floor(data), na.rm = TRUE))` only works when every entry passed in as `data` is numeric.
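The failure can be reproduced outside the package with a minimal sketch. Here `int_check_sketch` is a hypothetical stand-in built around the expression shown in the backtrace below; it is not the package's actual `int_check` source, and the input vectors are made up for illustration.

```r
# Hypothetical stand-in for the internal helper, built around the
# expression that appears in the backtrace.
int_check_sketch <- function(data) {
  isTRUE(all(data == floor(data), na.rm = TRUE))
}

# Numeric input works, and missing values are ignored via na.rm = TRUE.
whole_ok   <- int_check_sketch(c(1, 2, NA, 4))
fractional <- int_check_sketch(c(1.5, 2))

# Character input never reaches the comparison: floor() raises an error.
# tryCatch() captures the message instead of stopping the script.
result <- tryCatch(
  int_check_sketch(c("ID001", "ID002")),
  error = function(e) conditionMessage(e)
)

print(whole_ok)
print(fractional)
print(result)
```

The error is raised by `floor()` itself, so any character column handed to the check fails before the integer comparison is ever made.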

> integer_check(DD.dict.A, DS.data.P)
Error in `summarize()`:
ℹ In argument: `across(all_of(int.vars), int_check, .names =
  "{.col}")`.
Caused by error in `across()`:
! Can't compute column `SUBJECT_ID`.
Caused by error in `floor()`:
! non-numeric argument to mathematical function
Run `rlang::last_error()` to see where the error occurred.

> rlang::last_error()
<error/rlang_error>
Error in `summarize()`:
ℹ In argument: `across(all_of(int.vars), int_check, .names =
  "{.col}")`.
Caused by error in `across()`:
! Can't compute column `SUBJECT_ID`.
Caused by error in `floor()`:
! non-numeric argument to mathematical function
---
Backtrace:
  1. dbGaPCheckup::integer_check(DD.dict.A, DS.data.P)
 13. dbGaPCheckup (local) `<fn>`(SUBJECT_ID)
 14. base::isTRUE(all(data == floor(data), na.rm = TRUE))

lwheinsberg commented 1 year ago

Dan added ExampleR, which includes a character-valued `SUBJECT_ID`. Updated documentation. I added this as an example for `integer_check`.

lwheinsberg commented 1 year ago

The error from `integer_check(DD.dict.A, DS.data.P)` is caused by a mismatch between the TYPE declared in the data dictionary and the actual type of the column in the data set. In `DD.dict.A`, `SUBJECT_ID` is listed as TYPE integer, but in `DS.data.P`, `SUBJECT_ID` is a character column. The function therefore tries to apply `int_check` to a non-numeric column, because the dictionary lists that column as numeric.

To resolve this issue, I adjusted `integer_check` to (1) run `int_check` only on numeric columns and (2) add the names of variables that are listed as TYPE numeric but read into R as character strings to the information returned to the user.
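The two-part adjustment described above could look roughly like the following base R sketch. It is not the package's actual code: `integer_check_sketch`, `int_check_sketch`, the returned list names, and the toy dictionary and data are all hypothetical, and the sketch assumes the dictionary has `VARNAME` and `TYPE` columns.

```r
# Hypothetical integer test, as in the backtrace expression.
int_check_sketch <- function(x) isTRUE(all(x == floor(x), na.rm = TRUE))

# Sketch of the fix: check only numeric columns, and report the
# dictionary-declared integer variables that are not numeric in R.
integer_check_sketch <- function(dict, data) {
  # Variables the dictionary declares as integer (assumed column names).
  int.vars <- dict$VARNAME[grepl("integer", dict$TYPE)]
  int.vars <- intersect(int.vars, names(data))

  # Split them by the type R actually assigned on import.
  is.num       <- vapply(data[int.vars], is.numeric, logical(1))
  numeric.vars <- int.vars[is.num]
  mismatched   <- int.vars[!is.num]

  # (1) Run the integer test on numeric columns only.
  passed <- vapply(data[numeric.vars], int_check_sketch, logical(1))

  # (2) Return the type mismatches alongside the failed checks.
  list(not.integer = numeric.vars[!passed], type.mismatch = mismatched)
}

# Toy inputs, made up for illustration.
dict <- data.frame(
  VARNAME = c("SUBJECT_ID", "AGE", "HEIGHT"),
  TYPE    = c("integer", "integer", "integer"),
  stringsAsFactors = FALSE
)
data <- data.frame(
  SUBJECT_ID = c("A1", "A2"),
  AGE        = c(30, 41),
  HEIGHT     = c(170.5, 180),
  stringsAsFactors = FALSE
)

res <- integer_check_sketch(dict, data)
print(res)
```

In this toy case the character `SUBJECT_ID` is reported as a type mismatch instead of triggering the `floor()` error, while `HEIGHT` is still flagged as containing non-integer values.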