metrumresearchgroup / bbr

R interface for model and project management
https://metrumresearchgroup.github.io/bbr/

Bad nm_join when NUM is rendered int in one data frame and numeric in another? #587

Open kylebaron opened 1 year ago

kylebaron commented 1 year ago

I'm seeing this on my arm64 mac. It's the first time I'm seeing it on any platform and not sure where it came from ... data.table, I guess.

Basically, NUM is getting rendered as int in the data set but numeric in the table output. I think this is causing an issue with the join ... when you coerce both to int the join works fine. The issue isn't in nm_join() per se but it could affect output.

We have run nm_join() using data.table under the hood 100s of times and it has always worked for me, including on my intel mac. So this could be just something with my environment. But logging it here in case it isn't.

library(bbr)
library(data.table)
library(dplyr)
## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:data.table':
## 
##     between, first, last

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(readr)

NUM comes back as numeric from the .tab file

tab <- fread("model/pk/106/106.tab", skip = 1)
tab %>% select(1:3) %>% str()
## Classes 'data.table' and 'data.frame':   4292 obs. of  3 variables:
##  $ NUM  : num  1 2 3 4 5 6 7 8 9 10 ...
##  $ IPRED: num  0 68.5 90.8 97.3 96.7 ...
##  $ NPDE : num  0 -0.534 0.279 1.555 1.881 ...
##  - attr(*, ".internal.selfref")=<externalptr>
anyNA(tab)
## [1] FALSE

NUM comes back as int from the .csv file

data <- fread("data/derived/analysis3.csv", na.strings = '.')
data %>% select(1:3) %>% str()
## Classes 'data.table' and 'data.frame':   4360 obs. of  3 variables:
##  $ C  : logi  NA NA NA NA NA NA ...
##  $ NUM: int  1 2 3 4 5 6 7 8 9 10 ...
##  $ ID : int  1 1 1 1 1 1 1 1 1 1 ...
##  - attr(*, ".internal.selfref")=<externalptr>
anyNA(data$CMT)
## [1] FALSE

Now we’re trying to join an int with double and it doesn’t work

j <- left_join(tab, data, by = "NUM")
summary(j$CMT)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    1.00    1.00    2.00    1.73    2.00    2.00      39
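To narrow down which rows produce the 39 NAs, it can help to list the key values in the table output that have no counterpart in the data set and print them at full precision. The sketch below uses small toy data frames, not the real `106.tab` or `analysis3.csv`, and it assumes (the issue does not confirm this) that the unmatched keys are doubles sitting very slightly off a whole number.

```r
# Toy stand-ins for the table output and the analysis data set (hypothetical values)
tab  <- data.frame(NUM = c(1, 2, 3 + 1e-12), IPRED = c(0, 68.5, 90.8))
data <- data.frame(NUM = 1:3, CMT = c(1L, 2L, 2L))

# Keys present in the table output that are missing from the data set
unmatched <- tab$NUM[!tab$NUM %in% data$NUM]

# Print with enough digits to expose small deviations from a whole number
print(unmatched, digits = 17)

# Distance of each unmatched key from the nearest whole number
unmatched - round(unmatched)
```

If the differences printed in the last step are tiny but non-zero, the join is failing on value, not only on column type.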

Fine if we join int with int

jj <- left_join(mutate(tab, NUM = as.integer(NUM)), data, by = "NUM")
summary(jj$CMT)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.000   1.000   2.000   1.732   2.000   2.000
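Since `as.integer()` truncates toward zero, a slightly more defensive version of the coercion above rounds first and refuses to convert anything that is not close to a whole number. This is only a sketch: `as_integer_key()` and its `tol` argument are hypothetical names, not part of bbr, dplyr, or data.table, and the toy vectors stand in for the real `NUM` columns.

```r
# Hypothetical helper: coerce a join key to integer, but only when every
# value is within `tol` of a whole number; otherwise stop with an error.
as_integer_key <- function(x, tol = 1e-8) {
  if (is.integer(x)) return(x)
  rounded <- round(x)
  off <- abs(x - rounded) > tol
  if (any(off, na.rm = TRUE)) {
    stop("key has non-integer values, e.g. ", x[which(off)[1]])
  }
  as.integer(rounded)
}

# Toy keys: one double is slightly off a whole number (assumed failure mode)
tab_key  <- c(1, 2, 3 + 1e-12, 4)
data_key <- 1:4

match(tab_key, data_key)                  # third key fails to match
match(as_integer_key(tab_key), data_key)  # every key matches after coercion
```

Rounding before converting avoids mapping a value such as 2.9999999999 to 2, which plain `as.integer()` would do.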

This join issue is happening in nm_join() too

options(bbr.verbose = FALSE)
dat <- nm_join("model/pk/106")
summary(dat$CMT)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    1.00    1.00    2.00    1.73    2.00    2.00      39

With readr, NUM also comes back as numeric from the .tab file

tab <- read_table("model/pk/106/106.tab", skip = 1)
## 
## ── Column specification ────────────────────────────────────────────────────────
## cols(
##   NUM = col_double(),
##   IPRED = col_double(),
##   NPDE = col_double(),
##   CWRES = col_double(),
##   DV = col_double(),
##   PRED = col_double(),
##   RES = col_double(),
##   WRES = col_double()
## )
tab %>% select(1:3) %>% str()
## tibble [4,292 × 3] (S3: tbl_df/tbl/data.frame)
##  $ NUM  : num [1:4292] 1 2 3 4 5 6 7 8 9 10 ...
##  $ IPRED: num [1:4292] 0 68.5 90.8 97.3 96.7 ...
##  $ NPDE : num [1:4292] 0 -0.534 0.279 1.555 1.881 ...

NUM comes back as numeric from the .csv file

data <- read_csv("data/derived/analysis3.csv")
## Rows: 4360 Columns: 34
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (6): C, AMT, USUBJID, STUDY, ACTARM, RF
## dbl (28): NUM, ID, TIME, SEQ, CMT, EVID, DV, AGE, WT, HT, EGFR, ALB, BMI, SE...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
data %>% select(1:3) %>% str()
## tibble [4,360 × 3] (S3: tbl_df/tbl/data.frame)
##  $ C  : chr [1:4360] "." "." "." "." ...
##  $ NUM: num [1:4360] 1 2 3 4 5 6 7 8 9 10 ...
##  $ ID : num [1:4360] 1 1 1 1 1 1 1 1 1 1 ...

Now we’re trying to join a double with a double, and it works

jjj <- left_join(tab, data, by = "NUM")

summary(jjj$CMT)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.000   1.000   2.000   1.732   2.000   2.000

packageVersion("data.table")
## [1] ‘1.14.6’