metrumresearchgroup / mrgsolve

Simulate from ODE-based population PK/PD and QSP models in R
https://mrgsolve.org
GNU General Public License v2.0
130 stars 36 forks source link

NA values in input data sets #312

Closed billdenney closed 6 years ago

billdenney commented 6 years ago

I'm not sure where to start looking for the fix to the error mentioned in the subject. The full error is below (and I unfortunately can't share the code).

> vpcresult <-
+   model %>%
+   data_set(vpc_dataset) %>%
+   mrgsim()
Dropping non-numeric columns: C EXCLUDE_TEXT Treatment_Text USUBJID CMT_name

error: Mat::operator(): index out of bounds
Error in tran_mrgsim(<S4 object of class "mrgmod">, data = c(1, 1, 1,  : 
  Mat::operator(): index out of bounds
kylebaron commented 6 years ago

Never seen this before. But it's is in some C++ code somewhere and the model functions never get a matrix; so maybe the $OMEGA or $SIGMA matrices?

To start narrowing it down:

If you can:

revar(model)
vpcresult <-
  model %>%
  mrgsim()
vpcresult <-
  model %>%
  drop_re() %>%
  mrgsim()
vpcresult <-
  model %>%
  drop_re() %>%
  data_set(vpc_dataset) %>%
  mrgsim()
billdenney commented 6 years ago

It seems happy enough with drop_re:

> revar(model)
$omega
$...
             [,1]      [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12] [,13]    [,14]    [,15] [,16] [,17] [,18] [,19] [,20] [,21]    [,22] [,23] [,24] [,25] [,26]
ETA1:   0.0732038 0.0261146    0    0    0    0    0    0    0     0     0     0     0 0.000000 0.000000     0     0     0     0     0     0 0.000000     0     0     0     0
ETA2:   0.0261146 0.0302285    0    0    0    0    0    0    0     0     0     0     0 0.000000 0.000000     0     0     0     0     0     0 0.000000     0     0     0     0
ETA3:   0.0000000 0.0000000    0    0    0    0    0    0    0     0     0     0     0 0.000000 0.000000     0     0     0     0     0     0 0.000000     0     0     0     0
ETA4:   0.0000000 0.0000000    0    0    0    0    0    0    0     0     0     0     0 0.000000 0.000000     0     0     0     0     0     0 0.000000     0     0     0     0
ETA5:   0.0000000 0.0000000    0    0    0    0    0    0    0     0     0     0     0 0.000000 0.000000     0     0     0     0     0     0 0.000000     0     0     0     0
ETA6:   0.0000000 0.0000000    0    0    0    0    0    0    0     0     0     0     0 0.000000 0.000000     0     0     0     0     0     0 0.000000     0     0     0     0
ETA7:   0.0000000 0.0000000    0    0    0    0    0    0    0     0     0     0     0 0.000000 0.000000     0     0     0     0     0     0 0.000000     0     0     0     0
ETA8:   0.0000000 0.0000000    0    0    0    0    0    0    0     0     0     0     0 0.000000 0.000000     0     0     0     0     0     0 0.000000     0     0     0     0
ETA9:   0.0000000 0.0000000    0    0    0    0    0    0    0     0     0     0     0 0.000000 0.000000     0     0     0     0     0     0 0.000000     0     0     0     0
ETA10:  0.0000000 0.0000000    0    0    0    0    0    0    0     0     0     0     0 0.000000 0.000000     0     0     0     0     0     0 0.000000     0     0     0     0
ETA11:  0.0000000 0.0000000    0    0    0    0    0    0    0     0     0     0     0 0.000000 0.000000     0     0     0     0     0     0 0.000000     0     0     0     0
ETA12:  0.0000000 0.0000000    0    0    0    0    0    0    0     0     0     0     0 0.000000 0.000000     0     0     0     0     0     0 0.000000     0     0     0     0
ETA13:  0.0000000 0.0000000    0    0    0    0    0    0    0     0     0     0     0 0.000000 0.000000     0     0     0     0     0     0 0.000000     0     0     0     0
ETA14:  0.0000000 0.0000000    0    0    0    0    0    0    0     0     0     0     0 0.411096 0.000000     0     0     0     0     0     0 0.000000     0     0     0     0
ETA15:  0.0000000 0.0000000    0    0    0    0    0    0    0     0     0     0     0 0.000000 0.107419     0     0     0     0     0     0 0.000000     0     0     0     0
ETA16:  0.0000000 0.0000000    0    0    0    0    0    0    0     0     0     0     0 0.000000 0.000000     0     0     0     0     0     0 0.000000     0     0     0     0
ETA17:  0.0000000 0.0000000    0    0    0    0    0    0    0     0     0     0     0 0.000000 0.000000     0     0     0     0     0     0 0.000000     0     0     0     0
ETA18:  0.0000000 0.0000000    0    0    0    0    0    0    0     0     0     0     0 0.000000 0.000000     0     0     0     0     0     0 0.000000     0     0     0     0
ETA19:  0.0000000 0.0000000    0    0    0    0    0    0    0     0     0     0     0 0.000000 0.000000     0     0     0     0     0     0 0.000000     0     0     0     0
ETA20:  0.0000000 0.0000000    0    0    0    0    0    0    0     0     0     0     0 0.000000 0.000000     0     0     0     0     0     0 0.000000     0     0     0     0
ETA21:  0.0000000 0.0000000    0    0    0    0    0    0    0     0     0     0     0 0.000000 0.000000     0     0     0     0     0     0 0.000000     0     0     0     0
ETA22:  0.0000000 0.0000000    0    0    0    0    0    0    0     0     0     0     0 0.000000 0.000000     0     0     0     0     0     0 0.132547     0     0     0     0
ETA23:  0.0000000 0.0000000    0    0    0    0    0    0    0     0     0     0     0 0.000000 0.000000     0     0     0     0     0     0 0.000000     0     0     0     0
ETA24:  0.0000000 0.0000000    0    0    0    0    0    0    0     0     0     0     0 0.000000 0.000000     0     0     0     0     0     0 0.000000     0     0     0     0
ETA25:  0.0000000 0.0000000    0    0    0    0    0    0    0     0     0     0     0 0.000000 0.000000     0     0     0     0     0     0 0.000000     0     0     0     0
ETA26:  0.0000000 0.0000000    0    0    0    0    0    0    0     0     0     0     0 0.000000 0.000000     0     0     0     0     0     0 0.000000     0     0     0     0

$sigma
$...
    [,1] [,2] [,3]
1:     0    0    0
2:     0    0    0
3:     0    0    0

> vpcresult <-
+   model %>%
+   mrgsim()
> vpcresult <-
+   model %>%
+   drop_re() %>%
+   mrgsim()
> vpcresult <-
+   model %>%
+   drop_re() %>%
+   data_set(vpc_dataset) %>%
+   mrgsim()
Dropping non-numeric columns: C EXCLUDE_TEXT Treatment_Text USUBJID CMT_name
> 
billdenney commented 6 years ago

Also, a simpler simulation where I'm generating the dataset by hand works without issue (without drop_re).

Is there a chance that I'm passing in a variable that I'm also defining in the model and that's causing the problem? (i.e. a column name in vpc_dataset is the same as a parameter that may be calculated in the model. I'm checking that now.)

billdenney commented 6 years ago

Might it have something to do with a potentially repeated ID number in the dataset?

billdenney commented 6 years ago

When I change n_sub from 1 to 2 in the code below, it gets the error. Is there anything obvious there?

## Visual Predictive Checks
n_sub <- 1

vpc_dataset <-
  d %>%
  filter(is.na(EXCLUDE_TEXT)) %>%
  full_join(expand.grid(ID=unique(d$ID),
                        IDNEW=1:n_sub) %>%
              mutate(IDNEW=as.numeric(factor(paste(ID, IDNEW))))) %>%
  rename(ID_ORIG=ID,
         ID=IDNEW) %>%
  arrange(ID_ORIG, ID, TIME, desc(EVID), CMT)

vpcresult <-
  model %>%
  data_set(vpc_dataset) %>%
  mrgsim()

P.S. I'm going for the record of most comments in the shortest time. :)

billdenney commented 6 years ago

Now it's getting weird: When I remove sorting by the ID_ORIG column, I can go up to n_sub <- 6 (n_sub=6 has no error; n_sub=7 has an error).

billdenney commented 6 years ago

OK, I think I found the underlying error: some of the TIME column values were NA. Filtering those out fixed the error.

And on top of that, I was capturing TIME to use time after dose in one of the equations which made the differential equation results all NaN. I'll try to make a minimally reproducible example shortly (unless this makes the fix obvious to you and you don't need more).

kylebaron commented 6 years ago

Ok; hrm ... I was thinking it was some issue with $OMEGA, but now maybe it's $SIGMA ; not sure. I'll try to reproduce as well. There isn't any checking for NA values going on; usually when it's a covariate, just fills the simulation with NA. Wasn't aware that missing TIME could fail so cryptically.

billdenney commented 6 years ago

With this, it's raising NA or NaN to a power at one spot, and at a different point in the model, it is taking the square root of NA or NaN. These are then being multiplied by EPS(1) (or EPS(2) or EPS(3)).

As I look more, there could be a lot of math going on with NA inputs.

Another thought that may have to do with the sorting. If all data were NA and had math occur (e.g. pow(), sqrt(), exp(), or one of the basic math operators +, -, *, or /) instead of just some of the data being NA could that narrow down the problem?

kylebaron commented 6 years ago

I couldn't get the matrix access problem; but did see this (sort of what I was expecting)

library(mrgsolve)

mod <- mread("popex", modlib())
#> Compiling popex ... done.

mod <- mrgsolve:::house()

data(exTheoph)

df <- exTheoph

df$time[c(4, 12, 33)] <- NA

mod <- omat(mod, dmat(rep(1, 4)))

mod <- smat(mod, dmat(1))

mrgsim(mod, data = df)
#> Model:  housemodel 
#> Dim:    132 x 7 
#> Time:   NA to NA 
#> ID:     12 
#>      ID time      GUT   CENT  RESP       DV       CP
#> [1,]  1 0.00 4.020000 0.0000 13.33 0.000000 0.000000
#> [2,]  1 0.25 3.191658 0.8262 13.33 0.009952 0.006478
#> [3,]  1 0.57 2.375462 1.6345 13.32 0.031590 0.012815
#> [4,]  1   NA       NA     NA    NA       NA       NA
#> [5,]  1 2.02 0.623063 3.3100 13.30 0.063974 0.025953
#> [6,]  1 3.82 0.118310 3.6872 13.30 0.031993 0.028910
#> [7,]  1 5.10 0.036304 3.6754 13.30 0.064719 0.028817
#> [8,]  1 7.03 0.006114 3.5666 13.30 0.038954 0.027964

But I also saw something different with a different model

library(mrgsolve)

mod <- mread("popex", modlib())
#> Compiling popex ... done.

data(exTheoph)

df <- exTheoph

df$time[c(4, 12, 33)] <- NA

mod <- omat(mod, dmat(rep(1, 3)))

mod <- smat(mod, dmat(1))

mrgsim(mod, data = df)
#> Model:  popex 
#> Dim:    132 x 9 
#> Time:   NA to NA 
#> ID:     12 
#>      ID time   GUT  CENT    CL     V  IPRED     DV    ECL
#> [1,]  1 0.00 4.020 0.000 1.932 7.799 0.0000 0.0000 0.5624
#> [2,]  1 0.25 2.938 1.048 1.932 7.799 0.1343 0.1174 0.5624
#> [3,]  1 0.57 1.966 1.899 1.932 7.799 0.2435 0.0792 0.5624
#> [4,]  1   NA    NA    NA 1.932 7.799     NA     NA 0.5624
#> [5,]  1 2.02    NA    NA 1.932 7.799     NA     NA 0.5624
#> [6,]  1 3.82    NA    NA 1.932 7.799     NA     NA 0.5624
#> [7,]  1 5.10    NA    NA 1.932 7.799     NA     NA 0.5624
#> [8,]  1 7.03    NA    NA 1.932 7.799     NA     NA 0.5624
billdenney commented 6 years ago

I haven't been able to make a minimal reproducible example, but I have made something else that gives an unusual error:

code_test <- '
$PARAM THETA1=1.0, THETA2=1.0, THETA3=1.0, COVARIATE1=-1.0, COVARIATE2=-1.0
$CMT CENTRAL
$OMEGA @annotated
ETA1: 0.01: foo
$SIGMA 0
$MAIN
// Record the dosing time
if (NEWIND < 2 || EVID == 1.0 || EVID == 4.0)
  double TDOSE = TIME;
double K10=THETA1*pow(-0.1, COVARIATE1);
double V1=THETA2*pow(-0.1, COVARIATE2)*exp(ETA1);
double PROP_ERR=THETA3;

$ODE
double K10TIME = K10*exp(-2*(SOLVERTIME-TDOSE));
dxdt_CENTRAL = -K10TIME*CENTRAL;

$TABLE
double CP = CENTRAL/V1;
double CPeps  = CP  + EPS(1)*sqrt(pow(PROP_ERR*CP, 2));
$CAPTURE CP CPeps
'

model_raw <- mcode("code_test", code_test)

model_param <- list(THETA1=1, THETA2=1, THETA3=1)

model <-
  model_raw %>%
  param(model_param) %>%
  omat(as_bmat(0.05))

simdata <-
  data.frame(ID=1,
             cmt=1,
             time=c(0, 0:5),
             COVARIATE1=2,
             COVARIATE2=3,
             evid=c(1, rep(0, 6)),
             amt=1)

simresult <-
  model %>%
  data_set(bind_rows(data.frame(ID=1, evid=1, time=NA), simdata)) %>%
  mrgsim()
Error in tran_mrgsim(<S4 object of class "mrgmod">, data = c(1, 1, 1,  : 
  vector::_M_range_check: __n (which is 4294967295) >= this->size() (which is 1)
billdenney commented 6 years ago

@kylebmetrum, To help with debugging the original issue, is there a specific part of the code that I can examine to find the problem? Is there a way to narrow it down to a line number, an operation, or a variable?

kylebaron commented 6 years ago

@billdenney Sorry for the delay ... our internet service was out overnight and just came back.

I know where the issue is - trying to access a vector at index NA. There needs to be a check for missing values and an error generated at least for some of these data items. I'm afraid that if(any(is.na(simdata))) stop() is going to be too heavy handed at first since people probably pass missing data through in the data set and it has no consequence. But maybe that should be the goal eventually. I'd use it as a check in the meantime.

I'll put in some code to start requiring it for cmt, time, etc ...

billdenney commented 6 years ago

@kylebmetrum, no worries at all on the delay. Once I got to the point that removing data with NA time fixed my issue, it was more about helping make mrgsolve more robust.

While I think of mrgsolve as mostly an exceptionally useful ODE-integrating tool, I also see its value in time-independent simulations (e.g, if I'm fitting an Emax model). With that, I agree that if(any(is.na(simdata))) stop() is too heavy-handed. And, there may be times that I want to pass in NA[1].

Warnings that NA are there in required columns seems reasonable, and giving errors if NA are in required pairs (e.g. evid %in% c(1, 4, 8) and is.na(amt) seems like it would be an error).

[1] Though I'd have to look up how to test for NA and run a model with NA in mrgsolve. I've not looked, but if it doesn't exist in the docs, NA and NaN handling would be a useful topic. For me, how to test for equality to NA within a code block (probably $MAIN but generally in any code) would be useful.

kylebaron commented 6 years ago

@billdenney You can test for NA with isnan

> library(mrgsolve)
>   
>   code <- '
+ $PARAM FOO = 1
+ 
+ $MAIN
+ if(isnan(FOO)) {
+   report("Foo is not a number");
+   report("time ", TIME);
+ }
+
+ $CAPTURE FOO
+ '
>   
>   mod <- mcode("nan", code)
Compiling nan ... done.
>   
>   data <- data.frame(amt = 0, time = c(1,2,3,4), cmt = 0,
+                      FOO = c(1,2,NA,4), evid = 0, ID = 1)
>   
>   mrgsim(mod, data = data, end = -1)
from report Foo is not a number
time  3
Model:  nan 
Dim:    4 x 3 
Time:   1 to 4 
ID:     1 
     ID time FOO
[1,]  1    1   1
[2,]  1    2   2
[3,]  1    3  NA
[4,]  1    4   4

It's just tricky; the parameter update will happen under the hood no matter what; so if you have some other path to take, you can do it. But there isn't an "automatic" way to get back to the last value before you see the nan etc. I dunno ... maybe this is enough to get done what you want to do.

billdenney commented 6 years ago

Thanks!