inla.surv and A matrix problem with NPredictor

GoogleCodeExporter commented 9 years ago

What steps will reproduce the problem?
--------------------------------------------
see code below.

What is the expected output? What do you see instead?
----------------------------------------------------------
A working model

Error in inla(formula, family = "exponential", data = data, control.predictor = 
list(A = A),  : 
  From `data', I found that NPredictor = 200 , but dim(A)[2] = 100

What version of the product are you using? On what operating system?
--------------------------------------------------------------------
INLA build date ...: Wed Dec 7 18:22:47 CET 2011 
    Revision...........: hgid: 74fcd284508c date: Wed Dec 07 09:30:34 2011 +0100 

Please provide any additional information below.
--------------------------------------------------
The problem seems to come from the fact that when setting NPredictor
on line 595 it sets the maximum dim(obj)[1] of the obj in data. But in data we 
have the observations ($time $event) which in this case are longer than the 
covariates ($x).

Jonas Wallin

code
---------------
    rate = log(1 / 10) # log of lambda
    n = 200
    x <-  cumsum(rnorm(n/2) * 0.05) 
    i_x = 1:(n/2)
    Q = toeplitz(c(2,-1, rep(0,(n/2)-3),-1))
    Q[1,1] = 1.1 #adding small 0 prior on first observation
    Q[n/2,n/2] = 1
    Q[n/2,1] = Q[1,n/2] = 0

    A = matrix(data = 0, ncol = n/2, nrow = n)
    count = 1
    for(i in 1:(n/2)){
        A[count,i] = 1
        count = count + 1
        A[count,i] = 1
        count = count + 1
    }
    rate_x = exp(A%*%x + rate)  
    time_1 <- rexp(n, rate = rate_x) # time of death
    time_2 <- rexp(n, rate = exp(rate)) # censor time
    event = (time_1 <= time_2 ) * 1
    time = apply(cbind(time_1,time_2),1,min)
    data = list(time = time, event = event, x = i_x)
    source("inla2.R")
    formula <- inla.surv(time,event)  ~ f(x, model="generic", Cmatrix = Q)  - 1 
    model   <- inla2(formula, family = "exponential" ,data = data, control.predictor= list(A=A), debug =FALSE )

Original issue reported on code.google.com by jonas.wa...@gmail.com on 9 Dec 2011 at 11:12

GoogleCodeExporter commented 9 years ago

line 595 in inla.R that is.

Original comment by jonas.wa...@gmail.com on 9 Dec 2011 at 11:13

GoogleCodeExporter commented 9 years ago

Issue 15 has been merged into this issue.

Original comment by finn.lin...@gmail.com on 10 Dec 2011 at 7:49

GoogleCodeExporter commented 9 years ago

I think the correct solution should involve looking specifically at all the 
named variables in the formula, noting if they are on the LHS or RHS of the 
formula.  Also, it needs to ignore any variables it does not know about.  E.g. 
the variable for the Cmatrix should be supplied in the data list, and as well 
as all other arguments to the f-functions.

Original comment by finn.lin...@gmail.com on 10 Dec 2011 at 7:55

Changed state: Accepted

GoogleCodeExporter commented 9 years ago

Have added test files:
sandbox/R/issue-14.R
sandbox/R/issue-15.R

Original comment by finn.lin...@gmail.com on 10 Dec 2011 at 8:19

GoogleCodeExporter commented 9 years ago

An alternative solution if I understand thing correctly might be just to look 
at the first variable on the RHS:
if it is not an f-function it must be the length of Npredict ?
if it is     an f-function the first specified term must have length of 
Npredict?

Original comment by jonas.wa...@gmail.com on 10 Dec 2011 at 9:31

GoogleCodeExporter commented 9 years ago

The problem is that NPredictor is determined by the length of the data vectors, 
so it's a chicken and egg problem.  We're trying to make the formula parsing 
code in R do things it was never designed to do, as it assumes that all vectors 
have the same length, and doesn't properly distinguish between LHS and RHS 
variables.  The code now handles the normal case where there is only a single 
variable on the LHS, but it seems it doesn't handle at least some of the more 
general cases.

Original comment by finn.lin...@gmail.com on 10 Dec 2011 at 9:42

GoogleCodeExporter commented 9 years ago

> The problem is that NPredictor is determined by the length of the data 
vectors, so it's a chicken and egg problem.

Yes but NPredictor and MN[2] if everything is correct should be the same value?

>as it assumes that all vectors have the same length, and doesn't properly 
distinguish >between LHS and RHS variables

But for inla everything on the RHS must be of equal length right?
If so one could just find the first thing on the RHS and take it's length.

---------------------------------------------------------------------------
about the issue-15.R:
For the test-file issue-15.R
the problem will be removed if changing line 552 in inla.r:
data = inla.remove(as.character(formula[2]), data) 
to something like: 

if(have.surv){
LHS_names = strsplit(as.character(formula[2]), split="inla.surv\\(")[[1]][2]
LHS_names = strsplit(LHS_names, split=")")[[1]][1]
LHS_names = inla.trim(strsplit(LHS_names,split=",")[[1]]) #or some other trim
}
else{ LHS_names = as.character(formula[2])}
data = inla.remove(LHS_names, data)  

Comment: inla.remove(as.character(formula[2]), data)  is used on line 417 so 
should also be changed.

But if you allow for variables of Cmatrix,f-functions to be in the supplied 
data list
then  
gp = inla.interpret.formula(formula, data=data,  data.model = data.model)
will fail since it changes since it does not allow varying length of objects in 
data.

----------------------------------------------------------------------------

ps
in inla.R line 621 
the code does the same thing in both
if and else:
y...fake = c(rep(Inf, ny))
   if (debug)
     print(paste("y...fake has length", length(y...fake)))

Original comment by jonas.wa...@gmail.com on 10 Dec 2011 at 12:26

GoogleCodeExporter commented 9 years ago

Thanks for the suggested change.  Håvard will have to take a look.
When you say "the problem will be removed", does that mean "the code will do 
the right thing i all situations", or just "the error message will go away in 
these test cases"? ;-)

But the issue is more fundamental than just the survival models, and interacts 
with mis-features in the R language, that allows it to find variables that it 
shouldn't see, etc.  The formula parser is fundamentally broken by design with 
regards to what inla has evolved into.

Original comment by finn.lin...@gmail.com on 10 Dec 2011 at 12:39

GoogleCodeExporter commented 9 years ago

>Thanks for the suggested change.  Håvard will have to take a look.
>When you say "the problem will be removed", does that mean "the code will do 
the >right thing i all situations", or just "the error message will go away in 
these test cases"? ;-)

More in the test cases :) 
But I think it will solve the inla.surv case (I assume that you actually don't 
use event since the observations now is y...fake (which all -inf)). 
But in the line:
LHS_names = inla.trim(strsplit(LHS_names,split=",")[[1]]) #or some other trim
is probably better to use an other trim instance str_trim() (from stringr) than 
inla.trim() to remove the whitespaces before and after.

But if one creates very evil variable names like " event" one runs into problem.

inla.surv(time, event) ~ with data$event from (works)
inla.surv(time, event) ~ with data$" event" (causes error)

Original comment by jonas.wa...@gmail.com on 10 Dec 2011 at 1:51

GoogleCodeExporter commented 9 years ago

Yes, I think your fix would handle identifying "time" and "event" from the data 
vector (which would then both need to be put back again, like y...fake), so it 
would be a partial fix.

Original comment by finn.lin...@gmail.com on 11 Dec 2011 at 8:41

GoogleCodeExporter commented 9 years ago

These issues arise only when an "A matrix" is used.  The parsing in inla() for 
the A-matrix models would be much easier if LHS and RHS variables were already 
separated.  Since there is a new interface "inla.stack" to handle constructing 
data vectors for such models, the easiest and cleanest solution is likely to 
supply a data object generated by this system which already separates LHS from 
RHS, and only joins them together because inla() wants them in one list, to 
then only try to separate them again.  inla.stack can also handle separating 
out other data variables, such as spde model objects, etc.

The existing data parsing could remain largely unaffected.

Original comment by finn.lin...@gmail.com on 11 Dec 2011 at 9:02

GoogleCodeExporter commented 9 years ago

The easy way out is to group the surv-object into one object in the data, like 
this

y.surv = inla.surv(time, event)
data = list(y.surv = y.surv, x = i_x)
formula <- y.surv  ~ f(x, model="generic", Cmatrix = Q)  - 1

then

model   <- inla(formula, family = "exponential" ,data = data, 
control.predictor= list(A=A), debug =FALSE, verbose=TRUE )

runs fine.

Original comment by havard.rue on 11 Dec 2011 at 10:13

GoogleCodeExporter commented 9 years ago

Using the previous trick where I define y.surv, then Finn's 'stack' fails, so 
its possible that I just pushed the problem to inla.stack().....

stack=
    inla.stack(data=list(y.surv = y.surv),
               A=A,
               effects=list(x=i_x),compress=FALSE)
inla.data = inla.stack.data(stack, Q=Q)
inla.A = inla.stack.data(stack)
model   <- inla(formula, family = "exponential" ,data = inla.data, 
control.predictor= list(A=inla.A), debug =FALSE, verbose=TRUE )

gives:

 stack=
+     inla.stack(data=list(y.surv = y.surv),
+                A=A,
+                effects=list(x=i_x),compress=FALSE)
Error in parse.input.list(inla.ifelse(is.data.frame(data), list(data),  : 
  Effect block 1:
Mismatching row sizes: 5, n.A=200

Original comment by havard.rue on 11 Dec 2011 at 10:21

GoogleCodeExporter commented 9 years ago

I must add to comment 12, that grouping the data into one object in the 
data-list, is the intended usage. But, its not written down anywhere... my 
fault.

Original comment by havard.rue on 11 Dec 2011 at 10:22

GoogleCodeExporter commented 9 years ago

inla.stack needs all data-related vectors, including "E" for poisson 
likelihoods, so it needs to know about all inputs to inla.surv.  inla.stack 
needs to manipulate all vectors related to the model, not only those used by 
the formula.
For the inla.surv models as in Jonas example, at a minimum, it needs
inla.stack(data=list(time=time,event=event),...)

Hiding the vectors in an object is not a solution.

Original comment by finn.lin...@gmail.com on 11 Dec 2011 at 11:10

GoogleCodeExporter commented 9 years ago

I agree that comment 12 provides a temporary workaround for models not using 
inla.stack.

Original comment by finn.lin...@gmail.com on 12 Dec 2011 at 9:52

grantbrown / inla

inla.surv and A matrix problem with NPredictor #14