Predicting with missing values in non-Gaussian case

helske / KFAS

KFAS: R Package for Exponential Family State Space Models


Predicting with missing values in non-Gaussian case #32

Closed jhal324 closed 5 years ago

jhal324 commented 6 years ago

I have noticed that when predicting new data with missing values in a non-Gaussian model, predict.SSModel throws an error about NAs in a foreign function call. The error comes from a call to .Fortran() (in my case line 329 of predict.SSModel.R). It is fixed simply by changing the default argument NAOK = FALSE to NAOK = TRUE.

It has occurred to me that this could be deliberate due to the issues surrounding prediction with missing values, but in my case I had no choice. I thought it would be wise to let you decide how to proceed with this issue - perhaps allow NA values and give a warning?

Many thanks. P.S. I'd suggest that line #329 is not the only case of this problem.

helske commented 6 years ago

Can you give a reproducible example of this? Line 329 is related to the variance computation based on importance sampling. There is likely some issue in computing the weights in the previous stages, meaning that importance sampling probably does not work in your case for some reason (poor model specification, the approximation does not work, numerical issues, ...), leading, for example, to infinite weights.

For these cases there should probably be some checks in place after the importance sampling, along with a more informative error message.
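A check of this kind could look roughly like the sketch below. `check_weights` is a hypothetical helper written for illustration, not a function in KFAS, and the wording of the messages is an assumption.

```r
# Hypothetical helper, not part of KFAS: validate importance-sampling
# weights before they are used in later computations.
check_weights <- function(w) {
  if (length(w) == 0L) stop("No importance weights were supplied.")
  bad <- !is.finite(w)  # TRUE for NA, NaN, Inf and -Inf
  if (any(bad)) {
    stop(sprintf(
      paste("%d of %d importance weights are non-finite;",
            "the approximating model may be poor or the inputs may contain NA values."),
      sum(bad), length(w)
    ))
  }
  if (sum(w) <= 0) stop("Importance weights sum to zero; cannot normalise.")
  w / sum(w)  # normalised weights
}

w_ok <- check_weights(c(0.2, 0.3, 0.5))

# A weight vector with Inf and NA is rejected with a descriptive message
# instead of failing later inside compiled code.
res <- tryCatch(check_weights(c(0.2, Inf, NA)),
                error = function(e) conditionMessage(e))
```

Failing early like this keeps the error close to its cause, rather than surfacing as an NA complaint from a later foreign function call.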

jhal324 commented 6 years ago

Yes, sorry for the delay - I have been away. I've attached a snippet of the problem (I don't have permission to share fully).

predictProblem.zip

helske commented 6 years ago

Thanks. There are a couple of issues in your example:

First, the argument for the new data is called newdata, whereas in your script you use the argument testData, which is ignored as there is no such argument for the predict method.
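The silent behaviour can be reproduced with a toy S3 method. This is only a sketch: `predict.toy` and the class `"toy"` are made up for illustration, and the only assumption about the real method is that it, like this one, accepts `...`.

```r
# Toy S3 method, not KFAS code: 'newdata' is a named argument, and '...'
# swallows anything that does not match a formal argument.
predict.toy <- function(object, newdata, ...) {
  if (missing(newdata)) "in-sample" else "out-of-sample"
}
obj <- structure(list(), class = "toy")

r1 <- predict(obj, testData = 1:3)  # unrecognised name: absorbed by '...'
r2 <- predict(obj, newdata = 1:3)   # matched as intended
```

Here `r1` is computed as if no new data had been supplied, and no warning is raised, which is why the mistake is easy to miss.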

Second, as you have NA values in the Z array, it is impossible to predict the corresponding values of y (as y_t = link_function(Z_t*alpha_t)). This leads to missing values in the importance samples, which then cause problems in the later computations.
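A small numerical sketch shows how a single NA in a row of Z propagates. The numbers are made up, and a Poisson model with a log link is assumed purely as an illustration.

```r
# Toy numbers, not taken from the attached example.
alpha <- c(0.5, 1.2)                  # state vector at time t
Z_ok  <- matrix(c(1, 0.3), nrow = 1)  # complete row of Z
Z_na  <- matrix(c(1, NA),  nrow = 1)  # row with a missing covariate

# Assumed Poisson model with log link: mean = exp(Z_t %*% alpha_t)
mu_ok <- exp(drop(Z_ok %*% alpha))    # finite
mu_na <- exp(drop(Z_na %*% alpha))    # missing, because the linear predictor is
```

Any quantity built from `mu_na`, such as a simulated sample or a weight, is missing as well.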

While you cannot get predictions for those y values, you can still get predictions for the others if you set the NA values in Z to, for example, zero. That shouldn't affect the predictions of the other observations, as the corresponding values of y are still NA.
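The workaround can be sketched on a plain array. The array below is a made-up stand-in for the Z component of a model; with an actual model object the equivalent step would presumably be applied to its `Z` element. Masking the affected time points at the end is an extra step added here for clarity.

```r
# Toy p x m x n array standing in for Z
# (p = 1 series, m = 2 states, n = 4 time points); values are made up.
Z <- array(c(1, 0.3,  1, NA,  1, 0.7,  1, NA), dim = c(1, 2, 4))
alpha <- matrix(c(0.5, 1.2), nrow = 2, ncol = 4)  # constant states for simplicity

na_times <- which(apply(Z, 3, anyNA))  # time points where Z has missing values
Z_filled <- Z
Z_filled[is.na(Z_filled)] <- 0         # placeholder value, as suggested above

# Linear predictor at each time point
eta <- sapply(seq_len(dim(Z)[3]),
              function(t) sum(Z_filled[1, , t] * alpha[, t]))
eta[na_times] <- NA  # predictions at these time points are not meaningful
```

The linear predictor at the time points with complete Z is unchanged by the fill, while the affected time points are recorded in `na_times` so their predictions can be discarded.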

I have now added the NAOK argument anyway, so things should now work as expected.