steve-the-bayesian / BOOM

A C++ library for Bayesian modeling, mainly through Markov chain Monte Carlo, but with a few other methods supported. BOOM = "Bayesian Object Oriented Modeling". It is also the sound your computer makes when it crashes.
GNU Lesser General Public License v2.1
36 stars 14 forks source link

predict method does not use old.data correctly. #26

Open steve-the-bayesian opened 5 years ago

steve-the-bayesian commented 5 years ago

Dear Steve,

I write you from CIG group of the Technical University of Madrid. We
are working in a project where we are trying to apply Bayesian
Structural Time Series to generate a model that identifies a time
series structure.

Our objective is not forcasting from the last known value in the
training data, instead we aim to predict using other starting evidence
with previous time stamps (that can be equal to the ones of the
training data).

We have some sample of the same time series in the same timestamp, we
use this data to train the model and then we would evaluate the
results of the prediction over each sample, so that we are using the
predict function with the parameter olddata defined, but we have some
problems:

  1. The predict function doesn´t use olddata without regression component We have included lags of the time series as regressor variables to solve that. We are working now with a model that has regressor variables, so we
    add the newdata parameter to the predict function. The function gives
    us a solution using the last values as first data, but we want to
    introduce information of previous data to force a solution focus on
    one of the sample of the time series.

  2. When we combine olddata with new data to get a prediction, the
    function shows the next error:

Error: number.of.time.points > 0 is not TRUE In addition: Warning message: In is.na(observed.data$response) : is.na() applied to non-(list or vector) of type 'NULL'

We have debugged the code and it seems like the predict function
removes the response variable on the .FormatObservedDataForPredictions
step. Looking into this function, the function .ExtractResponse
doesn´t get the response variable name from the formula parameter.

We are not sure if this is a code problem or that we don´t know how to
use the olddata parameter properly because we haven´t found any example.

Thank you for your attention, Gabriel and David

gvalverd commented 5 years ago

First problem

The olddata parameter don´t combine with olddata.tiemstamp parameter, and don´t change enough the results.

Get data

library(tidyverse, quietly = TRUE)
library(bsts, quietly = TRUE)  
library(data.table)
data(iclaims)
.data <- initial.claims
claims <- .data$iclaimsNSA
plot(claims, ylab = "")

Fit model

In this step just fit a bsts model that include Local linear trend and seasonal components.

model_components <- list()
summary(model_components <- AddLocalLinearTrend(model_components, 
                                                y = claims))
summary(model_components <- AddSeasonal(model_components, y = claims, 
                                        nseasons  = 52))
fit <- bsts(claims, model_components, niter = 2000)

Predictions

First test

Error obtained when trying to include the olddata and olddata.timestamps parameters. Also, include de validation of that errors it is not real, number of observations and lenght of timestamp are equal.

pred_wolddata_ts <- try(predict(fit, 
                            horizon = 100, 
                            burn = burnin, 
                            quantiles = c(.05, .95),
                            olddata = claims[50:80],
                            olddata.timestamps = index(claims[50:80]),
                            seed = 2408))

print(paste0("Difference between length of data and timestamp: ",length(index(claims[50:80])) - length(claims[50:80])))

pred_wolddata_ts

Error : number.of.observations == length(timestamps) is not TRUE [1] "Difference between length of data and timestamp: 0" [1] "Error : number.of.observations == length(timestamps) is not TRUE\n" attr(,"class") [1] "try-error" attr(,"condition") <simpleError: number.of.observations == length(timestamps) is not TRUE>

Second test

Second experiment, in this case removing timestamp, just olddata parameter has been defined using the 50:80 observations from training data.

pred_wolddata <- try(predict(fit, 
                            horizon = 100, 
                            burn = burnin, 
                            quantiles = c(.05, .95),
                            olddata = claims[50:80],
                            seed = 2408))`

and a second one with training data from 20 to 30 that is a very diferente step in timeseries, we want to validate the effect of alldata in the prediction

pred_wolddata2 <- predict(fit, 
                          horizon = 100, 
                          burn = burnin, 
                          quantiles = c(.05, .95),
                          olddata = claims[20:30],
                          seed = 2408)
plot(pred_wolddata2)

But the results are equals:

 print(paste0("Acumulated diference between training[20:30] and training[50:80] data:",sum(pred_wolddata$mean - pred_wolddata2$mean)))

[1] "Acumulated diference between training[20:30] and training[50:80] data:0"

And no very different from don´t use olddata:

[1] "Acumulated diference between without olddata prediction and training[20:30] data: 60.96"

image

gvalverd commented 5 years ago

Second problem

After show some problems with the olddata parameter, let`s explain the second issue.

It is not possible combine newdata parameter for regression and add the olddata parameter to define the initial situation and values to generate a prediction.

We tried use newdata as a alternative to olddata adding some lags of the objective variable, that contain information from past. This alternative solve in some way the forecast problem for diferents timestamps but return and error in combination with olddata parameter.

claims_dt <- as.data.table(claims)
claims_dt[, x_1 := shift(x,n = 1,type = "lag")]
claims_dt[, x_2 := shift(x,n = 2,type = "lag")]
claims_dt[, x_3 := shift(x,n = 3,type = "lag")]
claims_dt[, x_4 := shift(x,n = 4,type = "lag")]
claims_dt <- claims_dt[!1:4,]

Fit model

model_components <- list()
summary(model_components <- AddLocalLinearTrend(model_components, 
                                                y = claims))
summary(model_components <- AddSeasonal(model_components, 
                                        y = claims, 
                                        nseasons  = 52))
fit_reg <- bsts(data = claims_dt[1:400,],
            formula = x ~.,
            state.specification = model_components, 
            niter = 2000)

Predictions

First test: Same problem that before issue
pred_w_newolddata_ts <- try(predict(fit, 
                          horizon = 100, 
                          burn = burnin, 
                          quantiles = c(.05, .95),
                          olddata = claims_dt[50:80,],
                          olddata.timestamps = index(claims_dt[50:80,]),
                          newdata = claims_dt[401:452,!"x",with = F],
                          timestamps = 401:452,
                          seed = 2408))

pred_w_newolddata_ts
print(paste0("Difference of newdata length and newdata.timestamp length):",length(401:452) - length(claims_dt[401:452,!"x",with = F][[1]])))
print(paste0("Difference of olddata length and olddata.timestamp length):",length(index(claims[50:80])) - length(claims[50:80])))

Error : number.of.observations == length(timestamps) is not TRUE [1] "Error : number.of.observations == length(timestamps) is not TRUE\n" attr(,"class") [1] "try-error" attr(,"condition") <simpleError: number.of.observations == length(timestamps) is not TRUE> [1] "Difference of newdata length and newdata.timestamp length):0" [1] "Difference of olddata length and olddata.timestamp length):0"

So as before remove olddata.timestamp:

pred_w_newolddata <- try(predict(fit_reg, 
                         horizon = 100, 
                         burn = burnin, 
                         quantiles = c(.05, .95),
                         olddata = claims_dt[20:30,],
                         newdata = claims_dt[401:452,!"x",with = F],
                         timestamps = 401:452,
                         seed = 2408))
pred_w_newolddata

When we combine olddata with new data to get a prediction, the function shows the next error:

Error: number.of.time.points > 0 is not TRUE In addition: Warning message: In is.na(observed.data$response) : is.na() applied to non-(list or vector) of type 'NULL'

We have debugged the code and it seems like the predict function removes the response variable on the .FormatObservedDataForPredictions step. Looking into this function, the function .ExtractResponse doesn´t get the response variable name from the formula parameter. bsts_problems.pdf