christophsax opened 5 years ago
I get that `est$Y_median` is better than `est$values`. Why are we not storing `est$Y_median` for all series then? It is only `Ystore`, not `Y_median`, that is big.
And is there any reason to keep `Ystore` if `keep_posterior = "INDPRO"` is set? After all, that is the big thing, and it appears we need just `Y_median`?
Answering myself: the posterior distribution may be interesting in itself; that is why we may want to plot it. So we could still use `keep_posterior = "INDPRO"` for that, but we could always store `Y_median` for all series in any case, because these are the better forecasts?
We could, but getting `Ymedian` for every series means we would need to keep `Ystore` for every series as well, at least in the C++ code (`Ymedian` is just the median values of the whole distribution `Ystore`, so to get the medians you need the whole distribution). There is no getting around storing a huge object.
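As a small sketch of why the whole distribution is needed (toy dimensions and the names `Ystore`/`Ymedian` mirror the discussion; this is not the package's actual code):

```r
# Toy dimensions: n observations, m series, `runs` posterior draws
n <- 10
m <- 3
runs <- 200
Ystore <- array(rnorm(n * m * runs), dim = c(n, m, runs))

# Ymedian collapses the draw dimension: an n x m matrix of pointwise medians.
# Every one of the `runs` slices enters each median, so the full Ystore
# array must be available -- draws cannot be discarded early.
Ymedian <- apply(Ystore, c(1, 2), median)
dim(Ymedian)
#> [1] 10  3
```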
Let's say we have a big model with 50 series; that would take us:

```r
n <- 500
m <- 50
runs <- 1000
obj <- array(rnorm(n * m * runs), dim = c(n, m, runs))
format(object.size(obj), units = "Mb")
#> [1] "190.7 Mb"
```
Created on 2019-07-15 by the reprex package (v0.3.0)
Since this object is temporary, it seems fine for modern memory sizes.
I think the advantages are substantial:

- `predict()` would just point to `Ymedian`, and we don't have to explain the awkward situation of having two kinds of forecasts.
- Predicted values are more accurate, i.e., have lower MSE.
- A minimal forecast model does not require `keep_posterior`, but is simply `dfm` around any multiple time series. That's the simplicity I am looking for.
I thought there might be a trick to get the median iteratively, but I guess I was wrong. Still, something like 70 MB for a usual 20-series model seems fine to me.
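For intuition on why there is no iterative trick: the mean can be updated one draw at a time in constant memory, but the exact median depends on the order statistics of the entire sample. A quick sketch:

```r
set.seed(1)
draws <- rnorm(1000)

# The mean has a constant-memory online update ...
running_mean <- 0
for (i in seq_along(draws)) {
  running_mean <- running_mean + (draws[i] - running_mean) / i
}
stopifnot(isTRUE(all.equal(running_mean, mean(draws))))

# ... but the exact median needs all draws retained until the end:
# there is no update rule that works from a summary of past draws alone.
median(draws)
```

(Approximate streaming quantile algorithms exist, but they trade away exactness.)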
Moved discussion from code.

**Seth's comment:**
I think there is some confusion here between using the posterior median for parameters and the posterior median for the actual series of interest. Our MCMC simulations work as follows:
(1) Simulate factors. From this we can also derive simulated fitted values of observed series (just the loadings times the factors). By default none of this is stored. However, if `keep_posterior` is specified, once the algorithm is through the "burn" iterations and into the "simulate" iterations, the program will store fitted values for the `keep_posterior` series only, at each iteration.

(2) Simulate parameters. Once the algorithm is through the "burn" iterations and into the "simulate" iterations, these are always stored.
(3) Once simulations are complete, we collect posterior medians for all parameter values.
(4) Posterior medians are then fed into the disturbance smoother along with observations to get fitted values (including nowcasts/forecasts and missing observations) for all series. Because we are using posterior median parameter estimates, these fitted values are subject to identification uncertainty (in addition to the usual uncertainty from parameter estimation and shocks to the model). Simulated observations from (1), by contrast, if `keep_posterior` is specified, do not suffer from identification uncertainty. Thus posterior medians of simulated observables will always be more accurate.

Just to be clear, errors in fitted values enter the model in 3 ways (ignoring model misspecification):
(1) Shocks to the model. Suppose we knew the “true” model parameters (B, Q, H and R) --- or from a Bayesian point of view, suppose model parameters had zero variance. Estimates would still not be exactly accurate due to shocks to the model from Q and R (shocks to the transition equation and observation equation respectively).
(2) Parameter estimation. We don’t know the “true” parameters. Our Bayesian estimates instead give us a distribution for parameters. Because using posterior medians is not exactly right, this introduces a second source of error in the model.
(3) Identification uncertainty. Factor models are underidentified --- there are many different rotations of the parameters that will fit observed data. For this reason there is no guarantee our simulations will be drawing from the same rotation of the parameters at each iteration. This introduces yet another source of error in the model. However, using the posterior median for simulated observations eliminates this source of error. This is because each individual simulation is a valid representation of the model --- the problem only arises when drawing parameters for many simulations which may wobble around a bit. Imagine, for example, using the transition matrix B from one rotation of the model and the loadings H from a different rotation. Obviously things won’t line up quite right.
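The rotation problem in (3) can be illustrated directly: rotating the loadings and factors by any invertible matrix leaves the implied observables unchanged, so different draws may live in different rotations, and mixing pieces across rotations does not line up. A toy sketch (all names hypothetical, not the package's code):

```r
set.seed(42)
n_obs <- 100
H <- matrix(rnorm(3 * 2), nrow = 3)      # loadings: 3 series, 2 factors
Fct <- matrix(rnorm(2 * n_obs), nrow = 2)  # factor draws

R <- matrix(c(0.6, -0.8, 0.8, 0.6), nrow = 2)  # an invertible rotation

H_rot <- H %*% R          # rotated loadings
F_rot <- solve(R) %*% Fct  # counter-rotated factors

# Both parameterizations imply identical fitted values ...
max(abs(H %*% Fct - H_rot %*% F_rot))  # ~ 0, up to numerical precision

# ... but mixing H from one rotation with factors from another does not:
max(abs(H %*% Fct - H %*% F_rot))      # clearly nonzero
```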