srlanalytics / bdfm

Bayesian dynamic factor model estimation and predictive statistics including nowcasting and forecasting
MIT License

posterior median for parameters and actual series of interest #83

Open christophsax opened 5 years ago

christophsax commented 5 years ago

Moved discussion from code.

Seth's comment

I think there is some confusion here between using the posterior median for parameters and the posterior median for the actual series of interest. Our MCMC simulations work as follows:

(1) Simulate factors. From these we can also derive simulated fitted values of the observed series (just the loadings times the factors). By default none of this is stored. However, if keep_posterior is specified, then once the algorithm is through the "burn" iterations and into the "simulate" iterations, the program will store the fitted values at each iteration, but only for the series named in keep_posterior.

(2) Simulate parameters. Once the algorithm is through the "burn" iterations and into the "simulate" iterations these are always stored.

(3) Once simulations are complete, we collect posterior medians for all parameter values.

(4) Posterior medians are then fed into the disturbance smoother along with the observations to get fitted values (including nowcasts/forecasts and missing observations) for all series. Because we are using posterior median parameter estimates, these fitted values are subject to identification uncertainty (in addition to the usual uncertainty from parameter estimation and shocks to the model). Simulated observations from (1), by contrast, do not suffer from identification uncertainty if keep_posterior is specified. Thus posterior medians of simulated observables will always be more accurate.
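To make the distinction concrete, here is a minimal R sketch (toy numbers and a made-up loading, not the package's internal code) of what step (1) stores for a series named in keep_posterior: the fitted values (loadings times factors) at each retained iteration, summarised afterwards by their pointwise median.

set.seed(1)
n_obs    <- 100   # time periods
n_draws  <- 500   # retained "simulate" iterations after burn-in
H_indpro <- 0.8   # hypothetical loading of INDPRO on a single factor

# One column of stored fitted values per retained draw
Ystore_indpro <- matrix(NA_real_, n_obs, n_draws)
for (j in seq_len(n_draws)) {
  factor_draw <- as.numeric(arima.sim(list(ar = 0.7), n = n_obs))  # stand-in for a factor draw
  Ystore_indpro[, j] <- H_indpro * factor_draw                     # loadings times factors
}

# Pointwise posterior median of the fitted series
Y_median_indpro <- apply(Ystore_indpro, 1, median)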

Just to be clear, errors in fitted values enter the model in 3 ways (ignoring model misspecification):

(1) Shocks to the model. Suppose we knew the “true” model parameters (B, Q, H and R) --- or from a Bayesian point of view, suppose model parameters had zero variance. Estimates would still not be exactly accurate due to shocks to the model from Q and R (shocks to the transition equation and observation equation respectively).

(2) Parameter estimation. We don’t know the “true” parameters. Our Bayesian estimates instead give us a distribution for the parameters. Because the posterior median is only a point summary of that distribution, using it introduces a second source of error in the model.

(3) Identification uncertainty. Factor models are underidentified --- there are many different rotations of the parameters that fit the observed data equally well. For this reason there is no guarantee our simulations are drawing from the same rotation of the parameters at each iteration. This introduces yet another source of error in the model. However, using the posterior median of simulated observations eliminates this source of error, because each individual simulation is a valid representation of the model --- the problem only arises when parameter draws across many simulations wobble between rotations. Imagine, for example, using the transition matrix B from one rotation of the model and the loadings H from a different rotation: obviously things won’t line up quite right.
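A toy illustration of point (3), with made-up numbers rather than package output: rotating the loadings and the factors together leaves the fitted values unchanged, but combining loadings from one rotation with factors from another does not.

set.seed(2)
k   <- 2                                   # number of factors
H   <- matrix(rnorm(3 * k), 3, k)          # loadings for 3 series
Fac <- matrix(rnorm(k * 5), k, 5)          # factors over 5 periods
Rot <- matrix(c(1, 0.5, -0.3, 1), k, k)    # an arbitrary invertible rotation

fit_original <- H %*% Fac
fit_rotated  <- (H %*% Rot) %*% (solve(Rot) %*% Fac)  # same fit, different parameters
fit_mixed    <- (H %*% Rot) %*% Fac                   # loadings and factors from different rotations

isTRUE(all.equal(fit_original, fit_rotated))  # TRUE
isTRUE(all.equal(fit_original, fit_mixed))    # FALSE: things don't line up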

christophsax commented 5 years ago

I get that est$Y_median is better than est$values. Why are we not storing est$Y_median for all series then? It is only Ystore, not Y_median, that is big.

christophsax commented 5 years ago

And is there any reason to keep Ystore if keep_posterior = "INDPRO"? After all, Ystore is the big object, and it appears we only need Y_median.

Answering myself: the full distribution may be interesting in itself, which is why we may want to plot it.

So we could still use keep_posterior = "INDPRO" for that, but always store Y_median for all series, because these are the better forecasts?

SethOttoQuant commented 5 years ago

We could, but getting Ymedian for every series means we would need to keep Ystore for every series as well, at least in the C++ code (Ymedian is just the median over the whole distribution Ystore, so to get the medians you need the whole distribution). So there is no getting around having to store a huge object.
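In other words (a sketch with assumed dimensions, not the package's internals), Y_median is the elementwise median over the draw dimension of Ystore, so the whole array of draws has to be held for every series you want medians for:

n <- 500; m <- 50; runs <- 1000                             # obs x series x stored draws
Ystore   <- array(rnorm(n * m * runs), dim = c(n, m, runs))
Y_median <- apply(Ystore, c(1, 2), median)                  # obs x series matrix of medians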

christophsax commented 5 years ago

Let's say we have a big model with 50 series; that would take us:


n <- 500      # observations (time periods)
m <- 50       # series
runs <- 1000  # stored posterior draws

obj <- array(rnorm(n * m * runs), dim = c(n, m, runs))
format(object.size(obj), units = "Mb")
#> [1] "190.7 Mb"

Created on 2019-07-15 by the reprex package (v0.3.0)

Since this object is only temporary, this seems fine for modern memory sizes.

I think the advantages are substantial:

christophsax commented 5 years ago

I thought there might be a trick to get the median iteratively, but I guess I was wrong.

Still, something like 70 MB for a usual 20-series model seems fine to me.
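For reference, a quick check of that order of magnitude (assuming 500 observations and 1000 stored draws of doubles):

format(object.size(array(0, dim = c(500, 20, 1000))), units = "Mb")
# roughly 76 Mb for a 500 x 20 x 1000 array of doubles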