srlanalytics / bdfm

Bayesian dynamic factor model estimation and predictive statistics including nowcasting and forecasting
MIT License

Minimal GDP example #35

Closed christophsax closed 5 years ago

christophsax commented 5 years ago

I'm probably feeding it the wrong input, but it should give a nicer error.

library(bdfm)

# drop outliers (optional but gets rid of some weird stuff)
econ_us[abs(scale(econ_us)) > 4] <- NA

logs <- c(
  "W068RCQ027SBEA",
  "PCEDG",
  "PCEND",
  "JTSJOL",
  "INDPRO",
  "CSUSHPINSA",
  "HSN1F",
  "TSIFRGHT",
  "IPG2211S",
  "DGORDER",
  "AMTMNO",
  "CPILFESL",
  "ICSA"
)

diffs <- setdiff(colnames(econ_us), c("A191RL1Q225SBEA", 'W068RCQ027SBEA', "USSLIND"))

m <- dfm(econ_us, factors = 3, pre_differenced = "A191RL1Q225SBEA", logs = logs, diffs = diffs)
#> Draws Non-Stationary
#> Error in EstDFM(B = B_in, Bp = Bp, Jb = Jb, lam_B = lam_B, q = q, nu_q = nu_q, : c++ exception (unknown reason)

Created on 2019-02-17 by the reprex package (v0.2.1)

christophsax commented 5 years ago

Works with two factors but would be good to have a better error. Why are 3 factors not working?

library(bdfm)

# drop outliers (optional but gets rid of some weird stuff)
econ_us[abs(scale(econ_us)) > 4] <- NA

logs <- c(
  "W068RCQ027SBEA",
  "PCEDG",
  "PCEND",
  "JTSJOL",
  "INDPRO",
  "CSUSHPINSA",
  "HSN1F",
  "TSIFRGHT",
  "IPG2211S",
  "DGORDER",
  "AMTMNO",
  "CPILFESL",
  "ICSA"
)

diffs <- setdiff(colnames(econ_us), c("A191RL1Q225SBEA", 'W068RCQ027SBEA', "USSLIND"))

dfm(econ_us, factors = 2, pre_differenced = "A191RL1Q225SBEA", logs = logs, diffs = diffs)
#> 
#> Call:
#> dfm(data = econ_us, factors = 2, logs = logs, diffs = diffs, 
#>     pre_differenced = "A191RL1Q225SBEA")
#> 
#> Bayesian dynamic factor model with 2 factor(s) and 3 lag(s).
#> Log Likelihood: -43317.98  BIC: 87435.47

Created on 2019-02-18 by the reprex package (v0.2.1)

srlanalytics commented 5 years ago

The core problem here is stationarity. I've updated the script and it should hopefully work now. The issues were:

- Dropping outliers should only happen once we've taken logs and differences; since the raw data is non-stationary, "outliers" doesn't have a particularly meaningful interpretation. To deal with this I've added a new argument to the input: outlier_threshold, with a default of 4.
- US government expenditures needs to be log-differenced (it wasn't in the earlier example).
- Identification needs to be strong. With lots of missing data the former default ID routine (pc_full) was often not strong enough, leading to non-stationary draws. I've changed the default to pc_long, which should ensure we're drawing from a stationary distribution.
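
To illustrate the ordering in the first point, here is a minimal base-R sketch, not the package's internal code. The toy series x, the injected spike, and the helper drop_outliers are hypothetical; only the threshold of 4 comes from the comment above. The series is log-differenced first and screened for outliers afterwards.

```r
# Hypothetical toy series: a trending level series with one data glitch.
set.seed(1)
x <- exp(cumsum(rnorm(200, mean = 0.01, sd = 0.02)))
x[100] <- x[100] * 3  # inject a spike in the level

# Hypothetical helper: set observations more than `threshold` standard
# deviations away from the mean to NA.
drop_outliers <- function(y, threshold = 4) {
  z <- (y - mean(y, na.rm = TRUE)) / sd(y, na.rm = TRUE)
  y[!is.na(z) & abs(z) > threshold] <- NA
  y
}

# Transform to a (roughly) stationary series first, then screen it.
dx <- diff(log(x))
dx_clean <- drop_outliers(dx, threshold = 4)

which(is.na(dx_clean))              # positions flagged after differencing
which(is.na(drop_outliers(x, 4)))   # positions flagged on the raw level
```

On the differenced series the spike shows up as two large changes (the jump up and the jump back), which stand out against the small period-to-period noise. On the trending level, the mean and standard deviation are dominated by the trend itself, so the same threshold says little about whether a point is unusual.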

Regarding C++ errors: any C++ error is almost certainly due to non-stationary draws. If a C++ error shows up, the two things to check are:

  1. Is the input data stationary? If not, factors are likely to explode (get arbitrarily large in magnitude and break iterations)
  2. Are factors identified? This should be pretty well taken care of with identification = 'pc_long'. However, there's also the option for users to specify series used to identify factors manually using column names of the data or index values, such as identification = c(2, 3, 4, 5)
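
As a rough picture of what a "non-stationary draw" means, here is a generic textbook check written in base R. It assumes the factors follow a VAR with p lags and is not necessarily what bdfm does internally; the helper is_stationary and the toy coefficient matrices are hypothetical. A draw is stationary when every eigenvalue of the companion matrix lies strictly inside the unit circle.

```r
# Hypothetical helper: TRUE if a VAR(p) with coefficients
# B = [B_1, ..., B_p] (m rows, m * p columns) is stationary, i.e. all
# eigenvalues of its companion matrix have modulus below 1.
is_stationary <- function(B) {
  m <- nrow(B)
  p <- ncol(B) / m
  stopifnot(p == round(p))
  companion <- B
  if (p > 1) {
    # Stack [I, 0] below B to form the (m * p) x (m * p) companion matrix.
    shift <- cbind(diag(m * (p - 1)), matrix(0, m * (p - 1), m))
    companion <- rbind(B, shift)
  }
  max(Mod(eigen(companion, only.values = TRUE)$values)) < 1
}

# Toy examples with 2 factors and 1 lag (made-up numbers).
B_stable   <- matrix(c(0.5, 0.1, 0.0, 0.3), nrow = 2, byrow = TRUE)
B_unstable <- matrix(c(1.1, 0.0, 0.0, 0.3), nrow = 2, byrow = TRUE)

is_stationary(B_stable)    # TRUE: eigenvalues are 0.5 and 0.3
is_stationary(B_unstable)  # FALSE: one eigenvalue is 1.1
```

With an eigenvalue above 1 in modulus, simulated factors grow without bound, which matches the "explode" behaviour described in point 1.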

christophsax commented 5 years ago

It runs through now, and we try to take care of these things automatically.