srlanalytics / bdfm

Bayesian dynamic factor model estimation and predictive statistics including nowcasting and forecasting
MIT License
5 stars 6 forks source link

Automatically take logs and diffs of input series #55

Closed srlanalytics closed 5 years ago

srlanalytics commented 5 years ago

I've written a very simple function to automatically take logs and differences of input data. It uses a simple AR(1) model yt = a y{t-1} + e_t

The null hypothesis is that a = 1 and the rule is as follows:

This is determined using two one-sided tests. Right now I have it set so that if our point estimate for a is more than one standard deviation above 1 (and have no zero or negative values), take logs, but it must be 2 standard deviations below one not to take differences, but we can tweak this as we go.

Seems very simple but works well, at least for our us_gdp.R example.

codecov-io commented 5 years ago

Codecov Report

Merging #55 into master will increase coverage by 0.58%. The diff coverage is 40.83%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master      #55      +/-   ##
==========================================
+ Coverage   59.43%   60.02%   +0.58%     
==========================================
  Files           8        9       +1     
  Lines        1425     1476      +51     
==========================================
+ Hits          847      886      +39     
- Misses        578      590      +12
Impacted Files Coverage Δ
R/MLdfm.R 0% <0%> (ø) :arrow_up:
R/utils.R 3.44% <0%> (+0.5%) :arrow_up:
R/auto_process.R 100% <100%> (ø)
R/bdfm.R 48.85% <13.04%> (-0.29%) :arrow_down:
R/PCdfm.R 71.42% <14.28%> (-1.2%) :arrow_down:
src/BDFM.cpp 70.75% <48.64%> (-0.96%) :arrow_down:
R/dfm.R 55.76% <50%> (-0.24%) :arrow_down:
R/dfm_core.R 62.72% <92.85%> (+3.72%) :arrow_up:

Continue to review full report at Codecov.

Legend - Click here to learn more Δ = absolute <relative> (impact), ø = not affected, ? = missing data Powered by Codecov. Last update fa33e5c...7dcdb4d. Read the comment docs.

christophsax commented 5 years ago

Finally back to bdfm. Should this still be merged?

srlanalytics commented 5 years ago

Yes. Sorry the last thing I pushed failed... I didn't get a chance to have a look on Travis.

On Sat, 16 Mar 2019, 02:21 Christoph Sax, notifications@github.com wrote:

Finally back to bdfm. Should this still be merged?

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/srlanalytics/bdfm/pull/55#issuecomment-473525518, or mute the thread https://github.com/notifications/unsubscribe-auth/AZHEMyHnkZALM2_h5deaSZwQX59z9RwFks5vXOHcgaJpZM4bmTx0 .

christophsax commented 5 years ago

In principle, I like the call below a lot. (I dislike store_idx, how about target?)

However it fails now:

(This also breaks the vignette, and causes travis to fail.)

library(bdfm)
m <- dfm(data = econ_us, factors = 3, pre_differenced = "A191RL1Q225SBEA", store_idx = "A191RL1Q225SBEA")
#> Error in DSmooth(B = B, Jb = Jb, q = q, H = H, R = R, Y = Y, freq = freq[-(1:m)], : matrix multiplication: incompatible matrix dimensions: 5x15 and 1x15

Created on 2019-03-17 by the reprex package (v0.2.1)

Other observations

srlanalytics commented 5 years ago

Regarding the call

m <- dfm(data = econ_us, factors = 3, pre_differenced = "A191RL1Q225SBEA", store_idx = "A191RL1Q225SBEA")

There was a bug in some of the additional output I had wanted from the models... should be fine now.

Regarding the other questions:

x <- rnorm(10000, 0, 1)
y <- 2*x  + rnorm(10000, 0, 1)
#plug in arbritrary missing values
y[c(12,13,15,16,23,45,67,34,35,56,78,1000, 1003, 1213, 1234, 3455, 6788, 9999)] <- NA
x[c(23,34,45,56,67,78,79,100,101,1002, 1234,1245, 2345, 3566, 3468, 4567, 4569)] <- NA

start_time <- Sys.time()
est <- lm(y ~ x)
end_time <- Sys.time()

lm_time <- end_time - start_time

start_time <- Sys.time()
est <- UVreg(x,y)
#> Error in UVreg(x, y): could not find function "UVreg"
end_time <- Sys.time()

UV_time <- end_time - start_time

as.numeric(lm_time)/as.numeric(UV_time)
[1] 3.997609

Created on 2019-03-19 by the reprex package (v0.2.1)

UVreg in this example is about 4 times faster than lm() because it has so much less to do!

christophsax commented 5 years ago

Some more on lm.

see also Dirk’s slides

library(microbenchmark)
library(bdfm)

n <- 100 ; p <- 1
X <- matrix(rnorm(n * p), n, p) # no intercept!
y <- rnorm(n)

# first run needs to load some stuff, outside of competition
first.run <- bdfm:::UVreg(X, y)

microbenchmark(
  lm = {m0 <- lm(y ~ 0 + X)},
  lm.fit = {m1 <- lm.fit(x=X,y=y)},
  .lm.fit = {m2 <- .lm.fit(x=X,y=y)},
  uvreg = {m3 <- bdfm:::UVreg(X, y)}
)
#> Unit: microseconds
#>     expr     min       lq      mean   median       uq      max neval
#>       lm 464.389 489.9860 568.64200 542.9470 585.2460 2858.815   100
#>   lm.fit  30.740  36.4000  48.00131  42.8280  48.2255  478.146   100
#>  .lm.fit   3.653   6.3125   8.22999   7.6855   9.7755   31.620   100
#>    uvreg  22.199  30.8080  37.76772  38.0175  43.4510   79.224   100

Created on 2019-03-19 by the reprex package (v0.2.1)

srlanalytics commented 5 years ago

Because we're not including an intercept term all that's really going on is dividing scalars. UVreg is just calculating as_scalar(trans(x)*y)/as_scalar(trans(x)*(x)). There's no need for any fancy matrix anything here... it's just division. I had only included inv_sympd(trans(X)*X) initially because the default was to include an intercept term (in which case trans(X)*X is a matrix, which it's not here).

Most of the computational time of UVreg() (not that there's much) is actually taken up dropping NA values while keeping the results correct (i.e. we can't just use y <- y[is.finite(y)]). lm.fit() and .lm.fit() are great, but don't deal with the NA issue.

christophsax commented 5 years ago

Thanks, sounds perfect! Sorry for the hold up.