srlanalytics commented 5 years ago

I've written a very simple function to automatically take logs and differences of input data. It uses a simple AR(1) model yt = a y{t-1} + e_t

The null hypothesis is that a = 1 and the rule is as follows:

if a is significantly less than 1, do nothing (the data is already stationary)
if a is significantly greater than 1, take log differences (the data grows exponentially)
if a is not significantly different than one, take differences (it's not stationary) but not logs (it doesn't grow exponentially)

This is determined using two one-sided tests. Right now I have it set so that if our point estimate for a is more than one standard deviation above 1 (and have no zero or negative values), take logs, but it must be 2 standard deviations below one not to take differences, but we can tweak this as we go.

Seems very simple but works well, at least for our us_gdp.R example.

codecov-io commented 5 years ago

Codecov Report

Merging #55 into master will increase coverage by 0.58%. The diff coverage is 40.83%.

@@            Coverage Diff             @@
##           master      #55      +/-   ##
==========================================
+ Coverage   59.43%   60.02%   +0.58%     
==========================================
  Files           8        9       +1     
  Lines        1425     1476      +51     
==========================================
+ Hits          847      886      +39     
- Misses        578      590      +12

Impacted Files	Coverage Δ
R/MLdfm.R	`0% <0%> (ø)`	:arrow_up:
R/utils.R	`3.44% <0%> (+0.5%)`	:arrow_up:
R/auto_process.R	`100% <100%> (ø)`
R/bdfm.R	`48.85% <13.04%> (-0.29%)`	:arrow_down:
R/PCdfm.R	`71.42% <14.28%> (-1.2%)`	:arrow_down:
src/BDFM.cpp	`70.75% <48.64%> (-0.96%)`	:arrow_down:
R/dfm.R	`55.76% <50%> (-0.24%)`	:arrow_down:
R/dfm_core.R	`62.72% <92.85%> (+3.72%)`	:arrow_up:

Continue to review full report at Codecov.

Legend - Click here to learn more Δ = absolute <relative> (impact), ø = not affected, ? = missing data Powered by Codecov. Last update fa33e5c...7dcdb4d. Read the comment docs.

christophsax commented 5 years ago

Finally back to bdfm. Should this still be merged?

srlanalytics commented 5 years ago

Yes. Sorry the last thing I pushed failed... I didn't get a chance to have a look on Travis.

On Sat, 16 Mar 2019, 02:21 Christoph Sax, notifications@github.com wrote:

Finally back to bdfm. Should this still be merged?

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/srlanalytics/bdfm/pull/55#issuecomment-473525518, or mute the thread https://github.com/notifications/unsubscribe-auth/AZHEMyHnkZALM2_h5deaSZwQX59z9RwFks5vXOHcgaJpZM4bmTx0 .

christophsax commented 5 years ago

In principle, I like the call below a lot. (I dislike store_idx, how about target?)

However it fails now:

(This also breaks the vignette, and causes travis to fail.)

library(bdfm)
m <- dfm(data = econ_us, factors = 3, pre_differenced = "A191RL1Q225SBEA", store_idx = "A191RL1Q225SBEA")

#> Error in DSmooth(B = B, Jb = Jb, q = q, H = H, R = R, Y = Y, freq = freq[-(1:m)], : matrix multiplication: incompatible matrix dimensions: 5x15 and 1x15

^{Created on 2019-03-17 by the reprex package (v0.2.1)}

Other observations

Any reason why you wrote UVreg? lm uses qr decomposition which should be more robust.
Any reason why all the .cpp Stuff is in one file. Would be easier if it is organized more like the R folder.

srlanalytics commented 5 years ago

Regarding the call

m <- dfm(data = econ_us, factors = 3, pre_differenced = "A191RL1Q225SBEA", store_idx = "A191RL1Q225SBEA")

There was a bug in some of the additional output I had wanted from the models... should be fine now.

Regarding the other questions:

I used UVreg (univariate regression) because it is super basic and doesn't produce nearly so much output as lm(). UVreg uses Armadillo's inv_sympd(trans(X)*X); I do this because we'll need the inverse of trans(X)*X to calculate the variance of the estimated coefficients. Since we have to use it anyway, I also use it to calculate coefficients themselves instead of solve(trans(X)*X, trans(X)*y). The whole thing is super lightweight, so it hopefully doesn't slow anything down! Here's a little example:

x <- rnorm(10000, 0, 1)
y <- 2*x  + rnorm(10000, 0, 1)
#plug in arbritrary missing values
y[c(12,13,15,16,23,45,67,34,35,56,78,1000, 1003, 1213, 1234, 3455, 6788, 9999)] <- NA
x[c(23,34,45,56,67,78,79,100,101,1002, 1234,1245, 2345, 3566, 3468, 4567, 4569)] <- NA

start_time <- Sys.time()
est <- lm(y ~ x)
end_time <- Sys.time()

lm_time <- end_time - start_time

start_time <- Sys.time()
est <- UVreg(x,y)
#> Error in UVreg(x, y): could not find function "UVreg"
end_time <- Sys.time()

UV_time <- end_time - start_time

as.numeric(lm_time)/as.numeric(UV_time)

[1] 3.997609

^{Created on 2019-03-19 by the reprex package (v0.2.1)}

UVreg in this example is about 4 times faster than lm() because it has so much less to do!

The reason all the .cpp stuff is in one folder is that I haven't yet bothered sorting out the headers / include stuff and the later functions depend on all the earlier ones. But yes, we should break it up into more manageable pieces.

christophsax commented 5 years ago

Some more on lm.

see also Dirk’s slides

library(microbenchmark)
library(bdfm)

n <- 100 ; p <- 1
X <- matrix(rnorm(n * p), n, p) # no intercept!
y <- rnorm(n)

# first run needs to load some stuff, outside of competition
first.run <- bdfm:::UVreg(X, y)

microbenchmark(
  lm = {m0 <- lm(y ~ 0 + X)},
  lm.fit = {m1 <- lm.fit(x=X,y=y)},
  .lm.fit = {m2 <- .lm.fit(x=X,y=y)},
  uvreg = {m3 <- bdfm:::UVreg(X, y)}
)
#> Unit: microseconds
#>     expr     min       lq      mean   median       uq      max neval
#>       lm 464.389 489.9860 568.64200 542.9470 585.2460 2858.815   100
#>   lm.fit  30.740  36.4000  48.00131  42.8280  48.2255  478.146   100
#>  .lm.fit   3.653   6.3125   8.22999   7.6855   9.7755   31.620   100
#>    uvreg  22.199  30.8080  37.76772  38.0175  43.4510   79.224   100

^{Created on 2019-03-19 by the reprex package (v0.2.1)}

srlanalytics commented 5 years ago

Because we're not including an intercept term all that's really going on is dividing scalars. UVreg is just calculating as_scalar(trans(x)*y)/as_scalar(trans(x)*(x)). There's no need for any fancy matrix anything here... it's just division. I had only included inv_sympd(trans(X)*X) initially because the default was to include an intercept term (in which case trans(X)*X is a matrix, which it's not here).

Most of the computational time of UVreg() (not that there's much) is actually taken up dropping NA values while keeping the results correct (i.e. we can't just use y <- y[is.finite(y)]). lm.fit() and .lm.fit() are great, but don't deal with the NA issue.

christophsax commented 5 years ago

Thanks, sounds perfect! Sorry for the hold up.

srlanalytics / bdfm

Automatically take logs and diffs of input series #55

Codecov Report

Other observations