Open olshena opened 4 years ago
I've added a function called regressAR, which uses regression to build an AR model. It can incorporate multiple variables specified via a data frame called x. I wasn't exactly sure how to turn the results from the regression into a buildAR object, so I just used them as input for a final call to buildAR. Please check the code to see if this works as you envision.
Also, maybe we should pull the lagging indicator functionality out as a separate issue?
Yes, let's worry about lagging as a separate issue.
I am struggling to follow the logic of the code. What it is supposed to do is use regression to combine separate AR models into a single prediction. The way the logic makes sense to me is that it should be a middle layer between buildAR and predictAR in simulateAR. buildAR and predictAR would loop over multiple predictors. The regression parameters would be fit after buildAR and would be arguments to predictAR. Because this is not what is happening, I suspect that the regression is being brought in too early.
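To make the layering concrete, here is a rough sketch of the flow I have in mind. build_one and predict_one are hypothetical stand-ins for buildAR and predictAR on a single predictor (here just a through-origin ratio), and the data are simulated, so this only illustrates where the regression step sits, not how the package does it.

```r
# Hypothetical stand-ins for buildAR/predictAR on a single predictor.
# Each "model" here is just a through-origin fit of y(t) on x(t-1).
build_one <- function(y, x) {
  n <- length(y)
  phi <- sum(y[-1] * x[-n]) / sum(x[-n]^2)  # least-squares slope, no intercept
  list(phi = phi, fitted = phi * x[-n], last_x = x[n])
}

predict_one <- function(model) model$phi * model$last_x

set.seed(1)
n <- 60
hosp <- 100 + cumsum(rnorm(n))              # toy outcome series
x <- data.frame(hosp = hosp,                # autoregressive term
                other = runif(n, 5, 10))    # toy second predictor

# 1) build: one model per predictor column
models <- lapply(x, function(col) build_one(hosp, col))

# 2) middle layer: regress the outcome on the per-predictor fitted values
fits <- as.data.frame(lapply(models, `[[`, "fitted"))
fits$y <- hosp[-1]
betas <- coef(lm(y ~ 0 + ., data = fits))   # no intercept

# 3) predict: the betas are passed in and weight the per-predictor predictions
preds <- vapply(models, predict_one, numeric(1))
combined <- sum(betas * preds)
```

The point of the ordering is that the betas are estimated once, after every per-predictor model is built, and are then only consumed at prediction time.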
Here is a follow up to our conversation of 8/24.
1) The training set fit is currently proper. The data is submitted in column form. Each column is fit separately and then brought together with least squares (with no intercept).
2) After the training set is fit, the residuals need to be calculated and then lowess can be used to model the mean to residual variance relationship.
3) The buildarobj needs to contain all columns of data, the betas from lowess, the phis, and the residual information.
4) In the predictions the mean and residuals are independent except the residuals are proportional to the mean (via lowess).
5) The prediction mean model can operate on each variable separately and then bring the data together using the least squares coefficients. The predicted value for each variable can be estimated across all pdays before bringing them together if that is the easier way to code.
6) Once the means are estimated, the residuals can be added to the means by sampling with variance proportional to the predicted mean. Because these residuals are on the scale of the multivariable model, the multivariable mean must be estimated before the residuals can be estimated.
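Steps 2, 4 and 6 could look roughly like the sketch below. It assumes the fitted means and residuals from the training fit are already available (simulated here), smooths the squared residuals against the mean with lowess, and then samples residuals around predicted means that are assumed to have already been combined with the least squares coefficients. The names var_at, pred_mean and sims are made up for illustration.

```r
set.seed(42)
# Toy training fit: fitted means and residuals whose spread grows with the mean.
fitted_mean <- seq(50, 200, length.out = 120)
resid <- rnorm(120, sd = 0.05 * fitted_mean)

# Step 2: smooth squared residuals against the fitted mean with lowess.
var_fit <- lowess(fitted_mean, resid^2, f = 2/3)

# Variance at a new mean, interpolated from the lowess curve.
# rule = 2 holds the end values for means outside the training range.
var_at <- function(m) {
  v <- approx(var_fit$x, var_fit$y, xout = m, rule = 2)$y
  pmax(v, 0)  # guard against a negative smoothed variance
}

# Steps 4 and 6: given the multivariable predicted means for each prediction
# day, draw nsim residuals per day and add them to the means.
pred_mean <- c(120, 125, 131, 138)  # assumed already combined via the betas
nsim <- 1000
sims <- sapply(pred_mean, function(m) m + rnorm(nsim, sd = sqrt(var_at(m))))
dim(sims)  # nsim rows, one column per prediction day
```

Because var_at is evaluated at the combined mean, the residual draw cannot happen until the multivariable mean for that day exists, which is the ordering constraint in step 6.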
A few problems with the current version:
I fixed the bugs and will move the rounding option to its own issue.
I tested the new function. simulateAR runs fine. I got this error while running regressAR:
regress.sep1.fourweek <- regressAR(vec=chosp, x=x, wsize=28, method = "unweighted", pdays=14, nsim=10000, seed=12345, output_type = "max")
Error: Problem with `mutate()` input `y_hat_mean`.
✖ object 'w' not found
ℹ Input `y_hat_mean` is `predict(...)`.
ℹ The error occurred in group 1: pred_set = 1.
Run `rlang::last_error()` to see where the error occurred.
This should be fixed and pushed to github now.
It looks like it could be right. Regression coefficients with a random second variate are 0.995 for hospitalizations and -0.014 for the second variate. The second variable predictions are weird, but I need to generate better data.
The basic idea of adding variables to the model is that each variable, including the autoregressive outcome (so far hospitalizations), will contribute predictions, and those predictions will be weighted by a second least squares step.
I am imagining 3 kinds of variables:
The key for the manuscript is modeling test positivity along with hospitalizations. I think we should start with test positivity as nonlinear and non-stationary. We could model it just like hospitalizations except with the initial phi(t)=y(t)/x(t-1), where x(t-1) is for test positivity. We would have a kernel for test positivity just like we do for hospitalizations, but it should be specified separately.
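As a rough illustration of the phi(t)=y(t)/x(t-1) idea, the sketch below computes the initial ratios on simulated data and smooths them with a kernel. The exponential-decay kernel, the rate value and the variable names are assumptions for the example only; the actual kernel for test positivity still needs to be chosen.

```r
set.seed(7)
n <- 40
pos <- runif(n, 0.04, 0.12)                          # toy test positivity x(t)
hosp <- 1000 * c(0.08, pos[-n]) + rnorm(n, sd = 2)   # toy hospitalizations y(t)

# Initial ratio phi(t) = y(t) / x(t-1), defined for t = 2, ..., n.
phi <- hosp[-1] / pos[-n]

# Hypothetical kernel: exponentially decaying weights over the last wsize days,
# specified separately from whatever kernel is used for hospitalizations.
wsize <- 28
rate <- 0.1
recent <- tail(phi, wsize)
w <- exp(-rate * rev(seq_along(recent) - 1))  # most recent day gets weight 1
phi_hat <- sum(w * recent) / sum(w)

# One-step mean contribution from test positivity.
next_mean <- phi_hat * pos[n]
```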
One more issue is that test positivity is a lagging indicator. We should be able to specify a lag as a parameter to the model. This makes the data structure more complicated. Also, we want to make sure that for a variable with a lag we use the real data rather than simulated data where we can. For instance, if we decided hospitalizations lagged test positivity by two weeks and the goal was to predict hospitalizations in three weeks, the first two weeks of predictions would have real test positivity data and only the third would need to be simulated.
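The bookkeeping for a lagged variable might be handled with a small helper like the one below. lagged_covariate is a hypothetical name, and the observed and simulated series are toy values; the idea is only to show which prediction days can use real data for a given lag.

```r
# Covariate values needed for each prediction day when the outcome at day t
# depends on the covariate at day t - lag. Observed values are used where
# they exist; simulated values fill in only beyond the observed range.
lagged_covariate <- function(observed, simulated, lag, pdays) {
  n <- length(observed)
  idx <- n + seq_len(pdays) - lag  # covariate day needed for each horizon
  if (any(idx < 1)) stop("lag reaches before the start of the observed data")
  n_sim_needed <- max(0, max(idx) - n)
  if (length(simulated) < n_sim_needed) stop("not enough simulated values")
  full <- c(observed, simulated)
  data.frame(horizon = seq_len(pdays),
             value = full[idx],
             source = ifelse(idx <= n, "observed", "simulated"))
}

obs <- seq(0.05, 0.10, length.out = 60)  # toy test-positivity series
sim <- rep(0.11, 7)                      # toy simulated continuation
out <- lagged_covariate(obs, sim, lag = 14, pdays = 21)
table(out$source)
```

With a 14-day lag and a 21-day horizon, the first 14 prediction days draw on observed values and only the last 7 need simulated ones, matching the two-weeks-real, one-week-simulated example.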
Recall that we have separated the mean function from the error function, so error would be added to our mean predictions after weighting all the predictions (by least squares).
For the linear variables, we will need to decide whether to keep them fixed or do (linear?) extrapolation in the predictions.
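The two options for a linear variable can be compared on a toy series like this. The values and names are invented for illustration; nothing here reflects a decision about which option the package should use.

```r
x_lin <- c(10.0, 10.5, 11.1, 11.4, 12.0, 12.6)  # toy linear covariate
pdays <- 3
n <- length(x_lin)

# Option 1: hold the last observed value fixed over the prediction window.
fixed <- rep(x_lin[n], pdays)

# Option 2: extend a straight-line trend fitted to the observed days.
trend <- lm(x_lin ~ day, data = data.frame(day = seq_len(n), x_lin = x_lin))
extrap <- unname(predict(trend, newdata = data.frame(day = n + seq_len(pdays))))
```

Holding the value fixed is the safer choice when the trend is weak or the horizon is long, while extrapolation keeps the covariate moving in step with its recent history.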