olshena / COVIDNearTerm

Adding additional variables to autoregressive model #18

Open olshena opened 4 years ago

olshena commented 4 years ago

The basic idea of adding variables to the model is that each variable, including the autoregressive outcome (so far hospitalizations), will contribute predictions, and those predictions will be weighted by a second least squares step.
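A minimal sketch of that second least squares step, using made-up data. The object names (pred_hosp, pred_pos) are hypothetical and not part of COVIDNearTerm, and the no-intercept fit is an assumption taken from the later discussion in this thread:

```r
# Toy data: an outcome series and predictions from two separate models.
# All names here are hypothetical, not COVIDNearTerm objects.
set.seed(1)
n <- 60
y <- cumsum(rnorm(n, mean = 1)) + 100   # observed outcome (e.g. hospitalizations)
pred_hosp <- y + rnorm(n, sd = 2)       # prediction from the autoregressive outcome
pred_pos  <- y + rnorm(n, sd = 10)      # noisier prediction from a second variable

# Second step: weight the per-variable predictions by least squares, no intercept.
fit <- lm(y ~ 0 + pred_hosp + pred_pos)
w <- coef(fit)

# The combined prediction is the weighted sum of the individual predictions.
combined <- as.vector(cbind(pred_hosp, pred_pos) %*% w)
```

The weights in w play the role of the regression coefficients that would be carried forward into prediction.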

I am imagining 3 kinds of variables:

  1. Nonlinear and non-stationary. Hospitalizations are one such variable.
  2. Non-stationary linear variables. The impact of this variable would change over time, but the variable is linear in its coefficient. Month of the pandemic is one such variable.
  3. Stationary linear variables. The impact of this variable would be the same across the series. Median age of those hospitalized is a possible variable of this type.

The key for the manuscript is modeling test positivity along with hospitalizations. I think we should start with test positivity as nonlinear and non-stationary. We could model it just like hospitalizations except with the initial phi(t)=y(t)/x(t-1), where x(t-1) is for test positivity. We would have a kernel for test positivity just like we do for hospitalizations, but it should be specified separately.
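To make the initial ratio concrete, here is a small sketch of phi(t)=y(t)/x(t-1) on toy numbers. The helper initial_phi and the data are hypothetical, and this is only the starting ratio, not the kernel step:

```r
# Hypothetical toy series; not data from the package.
hosp <- c(100, 104, 110, 121, 127)            # hospitalizations, y
pos  <- c(0.050, 0.054, 0.060, 0.063, 0.070)  # test positivity, x

# phi(t) = y(t) / x(t-1): each outcome value divided by the previous value of x.
initial_phi <- function(y, x = y) {
  stopifnot(length(y) == length(x), length(y) >= 2)
  n <- length(y)
  y[2:n] / x[1:(n - 1)]
}

phi_hosp <- initial_phi(hosp)        # autoregressive case: x is hospitalizations itself
phi_pos  <- initial_phi(hosp, pos)   # x is test positivity from the previous time point
```

Because the two phi series are on very different scales, it seems natural that each would get its own kernel, as suggested above.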

One more issue is that test positivity is a lagging indicator. We should be able to specify a lag as a parameter to the model. This makes the data structure more complicated. Also, we want to make sure that for a variable with a lag we use the real data rather than simulated data where we can. For instance, if we decided hospitalizations lagged test positivity by two weeks and the goal was to predict hospitalizations three weeks out, the first two weeks of predictions would have real test positivity data and only the third would need to be simulated.
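One way the real-versus-simulated bookkeeping could work is sketched below. The function lagged_covariate and its arguments are hypothetical, and the sketch assumes daily data with the lag expressed in days:

```r
# For each forecast day, pick the covariate value `lag` days earlier:
# observed data when that day is in the past, simulated data otherwise.
lagged_covariate <- function(observed, simulated, lag, pdays) {
  n <- length(observed)
  idx <- n + seq_len(pdays) - lag   # covariate time index needed for each forecast day
  stopifnot(all(idx >= 1), max(idx) - n <= length(simulated))
  is_real <- idx <= n
  values <- numeric(pdays)
  values[is_real] <- observed[idx[is_real]]
  values[!is_real] <- simulated[idx[!is_real] - n]
  data.frame(day = seq_len(pdays), value = values, real = is_real)
}

obs <- seq(0.05, by = 0.001, length.out = 60)  # 60 days of observed test positivity (toy)
sim <- rep(0.2, 21)                            # placeholder simulated values
out <- lagged_covariate(obs, sim, lag = 14, pdays = 21)
```

With a 14-day lag and a 21-day horizon, the first 14 forecast days draw on observed test positivity and only the last 7 need simulated values.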

Recall that we have separated the mean function from the error function, so error would be added to our mean predictions after weighting all the predictions (by least squares).

For the linear variables, we will need to decide whether to keep them fixed or do (linear?) extrapolation in the predictions.

kikapp commented 4 years ago

I've added a function called regressAR which uses regression to build an AR model. It can incorporate multiple variables specified via a data frame called x. I wasn't exactly sure how to turn the results from the regression into a buildAR object, so I just used them as input for a final call to buildAR. Please check code to see if this works as you envision.

Also, maybe we should pull the lagging indicator functionality out as a separate issue?

olshena commented 4 years ago

Yes, let's worry about lagging as a separate issue.

olshena commented 4 years ago

I am struggling to follow the logic of the code. What it is supposed to do is use regression to combine separate AR models into a single prediction. The way the logic makes sense to me is that it should be a middle layer between buildAR and predictAR in simulateAR. buildAR and predictAR would loop over multiple predictors. The regression parameters would be fit after buildAR and would be arguments to predictAR. Because this is not what is happening, I suspect that the regression is being brought in too early.

olshena commented 4 years ago

Here is a follow up to our conversation of 8/24.

  1. The training set fit is currently proper. The data is submitted in column form. Each column is fit separately and then brought together with least squares (with no intercept).
  2. After the training set is fit, the residuals need to be calculated and then lowess can be used to model the mean to residual variance relationship.
  3. The buildarobj needs to contain all columns of data, the betas from lowess, the phis, and the residual information.
  4. In the predictions the mean and residuals are independent, except that the residuals are proportional to the mean (via lowess).
  5. The prediction mean model can operate on each variable separately and then bring the data together using the least squares coefficients. The predicted value for each variable can be estimated across all pdays before bringing them together if that is the easier way to code.
  6. Once the means are estimated, the residuals can be added to the means by sampling with variance proportional to the predicted mean. Because these residuals are on the scale of the multivariable model, the multivariable mean must be estimated before the residuals can be estimated.
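The residual steps could be sketched as follows. This is an illustration under stated assumptions, not the package code: the data are simulated, the names are hypothetical, squared residuals are smoothed against the fitted mean, and normal sampling is assumed for the residuals:

```r
set.seed(42)
n <- 200
fitted_mean <- seq(50, 500, length.out = n)   # hypothetical training-set fitted means
resid <- rnorm(n, sd = 0.05 * fitted_mean)    # residual spread grows with the mean

# Model the mean to residual variance relationship with lowess on squared residuals.
lo <- lowess(fitted_mean, resid^2)

# Interpolate the smoothed variance at new predicted means
# (rule = 2 uses the nearest endpoint value outside the training range).
pred_mean <- c(80, 200, 450)
pred_var <- approx(lo$x, lo$y, xout = pred_mean, rule = 2)$y
pred_var <- pmax(pred_var, 0)                 # guard against negative smoothed values

# Sample residuals independently of the mean, scaled by the estimated variance.
nsim <- 1000
sims <- sapply(seq_along(pred_mean), function(j) {
  pred_mean[j] + rnorm(nsim, sd = sqrt(pred_var[j]))
})
```

Here pred_mean stands in for the combined multivariable mean, which has to be available before the residuals are drawn.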

olshena commented 4 years ago

A few problems with the current version:

  1. I think the latest code has not been committed since the latest commits are from yesterday. I know the intercept was taken out of the regression fit. (Nevermind, I see an update was made, not sure how I missed that.)
  2. When using simulateAR, which I think should still work, I get: Error in buildAR(vec, x, wsize, method = method, seed = seed) : object 'n_vec' not found
  3. When using regressAR, which I think is what should be called now, I get: Error in predict_variable %in% c("Min", "FirstQu", "Median", "Mean", "ThirdQu", :
  4. I realized that we need to give an option for whether a variable should be rounded. I had envisioned this for hospitalizations, which are always integers, but we should be more general since test positivity would not be an integer.

kikapp commented 4 years ago

I fixed the bugs and will move the rounding option to its own issue.

olshena commented 4 years ago

I tested the new function. simulateAR runs fine. I got this error while running regressAR:

regress.sep1.fourweek <- regressAR(vec=chosp, x=x, wsize=28, method = "unweighted", pdays=14, nsim=10000, seed=12345, output_type = "max")
Error: Problem with mutate() input y_hat_mean.
✖ object 'w' not found
ℹ Input y_hat_mean is predict(...).
ℹ The error occurred in group 1: pred_set = 1.
Run rlang::last_error() to see where the error occurred.

kikapp commented 4 years ago

This should be fixed and pushed to github now.

olshena commented 4 years ago

It looks like it could be right. With a random second variate, the regression coefficients are 0.995 for hospitalizations and -0.014 for the second variate. The predictions for the second variable are weird, but I need to generate better data.