signaturescience / focustools

Forecasting COVID-19 in the US
https://signaturescience.github.io/focustools/
GNU General Public License v3.0

initial time series model parameterization #7

Closed vpnagraj closed 3 years ago

vpnagraj commented 3 years ago

consensus among the team: we're going to start with some kind of time series approach for the first model

as we move into initial implementation we need to settle on the specifics of which kind of model (ARIMA, ETS, etc) and which parameters are passed to the model

the fable package provides a really convenient framework to evaluate time series models side-by-side, and @stephenturner has put together some working code:

https://github.com/signaturescience/focustools/blob/master/scratch/fable-scratch.R

i've extended that code in a separate script to view model accuracy measures (we can clean up / combine these two fable-*-scratch.R scripts later):

https://github.com/signaturescience/focustools/blob/master/scratch/fable-accuracy-scratch.R

@chulmelowe after our conversation on the call i'm assigning this issue to you. i guess ultimately we want 1) the most accurate model (related to https://github.com/signaturescience/focustools/issues/5) and 2) a model that is sensible in terms of assumptions about infectious disease dynamics

maybe # 1 is more important than # 2? i can see it both ways.

either way it's probably worth emphasizing that while we want to avoid implementing a short-sighted approach, ultimately this is the initial implementation of our model. we have budgeted even more time to evaluate and improve the model in Task 3.

chulmelowe commented 3 years ago

I haven't done much with the incident deaths yet, but I did take a look at the incident cases. Fable is selecting an ARIMA(0, 1, 4) model (i.e., no autoregressive component, first-order differencing, and a moving average component of order four, which uses the four most recent forecast errors). Looking at the various plots of the data I'd usually use to guide model selection for an ARIMA model, I agree with the differencing, but I would have specified an autoregressive component with a lag of six and no moving average component. My guess is that fable is selecting models using the AIC or some similar fit index, and that the model it's selecting is the simplest model with similar fit. I don't necessarily disagree with doing it that way, but I do wonder what we're losing in terms of model flexibility.
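
To make the ARIMA(0, 1, 4) notation concrete, here is a minimal Python sketch of its two active pieces: first-order differencing (the middle `1`) and a one-step moving average prediction built from the four most recent forecast errors (the trailing `4`). This is not fable's implementation. The function names, coefficients, residuals, and case counts are all made up for illustration, and a real fit would estimate the coefficients from the data.

```python
def difference(series):
    """First-order differencing: the d = 1 part of ARIMA(p, 1, q)."""
    return [b - a for a, b in zip(series, series[1:])]


def ma_one_step(residuals, thetas, mean=0.0):
    """One-step MA(q) prediction for the differenced series.

    Combines the q most recent forecast errors, most recent first:
    mean + theta_1 * e_{t-1} + ... + theta_q * e_{t-q}.
    """
    q = len(thetas)
    if len(residuals) < q:
        raise ValueError("need at least q past residuals")
    recent = residuals[-q:][::-1]  # e_{t-1}, e_{t-2}, ..., e_{t-q}
    return mean + sum(theta * e for theta, e in zip(thetas, recent))


# Made-up weekly case counts, for illustration only.
cases = [100, 120, 150, 190, 240]
diffs = difference(cases)

# Hypothetical MA(4) coefficients and past forecast errors (not fitted values).
thetas = [0.4, 0.3, 0.2, 0.1]
residuals = [2.0, -1.0, 0.5, 1.0]

# Predict the next change, then undo the differencing to get the next level.
predicted_change = ma_one_step(residuals, thetas, mean=sum(diffs) / len(diffs))
next_level = cases[-1] + predicted_change
```

An AR(6) alternative would instead regress the differenced series on its own six previous values, which is why the two specifications can fit similarly while implying different dynamics.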

vpnagraj commented 3 years ago

@chulmelowe thanks for diving into this.

yeah from what i understand fable::ARIMA() is doing exactly what you suggested ... searching the parameter space to minimize the AIC:

https://fable.tidyverts.org/reference/ARIMA.html

note to self: it looks like we can optionally try to optimize other information criteria and/or constrain the parameter space
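
for reference, a rough Python sketch of what choosing a model by information criterion amounts to. The AIC, AICc, and BIC formulas are the standard ones; the candidate names, log-likelihoods, and parameter counts below are made up, and `select_model` is a hypothetical helper, not part of fable. Which criterion fable uses by default, and how it is switched, should be checked against the documentation linked above.

```python
import math


def aic(log_lik, k):
    """Akaike information criterion: 2k - 2 ln(L)."""
    return 2 * k - 2 * log_lik


def aicc(log_lik, k, n):
    """AIC with the small-sample correction term."""
    if n - k - 1 <= 0:
        raise ValueError("AICc needs n > k + 1")
    return aic(log_lik, k) + (2 * k * (k + 1)) / (n - k - 1)


def bic(log_lik, k, n):
    """Bayesian information criterion: k ln(n) - 2 ln(L)."""
    return k * math.log(n) - 2 * log_lik


def select_model(candidates, n, criterion="aic"):
    """Return the candidate name with the lowest criterion value.

    candidates maps a model name to (log-likelihood, number of parameters).
    """
    scorers = {
        "aic": lambda ll, k: aic(ll, k),
        "aicc": lambda ll, k: aicc(ll, k, n),
        "bic": lambda ll, k: bic(ll, k, n),
    }
    score = scorers[criterion]
    return min(candidates, key=lambda name: score(*candidates[name]))


# Made-up fits: a slightly better likelihood does not pay for two extra parameters.
candidates = {
    "ARIMA(0,1,4)": (-250.0, 5),
    "ARIMA(6,1,0)": (-248.5, 7),
}
best = select_model(candidates, n=30, criterion="aic")
```

all three criteria penalize parameter count, which fits the guess above that the search lands on the simplest model among those with similar fit.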

vpnagraj commented 3 years ago

@chulmelowe anything else you want to look at here? are we comfortable with the auto ARIMA from fable ... at least for the initial implementation?

if we're good for now feel free to close this issue / mark the project task "done" on the board (https://github.com/signaturescience/focustools/projects/1)

chulmelowe commented 3 years ago

No, I think I'm comfortable using the fable parameterization for now. One other thing I want to look at is what it's doing with the seasonal component of the ARIMA model, but I'll try to get to that today so we can close this out.

chulmelowe commented 3 years ago

Based on the fable documentation, it won't implement the seasonal component without two years of data or the user specifying the seasonal component in the model. Based on that and Pete's recent results, I think the automated model selection is working well as an initial parameterization.

stephenturner commented 3 years ago

Let's look at log transforming @chulmelowe

vpnagraj commented 3 years ago

we've used the log transform (https://github.com/signaturescience/focustools/commit/225292e234950c603d505aac1f21fbd3415be30e) ... but now are not using log transform and restricting the parameter search space instead (808796ceff5c7723f3c105f2ae590f89b2f68124)
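
for context on why the log transform came up at all, here is a small Python sketch with made-up counts and a naive drift forecast standing in for the real model. It is not what either commit above does (their details are not reproduced here); it only shows that forecasting on the log scale and back-transforming keeps forecasts positive, where the same forecast on the raw scale can go negative.

```python
import math


def to_log_scale(counts):
    """Natural log of strictly positive counts (zeros would need an offset)."""
    if any(c <= 0 for c in counts):
        raise ValueError("log transform needs strictly positive counts")
    return [math.log(c) for c in counts]


def from_log_scale(values):
    """Back-transform log-scale values to the original scale."""
    return [math.exp(v) for v in values]


def drift_forecast(series, horizon):
    """Naive drift forecast: extend the average step between observations."""
    if len(series) < 2:
        raise ValueError("need at least two observations")
    step = (series[-1] - series[0]) / (len(series) - 1)
    return [series[-1] + step * h for h in range(1, horizon + 1)]


# Made-up declining weekly counts, for illustration only.
deaths = [40, 25, 12, 5, 2]

raw_forecast = drift_forecast(deaths, 3)  # falls below zero
log_forecast = from_log_scale(drift_forecast(to_log_scale(deaths), 3))  # stays positive
```

one caveat: a back-transformed point forecast is closer to a median than a mean on the original scale, so the two approaches are not directly interchangeable.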

in the interest of organizing the issues let's close this one for now. as other time series parameterization questions come up let's create new issues.