impactlab / caltrack

Shared repository for documentation and testing of CalTRACK methods
http://docs.caltrack.org

Daily methods testing and decision criteria #59

Closed matthewgee closed 7 years ago

matthewgee commented 7 years ago

A Google Doc version of this can be found here.

Over the course of daily methods testing, there are a variety of decisions that turn on the overall sensitivity of savings estimation to one choice vs. another, given the data we are likely to see in California.

In order to make defensible, empirically motivated decisions, we need to develop testing criteria for methodological choices that don't have an obvious answer or that involve clear tradeoffs. We propose that the evaluation of baselining method choices use the following testing and decision procedure for selecting final specifications.

houghb commented 7 years ago

As discussed on last week's call, I've updated this document (see the link above).

I think we need further discussion of how we quantify "significant improvement" in the final bullet point to make sure that a change/addition/improvement is worth doing. Can we impose a threshold for improvements to be considered significant?

jfarland commented 7 years ago

I was able to talk to Mimi last week. I would like to caution that I was only able to describe the issue at hand to her at a very high level (i.e., model selection criteria in an M&V context).

Here are her thoughts: "If we’re trying to say how well does a model predict, I don’t know why we look at R2 at all. We can have a pretty good R2 with lots of variability and have a poor prediction. We can have a pretty good R2 with average DD far from normal and have a poor prediction. We can have a bad R2 with minimal seasonal dependence and minimal variability and have a great prediction. This isn’t about testing whether there’s a relationship to DD, which is what R2 tells us. It’s about how useful the model is as a predictor.

The thing I would look at is the accuracy of predicted consumption at the DD values we care about. Either SE(NAC) or possibly some weighted combo of SE(predicted) at different seasons. Or using MSEP instead of SE."
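
To make that concrete, here is a minimal sketch of the kind of check she is describing: fit a simple degree-day model on a baseline window and score it by out-of-sample MSEP on a held-out window rather than by in-sample R2. The data, the 270-day split, and the intercept + HDD model form are made up for illustration and are not a CalTRACK specification.

```python
import numpy as np

# Hypothetical daily data: heating degree days and usage with noise.
rng = np.random.default_rng(0)
hdd = rng.uniform(0, 30, size=365)
usage = 10.0 + 0.8 * hdd + rng.normal(0, 3, size=365)

# Fit an intercept + HDD model on the first 270 days; hold out the rest.
split = 270
X_fit = np.column_stack([np.ones(split), hdd[:split]])
beta, *_ = np.linalg.lstsq(X_fit, usage[:split], rcond=None)

# In-sample R2 (what a goodness-of-fit view emphasizes).
fitted = X_fit @ beta
ss_res = np.sum((usage[:split] - fitted) ** 2)
ss_tot = np.sum((usage[:split] - usage[:split].mean()) ** 2)
r2 = 1 - ss_res / ss_tot

# Out-of-sample mean squared error of prediction (MSEP) on the held-out days
# (closer to what matters for predicting consumption).
X_holdout = np.column_stack([np.ones(365 - split), hdd[split:]])
predicted = X_holdout @ beta
msep = np.mean((usage[split:] - predicted) ** 2)

print(f"in-sample R2: {r2:.3f}, out-of-sample MSEP: {msep:.2f}")
```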

From my perspective, the decision is whether to focus on the predictive power of these models versus the explanatory power of the model fits. I think Mimi and I both agree on the former as opposed to the latter here (Ken might have a different take). The forecasting literature has a lot of relevant empirical evidence on evaluating predictive accuracy. I might suggest the family of "scaled" error metrics that have the same advantages as percentages (e.g., MAPE has a common unit across time series) but which are scaled by some common quantity across time series (e.g., dividing the out-of-sample error metric by the within-sample MAE).
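
A rough sketch of one such scaled metric, under the assumption that we divide the out-of-sample MAE by the model's within-sample MAE (MASE, by contrast, conventionally scales by the in-sample MAE of a naive forecast). The array names are placeholders, not real project data:

```python
import numpy as np

def scaled_mae(actual_out, predicted_out, actual_in, fitted_in):
    """Out-of-sample MAE divided by within-sample MAE.

    A value near 1 means the model predicts the held-out period about as
    well as it fit the baseline period; values much larger than 1 flag
    deteriorating predictive accuracy. The ratio is unitless, so it can be
    compared across meters or projects with very different usage levels.
    """
    mae_out = np.mean(np.abs(np.asarray(actual_out) - np.asarray(predicted_out)))
    mae_in = np.mean(np.abs(np.asarray(actual_in) - np.asarray(fitted_in)))
    return mae_out / mae_in

# Placeholder usage:
# ratio = scaled_mae(usage_holdout, model_predictions, usage_baseline, model_fitted)
```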

I'm interested to hear what others think before commenting too much more.

mcgeeyoung commented 7 years ago

Closing with the proviso that others can reopen and comment if desired.