openeemeter / caltrack

Shared repository for documentation and testing of CalTRACK methods
http://docs.caltrack.org
Creative Commons Zero v1.0 Universal

[CLOSED] Beta tester input on program rules for how aggregations for payable savings should be determined #33

hshaban closed this issue 6 years ago

hshaban commented 6 years ago

Issue by matthewgee Thursday Jan 12, 2017 at 17:05 GMT Originally opened as https://github.com/impactlab/caltrack/issues/32


The implementation of CalTRACK requires program rules for payable savings. Given the availability of historical data to test different rules, the group can offer an opinion on various alternatives.

With this in mind, aggregation rules for monthly site-based savings methods in support of the P4P program were discussed briefly in meeting 18 of the working group, where the inverse variance weighted mean (IVWM) was proposed as an approach because it deals with the fact that homes with poor model fit and high positive or negative savings can skew the portfolio-level average.

This was written up and opened for comment by technical working group members in meeting 19 (https://github.com/impactlab/caltrack-betatest/tree/master/aggregation).

hshaban commented 6 years ago

Comment by jbackusm Monday Jan 30, 2017 at 23:26 GMT


The main observation against IVWM was that the variance is often smaller for homes with smaller overall usage (if for no other reason than the usage being constrained to positive values), which causes the IVWM results to be focused on smaller homes. This is potentially problematic because smaller homes also tend to have smaller savings--the homes with the most potential for savings are often the ones with higher pre-existing usage, and therefore higher variance.

hshaban commented 6 years ago

Comment by matthewgee Tuesday Jan 31, 2017 at 21:48 GMT


@jbackusm Exactly. This is the crux of the issue. IVWM, which comes from the meta-analysis literature, provides you with the most efficient (minimum variance) weighted average, and if the variances were all identical, it would be equivalent to the unweighted sample mean. Having a minimum variance weighted average benefits the program because it will lead to more stable payable savings on portfolios across aggregators and across time, all else equal.

The downside is that, if the forecast errors are correlated with attributes like building size that are also correlated with the random variable, the observations aren't independent of the variance and we'll end up with an efficient estimator (low variance) that is also biased downward.

What this means for the program design is that it faces a classical bias-variance tradeoff. They can choose a rule that will have more stable payments, but at the expense of the average payment being lower, or they can choose an unweighted (or differently weighted) aggregation rule that provides higher payments on average but will have wider swings across aggregators and time, all else equal.
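For reference, here is a minimal sketch of the IVWM computation described above, assuming hypothetical arrays of per-site savings estimates and their variances (illustrative only, not taken from the CalTRACK codebase):

```python
import numpy as np

def inverse_variance_weighted_mean(savings, variances):
    """Minimum-variance weighted average of site-level savings.

    Each site is weighted by the reciprocal of its variance, so noisier
    estimates contribute less. If all variances are equal, this reduces
    to the unweighted sample mean.
    """
    savings = np.asarray(savings, dtype=float)
    weights = 1.0 / np.asarray(variances, dtype=float)
    return np.sum(weights * savings) / np.sum(weights)

# Equal variances -> identical to the plain mean of [0.10, 0.02, 0.30].
print(inverse_variance_weighted_mean([0.10, 0.02, 0.30], [1.0, 1.0, 1.0]))  # 0.14
```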

hshaban commented 6 years ago

Comment by gkagnew Tuesday Jan 31, 2017 at 22:26 GMT


I may have missed the full discussion of how this would work. Can you point me to notes or a write-up? Is it weighted by the variance of the predicting model? How does that work when working with the difference between two models? Either way, and any other way I can think of, the assumption of independence seems wrong. And I suspect we can come up with scenarios where the bias would be upward, for instance where high savings are weighted by well-behaved models with a clear and overwhelming cooling trend. Under the circumstances, it seems to me that limiting bias is the primary consideration. Is there any part of this process where variance is taken into consideration? Do savings estimates have to meet some pre-determined precision? If so, I could use a pointer to that discussion as well.

hshaban commented 6 years ago

Comment by matthewgee Monday Feb 13, 2017 at 11:55 GMT


@gkagnew IVWM was introduced as a starting proposal for the CalTRACK aggregation rule in technical working group meeting 18 (see above for link), with links to the detailed writeup, and opened for comment in the CalTRACK draft requirements Google doc. However, there wasn't an in-depth discussion of alternatives on the technical working group phone calls, and there were very few comments on the initial Google doc from technical working group members, so it's good they are getting discussed here.

In the current proposal, the IVWM uses the errors from the predicting model as the estimator of model variance. Thinking about the independence assumption: if it's violated in the case of site-level savings estimates, that would imply that the probability distribution of one house's savings is affected by the realization of another, conditional on weather (which has been marginalized out). Are you thinking that the violation comes from non-weather exogenous effects on savings that are correlated within region?

As far as having an aggregation rule that does better in reducing the bias of a portfolio-level estimate of savings, I think we'd need a way of consistently knowing (being able to model) the direction and magnitude of the bias.

Thoughts on how to do this well for a gross savings measure?

The alternatives to IVWM include unweighted means, or using alternative weights (like fractional savings uncertainty, which is outlined in the current spec).

I think the notion of having a pre-determined level of precision for savings inclusion in portfolio estimates is totally sensible. For CalTest, we set a pretty low bar with a minimum R^2 of .05 for inclusion. Any thoughts on having a model fitness criterion and what that number should be?
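As a point of reference, a model-fitness screen of this kind might look like the following minimal sketch (the data structure and field names are purely illustrative, not part of the CalTRACK spec):

```python
MIN_R_SQUARED = 0.05  # the low bar used for CalTest, per the comment above

def screen_sites(site_results, min_r2=MIN_R_SQUARED):
    """Keep only sites whose baseline model fit meets the minimum R^2.

    site_results: iterable of dicts with 'savings' and 'r_squared' keys.
    """
    return [site for site in site_results if site["r_squared"] >= min_r2]
```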

hshaban commented 6 years ago

Comment by mcgeeyoung Monday Feb 20, 2017 at 07:49 GMT


Absent any further discussion of this, the issue is closed.

hshaban commented 6 years ago

Comment by matthewgee Friday Feb 24, 2017 at 12:53 GMT


Great idea from Jarred Metoyer:

"On this particular item I don’t quite understand why both IVWM and un-weighted means wouldn’t both be output and stored. I think you wouldn’t have to open the closed issue if you add an unweighted mean comparison to v1 so the utility can get some indicator if there is a difference and there are comparisons as this roles out. It seems both need to be output for the same data and models for many more data sets (and thus models) so we have some basis for validating or making choices downstream."

This makes a lot of sense. I'm going to go ahead and submit a pull request for updating the monthly method aggregation spec, and I think we should make this the starting position for the daily aggregation spec.

hshaban commented 6 years ago

Comment by matthewgee Thursday May 04, 2017 at 17:57 GMT


Hat tip to @tplagge

As before, we use the 1000-home electricity sample, with (project start - 2 years) -> (project start - 1 year) as the “baseline period” and (project start - 1 year) -> (project start) as the “reporting period.” Variance is defined as the prediction error for the baseline model over the reporting period.

Fractional savings:

  • Unweighted mean ((pred - actual) / actual): 32.4%
  • Median: 2.49%
  • Cauchy distribution fit center: 1.74%
  • IVWM: -96.9%
  • Unweighted mean savings, winsorized at 5-95%: 5.81%
  • Unweighted mean savings, winsorized at 1-99%: 9.11%
  • IVWM, winsorized values (not weights) at 5-95%: -29.8%
  • IVWM, winsorized weights (not values) at 5-95%: 11.5%
  • IVWM, winsorized weights and values at 5-95%: 2.51%

It appears that inverse variance weighting is badly distorted by a small number of sites with large negative savings and very low variance--basically, by sites with near zero usage in the baseline period but significant nonzero usage in the reporting period. (If the usage pattern changes in the middle of the baseline period, the variance can get pushed down to a reasonable value but the fractional savings stays very high.)

Winsorizing these outliers leaves little obvious trend in terms of variance versus savings, and brings the IVWM in line with the median.

Absolute savings:

  • Unweighted mean savings (pred - actual): 331.4 kWh
  • Median: 171.6 kWh
  • IVWM: -315.5 kWh
  • Unweighted mean savings, winsorized at 5-95%: 292.2 kWh
  • Unweighted mean savings, winsorized at 1-99%: 330.5 kWh
  • IVWM, winsorized values (not weights) at 5-95%: -294.3 kWh
  • IVWM, winsorized weights (not values) at 5-95%: 30.8 kWh
  • IVWM, winsorized weights and values at 5-95%: 42.0 kWh
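For concreteness, here is a sketch of how the per-site variance and these candidate aggregates might be computed. The array names and helpers are illustrative only, not the code that produced the numbers above:

```python
import numpy as np

def site_variance(pred, actual):
    """Variance of the baseline model's prediction errors over the reporting period."""
    residuals = np.asarray(pred, float) - np.asarray(actual, float)
    return residuals.var(ddof=1)

def winsorized(x, lower_pct=5, upper_pct=95):
    """Clip values below/above the given percentiles (truncate, don't remove)."""
    lo, hi = np.percentile(x, [lower_pct, upper_pct])
    return np.clip(x, lo, hi)

def aggregate_fractional_savings(pred_by_site, actual_by_site):
    """pred_by_site / actual_by_site: lists of per-site reporting-period usage arrays."""
    frac = np.array([(np.sum(p) - np.sum(a)) / np.sum(a)
                     for p, a in zip(pred_by_site, actual_by_site)])
    weights = np.array([1.0 / site_variance(p, a)
                        for p, a in zip(pred_by_site, actual_by_site)])
    w_frac, w_weights = winsorized(frac), winsorized(weights)

    return {
        "unweighted_mean": frac.mean(),
        "median": np.median(frac),
        "ivwm": np.average(frac, weights=weights),
        "unweighted_mean_winsorized_5_95": w_frac.mean(),
        "ivwm_winsorized_values_5_95": np.average(w_frac, weights=weights),
        "ivwm_winsorized_weights_5_95": np.average(frac, weights=w_weights),
        "ivwm_winsorized_both_5_95": np.average(w_frac, weights=w_weights),
    }
```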

hshaban commented 6 years ago

Comment by tplagge Thursday May 11, 2017 at 18:14 GMT


Update to the above: As suggested on last week's call, subsampling is a good way to assess the quality of these proposed aggregate statistics. A good summary statistic should be a low-scatter measure of centrality for a portfolio (though that is not the only relevant criterion!). By computing the summary statistics on many subsamples, we can get a sense for their distribution. We do expect the median to be the best according to this metric, though that doesn't imply we should just use the median.

Here's what I did:

  1. Take 100 subsamples of 100 homes’ electric data.

  2. For each subsample:

    • Compute fractional savings for each home.
    • Compute summary quantities (inverse variance-weighted mean, median, etc.)
  3. Take mean and standard deviation of the resulting 100 subsamples’ summary quantities.

For the summary quantities which are winsorized, we set the lowest 5% of fractional savings estimates to the 5th lowest value, and the highest 5% of fractional savings to the 5th highest value--essentially, we truncate (but don't remove) the outliers. I'll consider thresholds other than 5% as well.
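A minimal sketch of this subsampling procedure, assuming an array of per-site fractional savings and variances and any of the summary statistics discussed above (the names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

def subsample_scatter(frac_savings, variances, statistic,
                      n_subsamples=100, subsample_size=100):
    """Mean and standard deviation of a summary statistic across random subsamples.

    `statistic` is any callable of (savings, variances) for one subsample,
    e.g. lambda s, v: np.median(s), or an IVWM variant.
    """
    frac_savings = np.asarray(frac_savings, float)
    variances = np.asarray(variances, float)
    estimates = []
    for _ in range(n_subsamples):
        idx = rng.choice(frac_savings.size, size=subsample_size, replace=False)
        estimates.append(statistic(frac_savings[idx], variances[idx]))
    estimates = np.asarray(estimates)
    return estimates.mean(), estimates.std()
```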

Results:

  • IVWM, winsorized savings and winsorized weights: 2.7% +- 4.0%
  • IVWM, winsorized savings: -17.4% +- 33.0%
  • IVWM, winsorized weights: -2.6% +- 11.2%
  • IVWM, no winsorization: -2.2% +- 45.2%
  • Unweighted mean: 30% +- 65%
  • Unweighted mean, winsorized: 7.0% +- 4.0%
  • Median: 2.6% +- 1.5%

Outlier handling is clearly very important in reducing scatter, especially removing outlier weights. Unweighted mean and straightforward IVWM have very high scatter; the median has the lowest, as expected. Inverse variance weighting does not appear to be appreciably reducing the scatter, but it's possible that the differences in mean savings estimates here indicate the presence of correlation between variance and changes in usage.

Changing the winsorization threshold (note that each is computed with a different set of subsamples):

  • 50% threshold: 2.6% +- 1.5% (equivalent to the median)
  • 20% threshold: 2.1% +- 2.2%
  • 15% threshold: 2.0% +- 2.2%
  • 12.5% threshold: 2.6% +- 2.6%
  • 10% threshold: 1.9% +- 2.3%
  • 7.5% threshold: 3.6% +- 2.0%
  • 5% threshold: 2.7% +- 4.0%
  • 2.5% threshold: 3.3% +- 5.0%
  • 1% threshold: 0.7% +- 10.4%
  • 0% threshold: -2.2% +- 45.2% (IVWM with no winsorization)

Recommendation: Since IVWM does not appear to be reducing the scatter, I suggest either median or winsorized unweighted mean as the appropriate aggregate quantity. Winsorizing with a threshold of between 5-10% appears to be a happy medium.

hshaban commented 6 years ago

Comment by gkagnew Tuesday May 16, 2017 at 17:35 GMT


I’m trying to catch up with the conversation. Please bear with me. We appear to have moved away from IVWM to a discussion of trimming or winsorizing. I’ve done a little superficial research on winsorizing just to better understand how people think about using this technique, the implications, etc. I put this forward to get a discussion going.

I ran across the following, and it somewhat encapsulates the concerns I have been wanting to voice. Read it first before glancing below at the embarrassing source . . .*


Winsorizing seems particularly offensive to me though - you need to think about why your data needs cleaning in the first place. If you have an entry which is totally out-of-synch with the rest then usually the cause is going to be someone inputting data wrong, or a computer program going haywire. In this case, it makes sense to discard the data point, which is statistically equivalent to treating it as if it were missing-at-random. With winsorizing, it's almost like you're assigning some probability to the point being a mistaken entry, and some probability to it being a genuine value, and then averaging these to get the winsorized value. This is probably statistically equivalent to some non-missing-at-random procedure or some sort of mixture model, but it's not clear at all what this model actually is. So it's difficult to say what your winsorized value actually means. But in any case, if the data genuinely has extreme values which aren't data-entry type errors (i.e. if the data is legitimately being generated by a heavy-tailed distribution), then you should absolutely, absolutely not be throwing values away because you don't like them. There's no statistical justification for that whatsoever. If you're concerned about extreme values increasing the variance of your estimators then use robust statistics instead - use the median rather than the mean, the Spearman correlation rather than the Pearson, the interquartile range rather than the variance, etc. etc.

This seems apropos to our situation because the outliers we are dealing with are almost assuredly not random data input errors but part of the highly variable reality of residential customer energy consumption, which is the key part of our challenge here. The underlying data process includes these outliers and a whole lot more that we cannot explain or control for. It seems completely arbitrary to trim or winsorize any of these data points out when we know full well that the same dynamics are potentially moving every datapoint in the distribution around in some way.

If the intent is to move us back to a result closer to the median, is there not some better way to do that? And alternatively, can we recognize, as I believe Savvy mentioned, that this goes back to the primary justification for a comparison group? The comparison group will make the precision even worse but should address the effect of outliers as well as other drift.

*This link gets off-color/offensive in places, just fair warning. And I can't speak to the specific source of the comment, though the perspective makes sense to me in our situation: https://www.econjobrumors.com/topic/trimming-or-winsorizing

hshaban commented 6 years ago

Comment by gkagnew Tuesday May 16, 2017 at 17:35 GMT


wow, I guess I better figure out how I made that quotation so big and bold, eh?

hshaban commented 6 years ago

Comment by tplagge Tuesday May 16, 2017 at 18:33 GMT


I do think that a compelling case can be made for using robust statistics--median in particular. It's more like, if we do decide to use the mean, it's important to either winsorize or throw out outliers--just because people are weird and unpredictable, and so by random chance, a sufficiently large portfolio will inevitably pick up the odd house where the homeowner installs a swimming pool or decides to go off the grid during the first week of the reporting period.

So why would one use the winsorized or outlier-removed mean as opposed to the median? Let's assume arguendo that there are two classes of efficiency projects in the world: (1) slam dunks, where the "real" savings will be 50%; and (2) incremental wins, where the "real" savings will be 10%. Assume further that the second class is much more common and that there are no true outliers (no pools, no unabombers).

So each aggregator is selecting a few projects from pool (1), where the estimated savings will be a normal distribution centered around 50% with width 10%, and a lot of projects from pool (2), where the estimated savings will be a distribution centered around 10% with width 10%.

Let's then consider Aggregator A, which performs 1000 projects from pool 1 and 9000 projects from pool 2. The mean fractional savings we would estimate for this aggregator would be 14.2%, which--since in our hypothetical there are no true outliers--is the correct answer. The winsorized mean (5% threshold) would be 14.0%, the outlier-removed mean (5% threshold) would be 13.1%, and the median would be 11.5%.

In other words, using the median would give the aggregator only 80% of the credit they deserved, the mean with outliers removed would give them 92% of the credit they deserved, and the winsorized mean would give them 98.6%.

What I'm arguing is that there's a tradeoff between having a good measure of centrality on one hand, and crediting aggregators for identifying particularly successful projects on the other. Winsorization is a bit wishy-washy from a theoretical perspective, for the reasons your quote points out, but it does seem to be a fairly attractive middle ground from a practical one...
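A quick simulation sketch of the hypothetical above, under the stated assumptions (1000 projects at roughly 50% savings, 9000 at roughly 10%, both with 10% scatter, no true outliers); the exact numbers will vary with the random draw:

```python
import numpy as np
from scipy.stats import trim_mean
from scipy.stats.mstats import winsorize

rng = np.random.default_rng(1)

# 1000 "slam dunk" projects (~50% savings) plus 9000 "incremental wins" (~10%).
savings = np.concatenate([
    rng.normal(0.50, 0.10, 1000),
    rng.normal(0.10, 0.10, 9000),
])

print("mean:           ", savings.mean())                                  # ~14%
print("winsorized mean:", winsorize(savings, limits=(0.05, 0.05)).mean())  # close to the mean
print("trimmed mean:   ", trim_mean(savings, 0.05))                        # somewhat lower
print("median:         ", np.median(savings))                              # ~11.5%
```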

hshaban commented 6 years ago

Comment by gkagnew Tuesday May 16, 2017 at 19:09 GMT


Interestingly, the extreme example you had to go to (an unrealistic, imbalanced bimodal distribution with no outliers) strikes me as an indication of how hard one has to work to find examples where the median fails so badly. Also, comparing to winsorizing in an example with no outliers would seem to downplay the potential shortfalls of winsorizing. What about distributions closer to what we are dealing with, where one or the other tail is heavier? It seems to me winsorizing could move the resulting mean estimate quite a ways around either side of an un-moving median with reasonable assumptions.

This does raise an issue I don't remember hearing or seeing discussed: the methods we are producing need to be robust down to potentially quite small sample sizes if savings are determined at the aggregator level. Do we need to address that explicitly? Does that recommend one approach over another?

hshaban commented 6 years ago

Comment by mcgeeyoung Tuesday May 16, 2017 at 20:08 GMT


@gkagnew I think it's important to remember the purpose of this part of the CalTrack spec. We are proposing a set of options for program managers to consider as they aggregate savings, and we are empirically testing those options with the CalTrack data to consider the consequences. Whether or not winsorizing is the best solution is immaterial. That's not for us to decide. What Tom has done is show the effects of this procedure on the available data. Other solutions like calculating the IVWM or the median have other advantages and other tradeoffs. Our job is not to have theoretical arguments, but to ground our guidance in the analysis of the data.

If you wanted to take the available 1000 project dataset and propose an alternative aggregation technique and show how it performed better along certain lines, I think the process would benefit from your contribution.

hshaban commented 6 years ago

Comment by tplagge Tuesday May 16, 2017 at 20:51 GMT


My example was a bit contrived, sure, but the underlying point is that whether there are dramatically more positive outliers than negative ones, or whether the distribution is skewed in some other interesting way, has an impact on which aggregation is the fairest to participants. And I'd argue it's the mean that is truest to the spirit of weather normalised metered savings, which is why I'm inclined to at least present it in some relatively robust form as an alternative.

That all having been said, I'm more than happy with offering up the median first and highlighting the fact that it is a robust estimate of centrality.

hshaban commented 6 years ago

Comment by gkagnew Tuesday May 16, 2017 at 22:26 GMT


@McGee Theoretical, eh? Them’s fighting words . . . ;^)

Just to clarify here, DNV GL's primary role here is to make sure that the process, data driven or otherwise, stays consistent with the bigger picture. As we have said all too often, there are fundamental challenges with pre-post delta as a proxy for savings. Those challenges are on display in the outliers and throughout the distribution that we are trying to wring a result from. I want to be sure that we are thinking clearly, in the larger context, about the implications of shedding or down-weighting outliers. Let's also not lose sight of the fact that all of our testing points to a median 2.6% increase in consumption that would represent a ~10 to 25% hit on the savings estimates for the test years we are looking at.

I was not, in fact, being theoretical in my comment, but asking what our options are, for instance, if we go with the median. Are there acceptable ways to get some sort of variance measure from a median? Alternatively, do we go back to Blake's recommendation in #63 regarding a site-level criterion that would remove problematic sites? You seemed to leave that open as an option. Perhaps there is something along the lines of the traditional minimum R-squared cut-off that removes likely problematic sites throughout the distribution. If I am understanding correctly, we get a sense of what effect that might have from Tom's examples of winsorizing the weights in the IVWM tests. Finally, much of this becomes moot if a comparison group is in the mix.

hshaban commented 6 years ago

Comment by mcgeeyoung Tuesday May 16, 2017 at 22:54 GMT


Indeed, getting to a point where we can connect site-based normalized metered savings to aggregated Payable savings is going to be the sweet spot for this effort. I would defer to the EM&V experts on whether site-based savings are how you want to calculate Net/Claimable. Obviously, comparison groups and panel regressions seem to be in vogue now. But our work here, providing guidance for how to calculate Payable savings, is different from getting to Claimable/Net (the P4P EM&V plan seems to point in the right direction for evaluating Net). Rather, we need to provide guidance on Payable that will make sense for the program and the aggregator.

In that vein, one of our main considerations should be uncertainty. Whatever aggregation rule gets used, it needs to be specific, and the results of applying the rule need to be predictable. If an aggregator is looking at their own portfolio, they need to be able to perform this same technique and get pretty close to the same answer. A rule that reduces uncertainty for the aggregator increases their confidence in making investments in the program.

hshaban commented 6 years ago

Comment by matthewgee Wednesday May 17, 2017 at 17:06 GMT


I've been thinking about this great discussion a lot over the last couple of days and there are several issues that we're bundling together that are worth unpacking and addressing one by one.

The objective of this issue is to arrive at guidance for program implementers on how payments are determined using aggregations of site-level weather normalized savings. The simplest form of aggregation would be to total all site-level savings, multiply by the price, and that's the payment. Equivalent to that would be to take the unweighted mean.

We've enumerated a number of issues with raw totals or unweighted means that make them a risky choice for program implementers. Those issues are:

  1. Treatment of Outliers. Outliers (negative or positive) that may be a result of statistical noise or data quality issues can lead to extreme swings in mean and total savings for a given portfolio, and are incredibly risky for a program implementer to ignore. Dealing with outliers in the aggregation rules should be a core recommendation.

  2. Treatment of Negative Savings Values. Aggregating site-level weather normalized gross savings alone may not fully account for the value of savings to the grid or for meeting policy goals, because for both grid and policy benefits stakeholders often care about the change in use above or below what was expected. For P4P programs, implementers have the new option of signaling the full value to the grid by adjusting (using control groups or otherwise) the price paid for a given quantity based on sub-population-level expectations. However, price-based value adjustments don't work for households with negative savings. Therefore, aggregation rules should, at the minimum, include guidance for how to treat negative savings.

Does that capture the two main challenges we've been discussing, or am I missing something? (please add below)

Given the issues above, we've been discussing and testing out several potential solutions. Those potential solution sets we are deciding among are:

1) To winsorize or not to winsorize? That is the question. Given the effects of outliers on portfolio-level savings statistics, is truncation, winsorizing, or leaving everything in and letting the chips fall where they may the best recommendation to program implementers for determining payable savings for a portfolio?

2) Alternative robust statistics for centrality. Given the effects of outliers, should we recommend an alternative robust statistic (median, weighted mean) instead of the unweighted mean to determine payable portfolio-level savings?

3) To adjust or not to adjust? That is the question. Given the likelihood of negative savings, and that some of these negative savers are above the sub-population conditional expectation, should we recommend that negative savings be taken at face value, thrown out, or adjusted using comparison groups to potentially make some subset positive? (The last option, of course, implies the adjustment would be applied to all sites, not just those with negative savings.)

Do those sound right as the set of options we are working through for addressing the above challenges in aggregation and payment recommendations?

I'm now going to go through one by one summarizing the evidence to date, but I want to make sure we're clearly enumerating the issues and what needs to be decided so we can address them in sequence.

hshaban commented 6 years ago

Comment by matthewgee Wednesday May 17, 2017 at 17:31 GMT


To winsorize or not to winsorize? That is the question

I think @gkagnew's concerns about winsorizing make a lot of sense. Here's a very good breakdown and empirical testing of the effects of truncation, winsorizing, or keeping outliers on simple difference-in-means estimation that was put together by a friend of mine on Facebook's data science team.

The main takeaways from John's analysis that I think generalize to our case are that, unsurprisingly, there is no method of outlier treatment that universally beats the option of doing no adjustment for outliers. Choosing a good window for outlier removal can lead to an optimal reduction in RMSE on point estimation, but choosing a bad window can lead to worse performance than leaving everything in. The best way to choose that window is to know something about the ground truth of the parameter you are trying to infer, which, in our case, isn't exactly feasible.

In the analysis above it's clear that not doing anything with outliers AND choosing an unweighted mean as the relevant statistic for payable savings is a very risky proposition.

All this makes me inclined to go down the robust statistic approach rather than tempt fate by somewhat arbitrarily choosing an outlier window in our guidance. I think this echoes both @tplagge's and @gkagnew's points above. So let's talk through that. Next up, robust statistics.

hshaban commented 6 years ago

Comment by mcgeeyoung Sunday May 21, 2017 at 06:12 GMT


Following up on the discussions of the working group, the final recommendation for aggregation is as follows. The program administrator should establish a +/-50% (from zero) cutoff for "normal" site-level savings. Any site where the savings exceed that threshold would be eligible for appeal and potentially excluded from aggregation. The basis for exclusion would be established as part of the terms of program participation.

This recommendation is made under the assumption that site-based savings will be summed across the portfolio.
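A minimal sketch of that recommendation, assuming hypothetical per-site absolute savings, fractional savings, and a flag recording which flagged sites were actually excluded through the appeal process:

```python
import numpy as np

SAVINGS_CUTOFF = 0.50  # |fractional savings| beyond this is eligible for appeal

def payable_portfolio_savings(site_savings_kwh, fractional_savings, excluded_after_appeal):
    """Sum site-level savings across the portfolio, dropping only those sites
    that exceeded the +/-50% cutoff AND were excluded via the appeal process."""
    site_savings_kwh = np.asarray(site_savings_kwh, float)
    fractional_savings = np.asarray(fractional_savings, float)
    excluded_after_appeal = np.asarray(excluded_after_appeal, bool)

    flagged = np.abs(fractional_savings) > SAVINGS_CUTOFF  # candidates for appeal
    keep = ~(flagged & excluded_after_appeal)
    return site_savings_kwh[keep].sum()
```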