impactlab / caltrack

Shared repository for documentation and testing of CalTRACK methods
http://docs.caltrack.org

Proposal for outlier detection and removal #63

Closed houghb closed 7 years ago

houghb commented 7 years ago

On the call last week we acknowledged that there are significant outliers in the testing dataset and that these outliers are making our output statistics less useful.

We explored some different ways to identify outliers during the model selection process (where we train our models on one year of pre-treatment data, then test the model performance on a second year of pre-treatment data). The outlier detection approach we are proposing can also be used in the final specs to remove outliers from weather normalized savings estimates.

Here are the different approaches we considered, with some notes:

  1. An outlier is a premise where the annual usage changes >30% from the training pre-treatment year to the testing pre-treatment year

    • This works for identifying the outliers in the model selection process, but it is not readily extensible to the weather normalized savings, so we discarded it.
  2. An outlier is a premise where the absolute value of the fractional savings is greater than 0.75

    • Fractional savings is the same as in our output metrics: (predicted_daily_use.sum() - daily_use.sum()) / daily_use.sum() (see the sketch after this list).
    • In this case, premises whose annual usage differs by more than 75% from what the model predicts are thrown out as bad estimates. For the model selection process this makes sense because the only difference between the two years of data should be weather, and weather alone should not change usage by 75%. For the weather normalized savings, this cutoff value also makes sense because we don't expect to see real savings of 75% after a project.
    • With a cutoff value of 0.75 this drops less than 5% of premises from the electric or gas results. A cutoff value of 1.0 also works reasonably well and drops only 3%, but it leaves some significant outliers.
  3. An outlier is a premise with a fractional savings value in the top or bottom X percentile of the results

    • This approach works fine, but doesn't do quite as well as approach No. 2 at moving the medians and means closer together.
    • We would probably want to tune the percentile cutoff based on what the actual results look like for each different set of results.
    • X = 5% drops 10% of premises from the electric results
    • X = 2% drops 4% of premises from the electric results
    • X = 1% drops 2% of premises from the electric results
  4. An outlier is a premise with fractional savings more than X standard deviations away from the median

    • For X = 1.5 and X = 2 this only drops a single premise, so it doesn't meet our needs.
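
For concreteness, here is a minimal sketch in pandas of how the fractional-savings filters above (approaches No. 2, No. 3, and No. 4) might be applied. The `results` DataFrame and its column names are hypothetical toy data, not part of the CalTRACK spec:

```python
import pandas as pd

# Hypothetical per-premise results; real inputs would come from the
# trained models and observed usage.
results = pd.DataFrame({
    "premise_id": ["a", "b", "c", "d", "e"],
    "predicted_annual_use": [10500.0, 9800.0, 21000.0, 5200.0, 7400.0],
    "actual_annual_use": [10000.0, 10200.0, 10000.0, 5000.0, 8000.0],
})

# Fractional savings, as defined above:
# (predicted_daily_use.sum() - daily_use.sum()) / daily_use.sum()
results["fractional_savings"] = (
    results["predicted_annual_use"] - results["actual_annual_use"]
) / results["actual_annual_use"]

# Approach No. 2: fixed cutoff on the absolute fractional savings.
keep_cutoff = results[results["fractional_savings"].abs() <= 0.75]

# Approach No. 3: drop the top and bottom X percentile of the results.
x = 0.02  # X = 2%
lo, hi = results["fractional_savings"].quantile([x, 1 - x])
keep_percentile = results[results["fractional_savings"].between(lo, hi)]

# Approach No. 4: drop premises with fractional savings more than
# X standard deviations away from the median.
x_sd = 2.0
med = results["fractional_savings"].median()
sd = results["fractional_savings"].std()
keep_sd = results[(results["fractional_savings"] - med).abs() <= x_sd * sd]
```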

Recommendation

The two viable approaches to outlier detection that we explored (No. 2 and No. 3 above) each attempt to do a different thing:

Approach No. 3 requires a distribution of results to determine what the outliers are. In this scenario a premise can be dropped as an outlier when it is part of one subsample of the available premises, yet no longer be an outlier if the analysis is re-run with a larger, smaller, or different subset of premises.

In contrast, approach No. 2 can be applied at the premise level: it determines for each premise whether our estimate is "good" based on the logic that we should not see more than 75% savings. Something identified as a poor estimate under this approach will always be discarded, even if additional premises are run.

Essentially these two approaches do different things -- No. 3 identifies true outliers in some distribution, while No. 2 identifies bad estimates. We are recommending No. 2 because it kills two birds with one stone: it removes outliers and also filters out bad models before aggregation.

To make sure I'm clear, I am proposing that we add the following to the analysis spec: _"Remove premises for which the absolute value of the fractional savings is greater than 0.75, where fractional savings is defined as (total_annual_predicted_use - total_annual_actual_use) / total_annual_actual_use"_
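
As a sketch of the proposed rule, assuming `predicted_daily_use` and `daily_use` are the per-premise daily usage series referenced earlier (the toy values below are illustrative only):

```python
import pandas as pd

# Illustrative daily usage series for one premise; real inputs would be
# the model's predicted daily use and the observed daily use.
daily_use = pd.Series([28.0, 31.5, 30.2, 27.9])
predicted_daily_use = pd.Series([29.1, 30.8, 30.5, 28.3])

# Fractional savings per the proposed spec language.
fractional_savings = (
    predicted_daily_use.sum() - daily_use.sum()
) / daily_use.sum()

# Remove the premise from the results if this is True.
is_outlier = abs(fractional_savings) > 0.75
```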

Output comparison for electric (before and after removing outliers)

| ModelID | Model Description | Climate Zone | Number of Sites in Group | Mean daily use (training) | SD in daily use (training) | Mean daily use (testing) | SD in daily use (testing) | Mean heating balance point | Mean cooling balance point | Mean CVRMSE | Mean NMBE | Median CVRMSE | Median NMBE |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | Current spec with outliers removed | 16 | 2 | 33.36304658 | 20.45473949 | 32.44178836 | 20.48904894 | 59 | 69.5 | 29.66660401 | -9.393668691 | 29.66660401 | -9.393668691 |
| 1 | Current spec with outliers removed | 2 | 79 | 15.81895657 | 12.04616068 | 15.65663614 | 10.45600452 | 59.22727273 | 67.08108108 | 40.52787925 | -0.0624178 | 36.05715704 | 1.591909755 |
| 1 | Current spec with outliers removed | 3 | 186 | 14.37042834 | 9.424762735 | 14.17563148 | 9.413943469 | 60.28025478 | 67.11111111 | 44.18369412 | -0.117802197 | 36.85247073 | 0.309592477 |
| 1 | Current spec with outliers removed | 4 | 82 | 18.44725684 | 11.58791841 | 18.27547939 | 11.19745294 | 59.80882353 | 67.35087719 | 36.78670718 | -0.040641884 | 34.7236992 | 1.007863796 |
| 1 | Current spec with outliers removed | 5 | 16 | 17.73247106 | 8.286043853 | 17.0151938 | 8.461434828 | 61.84615385 | 68 | 26.76734401 | -4.350611209 | 24.07013201 | -3.00739748 |
| 1 | Current spec with outliers removed | 11 | 52 | 30.80566099 | 19.62299251 | 30.11455837 | 19.676095 | 60.07894737 | 69.36538462 | 36.31790551 | -1.684012074 | 33.02752803 | -3.808143451 |
| 1 | Current spec with outliers removed | 12 | 360 | 23.57038212 | 16.18791723 | 23.20346066 | 16.08652957 | 58.316 | 67.721875 | 42.11296887 | -2.193839622 | 37.36518505 | -1.667744641 |
| 1 | Current spec with outliers removed | 13 | 118 | 33.07237532 | 22.30305605 | 32.33749649 | 21.5743705 | 60.65217391 | 71.1440678 | 36.15120211 | -4.359162633 | 30.97849501 | -3.436827338 |

| ModelID | Model Description | Climate Zone | Number of Sites in Group | Mean daily use (training) | SD in daily use (training) | Mean daily use (testing) | SD in daily use (testing) | Mean heating balance point | Mean cooling balance point | Mean CVRMSE | Mean NMBE | Median CVRMSE | Median NMBE |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | Current spec | 16 | 2 | 33.36304658 | 20.45473949 | 32.44178836 | 20.48904894 | 59 | 69.5 | 29.66660401 | -9.393668691 | 29.66660401 | -9.393668691 |
| 1 | Current spec | 2 | 91 | 16.77561635 | 13.50253338 | 15.31266112 | 10.56994922 | 59.24 | 67.04878049 | 46.25520309 | -6.323619551 | 38.5635312 | -1.530476594 |
| 1 | Current spec | 3 | 216 | 14.5414926 | 10.11924684 | 13.49545309 | 9.541706383 | 60.27683616 | 66.92 | 166.5966534 | -123.4805673 | 38.42499052 | -1.225046386 |
| 1 | Current spec | 4 | 90 | 18.26733008 | 11.60227117 | 17.66383191 | 11.21901263 | 60.06756757 | 67.40983607 | 39.49459982 | -0.663687944 | 35.27508383 | 1.007863796 |
| 1 | Current spec | 5 | 17 | 17.4686972 | 8.197565129 | 16.70343116 | 8.416398823 | 62.07142857 | 69.16666667 | 26.76734401 | -4.350611209 | 24.07013201 | -3.00739748 |
| 1 | Current spec | 11 | 59 | 30.70871162 | 20.80840078 | 30.67601254 | 23.4510035 | 60.325 | 69.33333333 | 40.38655729 | -6.11977188 | 33.12688587 | -4.3478681 |
| 1 | Current spec | 12 | 389 | 23.49813289 | 16.10049864 | 22.68430358 | 16.10464327 | 58.3129771 | 67.64117647 | 46.43881721 | -6.711320706 | 37.64979899 | -1.8570027 |
| 1 | Current spec | 13 | 136 | 32.08717042 | 22.1922064 | 30.54500668 | 21.9339118 | 60.81372549 | 71.18796992 | 44.78832609 | -10.33473553 | 32.19374289 | -3.517148987 |
mcgeeyoung commented 7 years ago

@houghb This is a really interesting writeup. I wonder if you could briefly enumerate your proposed use cases for this approach (#2). Are you suggesting that we use it to cull our 1,000 meter dataset? Or are you suggesting a broader application of this technique?

houghb commented 7 years ago

I am proposing that we remove the premises identified above any time we use our models to make predictions. I don't think we should cull the 1,000 home dataset itself, but before reporting any summary or output statistics we should remove these premises from the set of results. We would do the same when generating weather normalized savings estimates, before we enter the aggregation steps.

mcgeeyoung commented 7 years ago

@houghb I would feel more comfortable putting this in as a recommendation under the Aggregation section. In both your proposal above and in our aggregation recommendations, we are providing guidance on how to deal with the effects of outliers, or non-standard distributions. However, we shouldn't decline to report (i.e., censor) the outputs; rather, we should bring attention to them and suggest good methods for handling them (as above).

mcgeeyoung commented 7 years ago

Closing this now that a final recommendation has been made on aggregation.