CalTRACK Issue: [Monthly Fixed Effects]

AndrewYRoyal commented 5 years ago

Prerequisites

[x] Are you opening this issue in the correct repository (caltrack/grid/seat)?
[x] Did you perform a search of previous Github issues to check if this issue has been previously considered?

Article reference number in CalTRACK documentation (optional): [3.11, 3.75]

Description

I’d like to propose using month-of-year fixed effects to control for seasonal variation, which could substitute for the current method of using monthly regressions (see 3.11 & 3.7.5). The fixed-effects model is written as follows: where month_m(t) a fixed adjustment estimated for month m. A few clarifying points/questions:

It is common in the research literature to use fixed-effects models to adjust for seasonal (monthly) variation in energy consumption. The fixed-effects approach, therefore, deserves Caltrack’s attention. The following three peer-reviewed articles (totaling 2,000+ citations) evaluated program savings by applying fixed effects regressions to metered consumption: Allcott 2011, Ito 2014 and Jessoe, Rapson 2014. And the list goes on.
My own tests using public metered data show that a fixed effect model outperforms the monthly WLS model when evaluated along standard metrics (R2 and CVRMSE). See more below or code here.
Do the reporting statistics for the current monthly WLS model adjust for degrees of freedom? The monthly WLS model essentially interacts each of the regression coefficients with month-of-year (i.e., it estimates separate coefficients for each month). So, a time-of-week model estimated for each month over an entire year will have a total of 168*12 = 2,016 free parameters. Adding 7 temperature bins increases the free parameter count to 2,100. It could therefore be potentially misleading to report metrics such as R2 without adjusting for the monthly prediction model’s total quantity of free parameters. The fixed effects model, in contrast, has only 168 + 12 + 7 = 187 free parameters.

Proposed test methodology

The test is conceptually straightforward—we ought to test whether the current monthly WLS model specified in the hourly methods section outperforms the proposed fixed effects model. The metrics can be the following: • R2 (within-sample) • CVRMSE (within-sample) • R2 (out-of-sample-sample) • CVRMSE (out-of-sample-sample) Ideally the tests would also adjust for the disparity in the degrees of freedom—but I don’t think that will be an issue with the out-of-sample tests.

My own test: I’ve gone ahead and tested the models on my own using a public dataset containing one year of hourly metered data for a sample of 507 meters (from a college campus). The code, found here, can be replicated as-is on any computer with R installed (along with data.table, RCurl and ggplot2 packages). It runs on my PC in about 13 minutes (it’s faster if we demean the data rather than estimating the full set of coefficients—but estimating the full model makes the code more intuitive). The training (testing) sample in this exercise was the odd (even) weeks of the year. Here are the within-sample comparisons of the monthly WLS (WLS) and fixed-effects (FE) R-squared:

And out-of-sample:

These measurements do not adjust for model degrees of freedom, but they nevertheless strongly favor the fixed effects model. The CV measures (see code) similarly favor the FE model.

Acceptance Criteria

Since I have already tested the proposal, maybe the best way to move forward is to review the current tests and attempt replication. I submit that the method should be accepted if it is confirmed that the findings I’ve reported above are internally valid (e.g. no code error or model misrepresentation) and/or externally valid (the comparison is robust to alternative meter datasets). The same conditions should apply to a "corrected test" if the one above is found to be flawed.

(edit: just noticed the github code link isn't working-- also try this one: https://www.dropbox.com/s/1niqfdpecyf7ikf/methodCompare.html?dl=0)

Thank you for your consideration

Andrew Royal, Res-Intel

mcgeeyoung commented 5 years ago

Really interesting. I hope we can spend some time talking about this one!

EthanGoldman commented 5 years ago

It seems like the underlying phenomenon that we would be capturing with the monthly fixed-effect model is that buildings operate differently at different times of the year. Perhaps the heating and cooling systems are turned on and off once per year, or some other equipment (non-weather dependent) is only run during certain portions of the year. Additionally, there may be changes in the operations or occupancy that is consistently seasonal. The different modes of performance do not necessarily fall exactly on calendar month boundaries, however.

To select the appropriate portions of the year on which to apply fixed effects, our goal is to add the fewest degrees of freedom while minimizing the model error. Note that this method, in particular, should be tested against out-of-sample data. Some possible approaches to consider:

First, generate a monthly fixed-effect model, then perform a clustering analysis to try grouping adjacent months into a single term; optimize CVRMSE for 1-12 periods. This could end up creating summer/non-summer periods, four seasonal periods of differing lengths, a single separate month (indicating seasonal shut-down or crunch time around the holidays, for example), or simply determine that the monthly terms don't improve the model and removing them. This could also be done with weekly terms instead of monthly.
Start by generating a standard TOWT model and calculating daily residuals. Automatically find breakpoints in the residuals (those transitions where one day is more statistically similar to the previous period and the following day is more similar to the next period). Rank the breakpoints by the strength of the division and iteratively test whether the CVRMSE improves when adding fixed-effect terms for each subset of the year defined by the breakpoints. Note that Dec-Jan (or wherever the year starts/ends) should be joined into a single period unless there is actually a strong breakpoint there.
For either of the previous methods (but particularly the second one, since it's much higher-resolution in the way it determines breakpoints) consider using weather-based transition points rather than calendar-based transition points. This is different than the binning technique used in TOWT, since we are looking for contiguous periods of the year to apply fixed effects to, not all days or hours in that bin. This would effectively create periods defined by something like "when the temperature consistently stays above/below X for at least Y days" or using a rolling average. It might make sense to explicitly define a maximum of four periods for this technique since it's explicitly looking for the beginning and end of heating and cooling seasons. This would be less effective at capturing other season variations like summer schedules for schools, de/commissioning of pools or other seasonal-use equipment besides HVAC.

steevschmidt commented 5 years ago

To expand on Ethan's example of what we're trying to analyze:

Perhaps the heating and cooling systems are turned on and off once per year...

In residential buildings we see thermostats changed in and out of heating or cooling modes at different times each year dependent on the weather: heating mode may be turned off anytime March thru June, and turned back on anytime during the Fall. Additionally, many homes employ supplemental heating and cooling from space heaters, fans, and room AC units during winter or summer peaks.

None of these changes are tied to specific calendar months, and their timing will vary from baseline period to reporting period. Any counterfactual model that assumes these actions will occur at the same time every year will introduce errors in the calculation of savings.

jkoliner commented 5 years ago

@steevschmidt The last point you make there is debatable. Behavioral changes are necessarily new responses to similar circumstances, so a counterfactual that tries to model those new responses would fail to accurately measure energy changes. If we ran pre and post period models (as I know you've suggested elsewhere), we could get around that but we do need to carefully handle any occupancy or behavioral factors in those models. I'm not saying that months or time periods of year are a perfect proxy for expected equivalent behavior in the absence of treatment, but I don't think it's quite the flaw you suggest.

AndrewYRoyal commented 5 years ago

@EthanGoldman Thanks for the detailed suggestions. I can investigate the second option with some of the BuildingGenome data and have the results posted in a few days. Clustering on residuals from the non-seasonal model seems like a good idea in theory, but I'm worried that the clusters (or switch-points) might track non-routine events or other artifacts rather than seasonal effects. We'll see. I like the idea of using temperature to inform the clustering... that ties the model down a bit so we can be more certain that the bins are tracking seasonal changes rather than picking up some unexplained variance in the residuals.

AndrewYRoyal commented 5 years ago

I took the suggestion of testing “clustered” FE and ran some tests on a subset of the data genome dataset, including only dorm rooms. I figure dorms are the closest to residential the dataset offers.

Here is an interactive dashboard that details the findings: http://res-intel.info:3838/bg-caltrack/ (be patient w/ loading times)

A few things I noticed

The fixed effects that use consumption/residual clusters perform the best on CVRMSE. However, the clusters don’t necessarily map onto seasonal changes (see the “Cluster” panel on the dashboard)—and I’m afraid they may just be overfitting non-routine fluctuations.
The monthly FE model fits the data closer (lower CVRMSE) than the monthly WLS. It actually looks like the ‘overfitting’ worry might run in the opposite direction—the monthly WLS models offer more conservative predictions (closer to the mean).

The clustering was performed after using dynamic-time-warping (DTW) comparisons on daily load/residual/temperature paths. Its possible to use other distance metrics, such as Euclidian, but I think DTW does pretty well.

jkoliner commented 5 years ago

I don't know if that is overfitting, Andrew. Do you have this dorm's school schedule? I see some indication that the residuals-based model is fitting to periods of different occupancy. I see summer school, maybe some short term breaks and winter break, beginning and end of summer, and the standard school period...

AndrewYRoyal commented 5 years ago

True. I don't have any schedule info and maybe you are right that it is premature to conclude overfitting. I just noticed that the 4 clusters often didn't break perfectly into seasonal blocks... but they come close, so maybe that's not a big deal

steevschmidt commented 5 years ago

Nice dashboard Andrew!

I personally don't think dorm rooms are a good proxy for homes, so I hope Hassan is able to run some analysis using residential data.

Also: @jkoliner and I discussed our prior comments in more detail offline and [I think] we agreed on the following:

In general “baking in” behavior (e.g. building characteristics changing a specific way in a specific calendar month) is a bad idea, unless there's clear evidence or a solid argument the behavior is consistent over time.
A good test might be to use building data spanning at least two years absent interventions, and test to see if the monthly terms allow the baseline model to beat a different model in the out of sample testing. Years with differing weather patterns would be important for such a test.

jkoliner commented 5 years ago

Steve has largely represented our consensus correctly. To add on: For point 1, I think that the dorm data will give us overly optimistic out-of-sample results because dorm schedules are fixed. For homes, the timing of summer vacations, family visits, and other vacant periods may be less regular. With a multi-year (or multi-site) baseline period that might be mitigated substantially, but a single year, single-site baseline with clustering will fit fixed effects to the irregularity. In general, data from large buildings should follow a more predictable usage pattern and is a better candidate for fixed effects.

hshaban commented 5 years ago

Sharing some preliminary results and recommendations:

Dataset: One year of AMI data from a sample of approximately 500 residential buildings without any known energy efficiency interventions. A set of commercial buildings from the EVO Testing Portal was also used.

Test procedure: The default Caltrack 2.0 hourly method with the three-month weighted baseline segmentation was then applied to the dataset. A second set of models was also fit to the data but using a single baseline. In both cases, the baseline was broken up into blocks of training data (70%) and test/out-of-sample data (30%).

Results:

Using out-of-sample CVRMSE as the error metric, the fixed effects model performed slightly better than the segmented baseline approach for the residential dataset and slightly worse for the commercial dataset.
As expected, model overfitting (difference between in-sample CVRMSE and out-of-sample CVRMSE) is reduced using the fixed effects approach due to the smaller number of independent variables.
The fixed effects approach generally shows more hourly bias in the out-of-sample tests (a classic example of exchanging better precision for a more biased model).

Recommendations:

Using the fixed effects formulation yields some mixed results. Overfitting appears to be reduced. Relative out-of-sample performance depends on the dataset. Hourly predictions (which are the primary use case for the CalTRACK hourly methods) are more biased.
The fixed effects approach is an interesting simplification of the hourly methods, but requires more investigation using different datasets. Also, the model formulation does not correspond to the physical changes in residential buildings that alter the weather-to-energy correlation through the course of the year. We are recommending keeping the current approach pending more research on the hourly methods.

To apply the fixed effects version, this code segment was modified in https://github.com/openeemeter/eemeter/blob/master/eemeter/caltrack/hourly.py#L243, and used with “single” segmentation type instead of “three_month_weighted”:

def _get_hourly_model_formula(data):
        if (np.sum(data.loc[data.weight > 0].occupancy) == 0) or (
            np.sum(data.loc[data.weight > 0].occupancy)
            == len(data.loc[data.weight > 0].occupancy)
        ):
            bin_occupancy_interactions = "".join(
                [" + {}".format(c) for c in data.columns if "bin" in c]
            )
            return "meter_value ~ C(hour_of_week) + C(month) - 1{}".format(
                bin_occupancy_interactions
            )
        else:
            bin_occupancy_interactions = "".join(
                [" + {}:C(occupancy)".format(c) for c in data.columns if "bin" in c]
            )
            return "meter_value ~ C(hour_of_week) + C(month) - 1 {}".format(
                bin_occupancy_interactions
            )

Out of sample comparison for a set of commercial buildings (EVO Testing portal)

Out of sample CVRMSE comparison for a set of residential buildings

Average hourly bias in different months (out of sample results)

AndrewYRoyal commented 5 years ago

Thanks Hassan. It looks like the CalTrack 2.0 model does pretty well on EVO-- it even outperforms the gradient boost model we posted!

A few initial comments/questions:

The disparity between the hourly and CVRMSE findings might be explained by the fact that one is reporting a normalized metric and the other is not. The CVRMSE normalizes the error rate by average metered consumption, so each building/meter receives equal weight when you evaluate the median/average of this metric. The hourly plots, however, are just taking the average of the raw residuals, so buildings that consume more overall have a greater impact on the averages.
One way to resolve the disparity is to evaluate residuals as a percentage of actual consumption at each building-hour and then take the averages. I think that we should do something like this this before concluding that the FE model is worse at predicting daily loadshapes.
Is the "one-month" method fitting a flexible monthly intercept model or is it fitting a different model for each month?
I think my previous version of the WLS model may have been a bit inaccurate-- it was my understanding that the model applied 0.5 weights to ALL other months, but it looks like non-zero weights are actually only applied to the two neighboring months. Is that correct?

hshaban commented 5 years ago

Calculated the median error for each building at each hour in each month. Results posted below.
The fixed effects implementation uses a single model with a categorical/factor independent variable for the calendar month
Correct

jkoliner commented 5 years ago

The current recommendation is not to recommend fixed effects to the Steering Committee at this time. If a member for the working group disagrees, they should log their dissent in a comment below.

philngo-recurve commented 1 year ago

Closing stale issue in preparation for new working group

openeemeter / caltrack