Really interesting. I hope we can spend some time talking about this one!
It seems like the underlying phenomenon that we would be capturing with the monthly fixed-effect model is that buildings operate differently at different times of the year. Perhaps the heating and cooling systems are turned on and off once per year, or some other (non-weather-dependent) equipment is only run during certain portions of the year. Additionally, there may be changes in operations or occupancy that are consistently seasonal. The different modes of performance do not necessarily fall exactly on calendar month boundaries, however.
To select the appropriate portions of the year on which to apply fixed effects, our goal is to add the fewest degrees of freedom while minimizing the model error. Note that this method, in particular, should be tested against out-of-sample data. Some possible approaches to consider:
First, generate a monthly fixed-effect model, then perform a clustering analysis that tries to group adjacent months into single terms, optimizing CVRMSE for 1-12 periods. This could end up creating summer/non-summer periods, four seasonal periods of differing lengths, a single separate month (indicating a seasonal shut-down or crunch time around the holidays, for example), or simply determining that the monthly terms don't improve the model and should be removed. This could also be done with weekly terms instead of monthly terms. (A rough sketch of this grouping idea appears after this list of approaches.)
Start by generating a standard TOWT model and calculating daily residuals. Automatically find breakpoints in the residuals (those transitions where one day is more statistically similar to the previous period and the following day is more similar to the next period). Rank the breakpoints by the strength of the division and iteratively test whether the CVRMSE improves when adding fixed-effect terms for each subset of the year defined by the breakpoints. Note that Dec-Jan (or wherever the year starts/ends) should be joined into a single period unless there is actually a strong breakpoint there.
For either of the previous methods (but particularly the second one, since it determines breakpoints at much higher resolution), consider using weather-based transition points rather than calendar-based transition points. This is different from the binning technique used in TOWT, since we are looking for contiguous periods of the year to apply fixed effects to, not all days or hours in a given bin. This would effectively create periods defined by something like "when the temperature consistently stays above/below X for at least Y days," or by using a rolling average. It might make sense to explicitly cap this technique at four periods, since it is looking for the beginning and end of the heating and cooling seasons. It would be less effective at capturing other seasonal variations, such as summer schedules for schools or the de/commissioning of pools and other seasonal-use equipment besides HVAC.
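A rough sketch of the first approach (grouping adjacent months into periods and scoring CVRMSE for 1-12 periods). It assumes a daily dataframe df with columns meter_value, temperature, and month (1-12); the temperature-only base model, the greedy merge rule, and the column names are illustrative simplifications rather than part of any CalTRACK method, and the scoring should ultimately be repeated on held-out data.

import numpy as np
import statsmodels.formula.api as smf

def cvrmse(y, yhat, n_params):
    rmse = np.sqrt(np.sum((y - yhat) ** 2) / (len(y) - n_params))
    return rmse / np.mean(y)

def period_scores(df):
    # per-month average residual from a temperature-only model, used as a
    # crude similarity measure when deciding which adjacent months to merge
    base = smf.ols("meter_value ~ temperature", data=df).fit()
    month_effect = df.assign(resid=base.resid).groupby("month")["resid"].mean()
    groups = [[m] for m in range(1, 13)]  # start with one period per calendar month
    scores = {}
    while True:
        label = {m: i for i, g in enumerate(groups) for m in g}
        dat = df.assign(period=df["month"].map(label))
        formula = ("meter_value ~ temperature + C(period)" if len(groups) > 1
                   else "meter_value ~ temperature")
        fit = smf.ols(formula, data=dat).fit()
        scores[len(groups)] = cvrmse(dat["meter_value"], fit.fittedvalues,
                                     int(fit.df_model) + 1)
        if len(groups) == 1:
            return scores  # CVRMSE for 1..12 periods; pick the fewest periods that score well
        # merge the adjacent pair (Dec-Jan counts as adjacent) with the closest mean effects
        eff = [month_effect.loc[g].mean() for g in groups]
        gaps = [abs(eff[k] - eff[(k + 1) % len(groups)]) for k in range(len(groups))]
        i = int(np.argmin(gaps))
        j = (i + 1) % len(groups)
        merged = groups[i] + groups[j]
        groups = [merged if k == i else g for k, g in enumerate(groups) if k != j]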
To expand on Ethan's example of what we're trying to analyze:
Perhaps the heating and cooling systems are turned on and off once per year...
In residential buildings we see thermostats switched in and out of heating or cooling modes at different times each year depending on the weather: heating mode may be turned off anytime from March through June and turned back on anytime during the fall. Additionally, many homes employ supplemental heating and cooling from space heaters, fans, and room AC units during winter or summer peaks.
None of these changes are tied to specific calendar months, and their timing will vary from baseline period to reporting period. Any counterfactual model that assumes these actions will occur at the same time every year will introduce errors in the calculation of savings.
@steevschmidt The last point you make there is debatable. Behavioral changes are necessarily new responses to similar circumstances, so a counterfactual that tries to model those new responses would fail to accurately measure energy changes. If we ran pre- and post-period models (as I know you've suggested elsewhere), we could get around that, but we would need to carefully handle any occupancy or behavioral factors in those models. I'm not saying that months or times of year are a perfect proxy for expected equivalent behavior in the absence of treatment, but I don't think it's quite the flaw you suggest.
@EthanGoldman Thanks for the detailed suggestions. I can investigate the second option with some of the BuildingGenome data and have the results posted in a few days. Clustering on residuals from the non-seasonal model seems like a good idea in theory, but I'm worried that the clusters (or switch-points) might track non-routine events or other artifacts rather than seasonal effects. We'll see. I like the idea of using temperature to inform the clustering... that ties the model down a bit so we can be more certain that the bins are tracking seasonal changes rather than picking up some unexplained variance in the residuals.
I took the suggestion of testing “clustered” FE and ran some tests on a subset of the Building Data Genome dataset, including only dorm rooms. I figure dorms are the closest thing to residential buildings that the dataset offers.
Here is an interactive dashboard that details the findings: http://res-intel.info:3838/bg-caltrack/ (be patient w/ loading times)
A few things I noticed:
The fixed effects that use consumption/residual clusters perform the best on CVRMSE. However, the clusters don’t necessarily map onto seasonal changes (see the “Cluster” panel on the dashboard)—and I’m afraid they may just be overfitting non-routine fluctuations.
The monthly FE model fits the data closer (lower CVRMSE) than the monthly WLS. It actually looks like the ‘overfitting’ worry might run in the opposite direction—the monthly WLS models offer more conservative predictions (closer to the mean).
The clustering was performed on dynamic-time-warping (DTW) comparisons of daily load/residual/temperature paths. It's possible to use other distance metrics, such as Euclidean, but I think DTW does pretty well.
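A minimal sketch of this kind of pipeline: compute a DTW distance between every pair of daily 24-hour profiles (loads, residuals, or temperatures), then cut a hierarchical clustering into four groups. The hand-rolled DTW (fine for roughly 365 daily profiles) and the average-linkage choice are illustrative stand-ins, not necessarily what was used for the dashboard.

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

def dtw_distance(a, b):
    """Classic O(n*m) dynamic-time-warping distance between two 1-D series."""
    n, m = len(a), len(b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            cost[i, j] = d + min(cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
    return cost[n, m]

def cluster_days(daily_profiles, n_clusters=4):
    """daily_profiles: array of shape (n_days, 24)."""
    n = len(daily_profiles)
    dist = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            dist[i, j] = dist[j, i] = dtw_distance(daily_profiles[i], daily_profiles[j])
    # hierarchical clustering on the condensed distance matrix, cut into n_clusters groups
    labels = fcluster(linkage(squareform(dist), method="average"),
                      t=n_clusters, criterion="maxclust")
    return labels  # one cluster label per day, usable as a fixed-effect factor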
I don't know if that is overfitting, Andrew. Do you have this dorm's school schedule? I see some indication that the residuals-based model is fitting to periods of different occupancy. I see summer school, maybe some short term breaks and winter break, beginning and end of summer, and the standard school period...
True. I don't have any schedule info and maybe you are right that it is premature to conclude overfitting. I just noticed that the 4 clusters often didn't break perfectly into seasonal blocks... but they come close, so maybe that's not a big deal
Nice dashboard Andrew!
I personally don't think dorm rooms are a good proxy for homes, so I hope Hassan is able to run some analysis using residential data.
Also: @jkoliner and I discussed our prior comments in more detail offline and [I think] we agreed on the following:
Steve has largely represented our consensus correctly. To add on: for point 1, I think that the dorm data will give us overly optimistic out-of-sample results because dorm schedules are fixed. For homes, the timing of summer vacations, family visits, and other vacant periods may be less regular. A multi-year (or multi-site) baseline period might mitigate that substantially, but a single-year, single-site baseline with clustering will fit fixed effects to the irregularity. In general, large buildings should follow more predictable usage patterns and are better candidates for fixed effects.
Sharing some preliminary results and recommendations:
Dataset: One year of AMI data from a sample of approximately 500 residential buildings without any known energy efficiency interventions. A set of commercial buildings from the EVO Testing Portal was also used.
Test procedure: The default CalTRACK 2.0 hourly method with the three-month weighted baseline segmentation was applied to the dataset. A second set of models was also fit to the data, but using a single (unsegmented) baseline. In both cases, the baseline was split into blocks of training data (70%) and test/out-of-sample data (30%).
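One way the 70/30 block split could be implemented is sketched below; the week-level blocking, the column-free DatetimeIndex assumption, and the random seed are assumptions for illustration, and the actual procedure may differ.

import numpy as np

def block_split(hourly_df, train_frac=0.7, seed=0):
    """hourly_df: DataFrame with a DatetimeIndex covering the baseline year."""
    # assign whole calendar weeks to train/test so held-out hours form contiguous blocks
    weeks = hourly_df.index.isocalendar().week.astype(int)
    unique_weeks = np.asarray(weeks.unique())
    rng = np.random.default_rng(seed)
    n_train = int(round(train_frac * len(unique_weeks)))
    train_weeks = set(rng.choice(unique_weeks, size=n_train, replace=False))
    is_train = weeks.isin(train_weeks)
    return hourly_df[is_train.values], hourly_df[~is_train.values]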
Results:
Recommendations:
To apply the fixed-effects version, the code segment below (from https://github.com/openeemeter/eemeter/blob/master/eemeter/caltrack/hourly.py#L243) was modified and used with the "single" segmentation type instead of "three_month_weighted":
# Modified from eemeter/caltrack/hourly.py (see link above); the fixed-effects change is
# the added "+ C(month)" term in each returned formula. `np` is numpy, imported at the
# module level in hourly.py.
def _get_hourly_model_formula(data):
    # if the segment is entirely occupied or entirely unoccupied, the occupancy
    # interaction is degenerate, so use the temperature-bin terms directly
    if (np.sum(data.loc[data.weight > 0].occupancy) == 0) or (
        np.sum(data.loc[data.weight > 0].occupancy)
        == len(data.loc[data.weight > 0].occupancy)
    ):
        bin_occupancy_interactions = "".join(
            [" + {}".format(c) for c in data.columns if "bin" in c]
        )
        return "meter_value ~ C(hour_of_week) + C(month) - 1{}".format(
            bin_occupancy_interactions
        )
    else:
        bin_occupancy_interactions = "".join(
            [" + {}:C(occupancy)".format(c) for c in data.columns if "bin" in c]
        )
        return "meter_value ~ C(hour_of_week) + C(month) - 1 {}".format(
            bin_occupancy_interactions
        )
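For concreteness, on a segment where occupancy varies (the else branch) and with hypothetical temperature-bin columns named bin_0 through bin_6, the modified function would return a formula along the lines of meter_value ~ C(hour_of_week) + C(month) - 1 + bin_0:C(occupancy) + ... + bin_6:C(occupancy); that is, the only departure from the stock formula is the added C(month) term.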
[Figure: Out-of-sample comparison for a set of commercial buildings (EVO Testing Portal)]
[Figure: Out-of-sample CVRMSE comparison for a set of residential buildings]
[Figure: Average hourly bias in different months (out-of-sample results)]
Thanks Hassan. It looks like the CalTrack 2.0 model does pretty well on EVO-- it even outperforms the gradient boost model we posted!
A few initial comments/questions:
The disparity between the hourly-bias plots and the CVRMSE findings might be explained by the fact that one reports a normalized metric and the other does not. CVRMSE normalizes the error by average metered consumption, so each building/meter receives equal weight when you evaluate the median/average of this metric. The hourly plots, however, just take the average of the raw residuals, so buildings that consume more overall have a greater impact on the averages.
One way to resolve the disparity is to evaluate residuals as a percentage of actual consumption at each building-hour and then take the averages (something like the sketch after these comments). I think we should do that before concluding that the FE model is worse at predicting daily load shapes.
Is the "one-month" method fitting a flexible monthly intercept model or is it fitting a different model for each month?
I think my previous description of the WLS model may have been a bit inaccurate-- it was my understanding that the model applied 0.5 weights to ALL other months, but it looks like non-zero weights are actually only applied to the two neighboring months. Is that correct?
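For the normalization idea mentioned above, a minimal sketch; the column names (building_id, timestamp, actual, predicted) and the choice to divide by each building's average load (to avoid near-zero hours) are assumptions, not anything taken from the eemeter output format.

import pandas as pd

def monthly_pct_bias(df):
    """df columns: building_id, timestamp, actual, predicted."""
    df = df.copy()
    # express each residual as a percent of that building's average load
    denom = df.groupby("building_id")["actual"].transform("mean")
    df["pct_resid"] = 100 * (df["predicted"] - df["actual"]) / denom
    df["month"] = pd.to_datetime(df["timestamp"]).dt.month
    # average within each building-month first, then across buildings,
    # so larger consumers don't dominate the monthly bias figures
    per_building = df.groupby(["building_id", "month"])["pct_resid"].mean()
    return per_building.groupby(level="month").mean()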
The current recommendation is not to bring fixed effects to the Steering Committee at this time. If a member of the working group disagrees, they should log their dissent in a comment below.
Closing stale issue in preparation for new working group
Prerequisites
Article reference number in CalTRACK documentation (optional): [3.11, 3.7.5]
Description
I’d like to propose using month-of-year fixed effects to control for seasonal variation, which could substitute for the current method of using monthly regressions (see 3.11 & 3.7.5). The fixed-effects model is written as follows:

    meter_value(t) = f(temperature(t), hour_of_week(t)) + month_{m(t)} + error(t)

where month_{m(t)} is a fixed adjustment estimated for month m, and f is the usual hour-of-week and temperature-bin portion of the hourly model. A few clarifying points/questions:
Proposed test methodology
The test is conceptually straightforward: we ought to test whether the current monthly WLS model specified in the hourly methods section outperforms the proposed fixed-effects model. The metrics can be the following:
• R2 (within-sample)
• CVRMSE (within-sample)
• R2 (out-of-sample)
• CVRMSE (out-of-sample)
Ideally the tests would also adjust for the disparity in degrees of freedom, but I don't think that will be an issue with the out-of-sample tests.
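For reference, the metrics above can be computed with something like the following sketch (Python here for illustration, although my own test code is in R; n_params is only used for an optional degrees-of-freedom adjustment and can be left at 0 for out-of-sample evaluation).

import numpy as np

def r_squared(y, yhat):
    ss_res = np.sum((y - yhat) ** 2)
    ss_tot = np.sum((y - np.mean(y)) ** 2)
    return 1 - ss_res / ss_tot

def cv_rmse(y, yhat, n_params=0):
    # ASHRAE-style CVRMSE: RMSE (with optional df adjustment) over mean consumption
    rmse = np.sqrt(np.sum((y - yhat) ** 2) / (len(y) - n_params))
    return rmse / np.mean(y)

# Within-sample: evaluate on the training weeks; out-of-sample: evaluate the
# same fitted model's predictions on the held-out weeks.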
My own test: I’ve gone ahead and tested the models on my own using a public dataset containing one year of hourly metered data for a sample of 507 meters (from a college campus). The code, found here, can be replicated as-is on any computer with R installed (along with data.table, RCurl and ggplot2 packages). It runs on my PC in about 13 minutes (it’s faster if we demean the data rather than estimating the full set of coefficients—but estimating the full model makes the code more intuitive). The training (testing) sample in this exercise was the odd (even) weeks of the year. Here are the within-sample comparisons of the monthly WLS (WLS) and fixed-effects (FE) R-squared:
And out-of-sample:
These measurements do not adjust for model degrees of freedom, but they nevertheless strongly favor the fixed effects model. The CV measures (see code) similarly favor the FE model.
Acceptance Criteria
Since I have already tested the proposal, maybe the best way to move forward is to review the current tests and attempt replication. I submit that the method should be accepted if it is confirmed that the findings I’ve reported above are internally valid (e.g. no code error or model misrepresentation) and/or externally valid (the comparison is robust to alternative meter datasets). The same conditions should apply to a "corrected test" if the one above is found to be flawed.
(edit: just noticed the github code link isn't working-- also try this one: https://www.dropbox.com/s/1niqfdpecyf7ikf/methodCompare.html?dl=0)
Thank you for your consideration