What additional rules or clarifications are necessary for missing and anomalous data in the CalTRACK monthly specification?

matthewgee commented 7 years ago

Currently the data cleaning and preparation specification for monthly billing data and weather (full reference link to specification here), include rules dealing with cases of missing or anomalous data. Some of these rules are specific to the Beta test, others are generalizable.

missing unique IDs in one or more necessary records
missing data in hourly weather files
missing data in daily weather files
missing data in monthly gas and electric usage files
extreme values in usage files
extreme values in project files
miscoded values in project files
data sufficiency requirements for usage
data sufficiency requirements for project data
data sufficiency requirements for weather

houghb commented 7 years ago

I'm not sure how we answer this question (what additional rules are necessary) without making another attempt at having all of the beta testers clean the data themselves, but this time have more granular outputs to see where our results are diverging... The reason we decided to have a single group do the preprocessing/data cleaning was because we were not able to quickly identify the additional rules necessary using our previous approach.

matthewgee commented 7 years ago

@houghb There will almost certainly be additional data cleaning criteria that come out of the methods testing work we all do for the daily methods, but since we discovered a large proportion of the data cleaning criteria that we were discovering were necessary for the monthly methods were a result of our decision to use daily data for a monthly billing analysis, I think we should bookend data a core set of reasonable data cleaning criteria for monthly billing analysis based on prior experience, (i.e. what's missing from the above that we usually do as folks that do a lot of monthly billing analysis) and focus our attention on discovering novel but important cleaning criteria that are added to the above when dealing with daily data.

Does that make sense?

jbackusm commented 7 years ago

Not sure whether this might be a different issue altogether, but it seems to fall under the umbrella of "anomalous data": what should we do about obviously anomalous savings estimates? For example, negative savings that are larger in magnitude that the total pre-treatment usage, or positive savings that are much larger than what might be expected from the EE measures alone? Are there any loose criteria we should apply to protect the savings estimates from very large exogenous changes in usage between pre and post-treatment periods?

houghb commented 7 years ago

@matthewgee Honestly, that doesn't make a lot of sense to me... I understand the desire to wait and re-do the data cleaning when get get to daily methods, and at that time we will need to arrive at a set of methods that anyone can follow and get the same clean data. I also agree that new criteria might come up at that point, but any new criteria that come up will also need to be included (retroactively) in the monthly cleaning methods. What I have trouble with is signing off on any set of V1.0 data cleaning methods for monthly data that have not actually been tested and found to be sufficient. Once we publish V1.0 we are saying that we've published a set of methods for monthly analysis that we think work (even if they can still be improved). We know right now that the current data cleaning and integration methods (accounting for all the missing/anomalous data you outline above) for monthly do not work.

gkagnew commented 7 years ago

@houghb this all comes down to the question of what is good enough for the V1 stopgap. V1 already lacks VDD, comparison groups and daily data, not to mention, still to be determined questions regarding aggregation, dealing with negative savings, etc. DNV GL's concern all along has been that regardless of caveats attached to V.1, it will define some minimum standard, a default as future versions get stuck.

Our provisional support for putting out a V1 as currently specified is only in response to the apparent very real need for something in the short term. Your concern re data rules, with which I agree generally, adds to the list of concerns. My understanding is that an publicizing of V1 will have an extensive set of caveats attached. I assume one of those caveats will state that subsequent versions are required and will be sufficiently different that only limited value may be gained my investing effort into implementing V1.

The alternative to going ahead seems to be re-opening the beta-test process on monthly which no one seems to have the appetite for. What do you suggest as a way forward?

houghb commented 7 years ago

I agree with DNV GL that V1.0 is a minimum standard that is lower than the current standard best practices. My suggestion is either to leave data cleaning and integration out of the V1.0 specs (and state that it will have to come later), or to actually spend some additional time finalizing specs for V1.0 data cleaning and integration that work (which would push back the release date for V1.0, but would involve work that is going to be done anyway as part of daily methods, so doesn't actually add much additional burden). The current data cleaning and integration specs do not work, so publishing them - even with caveats - seems misleading. If there is a "very real need for something in the short term", that suggests the need is for something that works...

houghb commented 7 years ago

Just want to follow up and say that if everyone else disagrees with me on this I'm not going to fight it, but I have doubts that anyone is going to invest time in the near future in going back to make the monthly methods better once we have daily methods, so whatever we publish as V1.0 methods will be the standard for some time to come.

mcgeeyoung commented 7 years ago

@houghb Far from it. Having been the one trying to manage the data cleaning efforts from our end, I can say that your comments are quite apt. Would you be willing to submit a pull request with clarifying language in the spec? I'd be happy to take a look and add any additional insight from my POV. I'll bet we can reconcile most issues that way.

matthewgee commented 7 years ago

@houghb @gkagnew @jbackusm Great discussion. I'm going to move the negative savings discussion to a new issue, because it's an important one. You guys should feel free to create new issues as well when they are not captured well by any of the existing issues. That will help make sure all the concerns and considerations you raise in the discussion above in addition to improvements in data cleaning (which is the focus of this issue) get raised and discussed.

To be clear, I'm not interested in signing off on a v1 monthly standard that doesn't at least meet industry minimum standards for basic monthly gross saving analysis. I don't think anyone else her is either. The point of having these github issues is to identify specific things that need changing, double-check with other stakeholders that those changes make sense through discussion, and then put together and submit a pull request. I don't think we need to spend a lot of time violently agreeing that the current spec is insufficient and that changes need to be made. That's the reason the issue was created in the first place, so let's get specific, identify necessary changes, and make them as @mcgeeyoung suggests through a pull request.

That said, I want to make sure the scope on this issue is absolutely clear since there was some conflating in the above discussion:

v1 Data Cleaning and Integration Rules are for monthly billing data being used in monthly billing analysis. The data cleaning rules should actually be different for monthly data than daily data because monthly billing data is different than daily usage data and the additional data requirements for a monthly billing analysis are different than those of the daily analysis. The focus of this particular issue is getting the data cleaning specification for monthly analysis using monthly data to at least meet minimally acceptable criteria that meets industry standards for basic weather normalized savings estimation. We will then include additional cleaning criteria that are specific to daily data during the v1.1 comment and testing period. Right now, the spec is overinclusive in that includes some daily-specific requirements and I'd like to separate those out in my next pull request.
These should be living standards. The intent for having method versioning with open discussions and ongoing pull requests like what we are doing now is to avoid the very real concern that I share with you guys, which is that things get stuck on a set of methods that are insufficient for their purposes, data, context, or use case. CalTRACK v1 monthly methods should not be set in stone. They should be open to scrutiny, improvements, and updates as problems get notices and new contexts and use cases arise, and there should be future versions. But for v1, we should arrive at a specification that meets industry standards for basic weather-normalized monthly billing analysis, is defensible given the use cases and contexts, but still may have some known and unknown issues to work out or suggested refinements for testing and future versions.

jbackusm commented 7 years ago

@matthewgee Let's also be clear about which type of monthly data we are talking about. You mention monthly billing data, but in fact the monthly analysis we've done so far is AMI usage data rolled up to calendar months. Those two are actually quite different, mainly due to the fact that billing periods do not align with calendar months and tend to vary significantly in length--sometimes an account may get two meter reads in a month for various reasons, or it may go for a few months without a meter read. The inconsistency in billing period length has led us at EnergySavvy to use weighted least squares to fit degree-day models.

Additionally, billing data often comes with special considerations related to data cleaning, such as estimated reads, corrections, etc. Do we want to provide specs on how to address those issues? I would imagine that might be difficult, since those details tend to be different at each utility, essentially a product of their billing system.

houghb commented 7 years ago

@mcgeeyoung, I think it makes sense for someone to submit a pull request to correct the data cleaning and integration spec (however I don't think that should be me, since I did not actually do the data cleaning and integration - I believe this was @marcpare so it's probably best if he submits the pull request as he can be most accurate about what the current steps are). Even if someone submits that pull request it doesn't actually address my concern, which is that we will not have tested the methods outlined in the spec to see if they are reproducible across beta testers...

matthewgee commented 7 years ago

@jbackusm yes, I mean monthly billing data as the data we should have in mind when defining the monthly data cleaning spec. In the several weeks of data cleaning and integration discussion through the technical working group before the start of the beta test, we discuss many of the issues specific to monthly billing data and came to determinations on most of them which are included in the google doc specification. For example, for estimated values, we determined:

Monthly Usage

Missing values where the cumulative value is in the following period, the cumulative number of days between the two periods will be used to generate the UPD for that period
Missing usage values with no cumulative amount in the following period will be counted against data sufficiency requirements
Homes with Net Metering will be dropped from the analysis

Estimated values & deletion

Estimated usage data will be used for estimation, but estimation flags will be added to the post-estimation
Estimated data will enter the same way missing cumulative values do (see above) The issue that @houghb brings up is worth noting again, which is, although we were given monthly billing data as part of the beta test, we made the determination as beta testers to use the daily data for both monthly and daily methods testing since we thought the monthly was a means to an end

What I propose is that I would like to go ahead and make a pull request that reduces the complexity of the current monthly data cleaning spec (which is focused on cleaning daily data for monthly analysis), and instead have @marcpare submit a pull request based on his data preparation that will form the basis of the daily cleaning and integration spec in the v1.1 daily methods.

The issue that @houghb brings up is worth noting again, which is, according to the current schedule, we would not be including a multi-party test of these modified monthly data cleaning specifications before releasing the spec (although OEE will test and push results for the update monthly spec and other beta testers are welcome to). That said, because all of us have done enough billing analysis in enough different contexts, I don't think it's unreasonable to assume that we can develop a set of monthly methods, including the data cleaning and integration specification, that we can feel reasonably confident in without comprehensive multi-party testing on the monthly billing data that we've been given, but decided not to use in the original beta test. AMI data is new and different enough that the additional multi-party testing, I think, is both helpful and necessary.

I think there is a software development analog that might be useful to consider as we do these releases, and that is the notion of test coverage. We want to release the spec with some corresponding notion of the amount of the specification that has been "covered" by testing. The more tests of the specification by different folks, the better the coverage. We would be releasing v1 monthly methods with lower coverage initially than the daily methods will have upon release and we would simply want to make that visible to the user in the same way that TravisCI makes test coverage visible to any visitor to a github repo. Does that make sense?

mcgeeyoung commented 7 years ago

Alright, I took a whack at a pull request based on @marcpare's writeup of his data cleaning process. I removed any new references to the 15 minute and hourly data, which, from what I understand, we'll want to introduce in the V 1.1 process.

impactlab / caltrack

What additional rules or clarifications are necessary for missing and anomalous data in the CalTRACK monthly specification? #35