impactlab / caltrack

Shared repository for documentation and testing of CalTRACK methods
http://docs.caltrack.org

Proposal: separate the method specifications of what we've actually done from recommended data cleaning methods for monthly data #40

Closed. houghb closed this issue 7 years ago.

houghb commented 7 years ago

Something confusing has come up recently in other threads (Issue #35, Pull Request #38): the decision to publish a set of data cleaning methods for monthly billing data. This is confusing because, as beta testers, we have not spent any time or effort looking at or cleaning monthly billing data. Instead, our approach has been to roll hourly AMI data up into calendar months and use that for our monthly analysis. The existing Pull Request #38 tries to arrive at some kind of mashup of the steps we actually took rolling up hourly data and the steps we think might be necessary for monthly billing data.
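For concreteness, the roll-up we actually did can be sketched in a few lines. This is an illustrative sketch only (the function and variable names are mine, not from any spec), assuming hourly AMI readings arrive as (timestamp, kWh) pairs:

```python
from collections import defaultdict
from datetime import datetime

def rollup_hourly_to_calendar_months(readings):
    """Sum hourly AMI readings into calendar-month totals.

    `readings` is an iterable of (timestamp, kwh) pairs, where each
    timestamp is a datetime. Returns {(year, month): total_kwh}.
    """
    monthly = defaultdict(float)
    for ts, kwh in readings:
        monthly[(ts.year, ts.month)] += kwh
    return dict(monthly)

# Example: two hourly readings in January, one in February.
readings = [
    (datetime(2017, 1, 31, 22), 0.5),
    (datetime(2017, 1, 31, 23), 1.5),
    (datetime(2017, 2, 1, 0), 2.0),
]
totals = rollup_hourly_to_calendar_months(readings)
# {(2017, 1): 2.0, (2017, 2): 2.0}
```

The point is that this step is trivial for hourly AMI data, which is exactly why it tells us nothing about cleaning real monthly billing data.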

There are a host of new, and as yet unexplored, challenges that might arise when cleaning and using actual monthly billing data, and the current plan is to rely on our prior experience working with this type of data to come up with a "reasonable" set of method specifications for handling it. There is no plan to test the proposed methods among the different beta testers. If any future user actually wants to use real monthly billing data, we will need to significantly revise the current monthly analysis specifications (for example, the specs currently rely heavily on all the data being aggregated by calendar month, even though it is uncommon in monthly billing data for usage to be grouped by calendar month).

For the reasons above, it's not clear to me why we're proposing to publish specifications for cleaning monthly billing data: we haven't actually done it, and even if you were to clean that type of usage data, you couldn't use it with our existing monthly analysis specifications without making a bunch of undocumented adjustments. The main argument for doing this seems to be that we want to get something published by the end of this month, and we don't have the time for beta testers to test and compare...

I propose that we separate the actual work that has been done from the guidance on using monthly billing data. Concretely, this would mean one set of method specs (data cleaning and monthly analysis) for the work we've done so far (rolling hourly AMI usage data up to calendar months and using this for modeling savings), and a separate set of guidance documents with our recommendations on how someone might do this using monthly billing data instead (making it clear that we haven't actually tested those recommended steps). Note that the first set of method specs would be our existing specifications (with only minor updates if anyone finds errors or omissions).

mcgeeyoung commented 7 years ago

This is great thinking, @houghb, and it parallels my concerns as well. Let me take a crack at breaking the spec up a bit to reflect this distinction, and I'll share it as a PR.

houghb commented 7 years ago

@mcgeeyoung Based on the Google doc, it sounds like we've decided to do this separation. If we're doing it for data cleaning, it seems like it also needs to be done for monthly analysis. The monthly analysis specs need a lot of revisions to work with monthly billing data that doesn't correspond to calendar months (for example, we should consider using weighted least squares instead of OLS, data sufficiency needs to be properly re-defined, the matching of weather and usage data needs to be revisited, and the whole spec needs to be updated so that it operates on variable date ranges rather than relying on calendar months at every step). A user might be able to read our existing spec and come up with a way to do the analysis using monthly billing data, but they would be making tons of independent assumptions and decisions about how to do so that don't meet Leif's requirement for a reproducible set of methods...
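To illustrate the weighted-least-squares idea: with variable-length billing periods, one reasonable weighting is by the number of days each period covers. This is only a sketch of that idea, not anything from the spec; the closed-form two-parameter fit and all names here are my own illustration:

```python
def weighted_linear_fit(x, y, w):
    """Closed-form weighted least squares for y ~ a + b*x.

    Observation i (e.g. one billing period) is weighted by w[i],
    such as the number of days in that period. Returns (a, b).
    """
    sw = sum(w)
    swx = sum(wi * xi for wi, xi in zip(w, x))
    swy = sum(wi * yi for wi, yi in zip(w, y))
    swxx = sum(wi * xi * xi for wi, xi in zip(w, x))
    swxy = sum(wi * xi * yi for wi, xi, yi in zip(w, x, y))
    slope = (sw * swxy - swx * swy) / (sw * swxx - swx * swx)
    intercept = (swy - slope * swx) / sw
    return intercept, slope

# One point per billing period: average daily use vs. heating degree
# days per day, weighted by the length of each period in days.
hdd_per_day = [0.0, 5.0, 10.0]
use_per_day = [10.0, 20.0, 30.0]
period_days = [33, 29, 31]
a, b = weighted_linear_fit(hdd_per_day, use_per_day, period_days)
# The toy data are exactly linear, so the fit recovers a=10, b=2.
```

In practice one would use a library implementation (e.g. a WLS routine) rather than hand-rolled normal equations, but the weighting idea is the same.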

tplagge commented 7 years ago

One way to handle this that would be straightforward would be to stick with calendar months as the unit of analysis, in order to keep the billing data pipeline as similar as possible to the AMI data pipeline.

Let's say we have the following billing data (read date: usage):

- 1/12: 100 kWh
- 2/12: 200 kWh
- 3/12: 100 kWh

Start by discarding the first reading, since without knowing the start date of its billing period, we can't really use it. Then divide the 200 kWh equally among the 31 days from 1/12 to 2/11, and the 100 kWh among the 28 days from 2/12 to 3/11. Since 11 of February's 28 days fall in the first period and the other 17 fall in the second, the usage per day for the calendar month of February is (200/31) × (11/28) + (100/28) × (17/28).
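That arithmetic generalizes to a small proration routine. A sketch, assuming each reading covers the days from the previous read date up to (but not including) its own read date, with illustrative names:

```python
from collections import defaultdict
from datetime import date, timedelta

def prorate_to_calendar_months(reads):
    """Spread each billing period's usage evenly over its days,
    then sum by calendar month.

    `reads` is a list of (read_date, kwh) pairs sorted by date. Each
    reading covers the days from the previous read date up to (but not
    including) its own read date, so the first reading is dropped
    because its start date is unknown. Returns {(year, month): kwh}.
    """
    monthly = defaultdict(float)
    for (prev_date, _), (read_date, kwh) in zip(reads, reads[1:]):
        per_day = kwh / (read_date - prev_date).days
        day = prev_date
        while day < read_date:
            monthly[(day.year, day.month)] += per_day
            day += timedelta(days=1)
    return dict(monthly)

# The example above, placed in a non-leap year.
reads = [(date(2017, 1, 12), 100.0),
         (date(2017, 2, 12), 200.0),
         (date(2017, 3, 12), 100.0)]
totals = prorate_to_calendar_months(reads)
feb_per_day = totals[(2017, 2)] / 28
# feb_per_day matches (200/31)*(11/28) + (100/28)*(17/28)
```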

houghb commented 7 years ago

This might work, but it seems like the implementation details get tricky when some read periods are 61 days and others are 17. Some months in the usage table will then appear to be empty until you analyze the number of days between read dates and calculate some approximation for the value attributable to the empty month. What about situations where the number of days between adjacent read dates doesn't correspond to the number of billed days? What time period do you attribute use to then? The order in which steps are taken to convert usage data to calendar months becomes crucial here, and do we determine data sufficiency before or after converting to calendar months? I'm not suggesting that we can't come up with a spec for this, but I think it is a lot more complicated than changing some wording in our existing spec...
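The read-date vs. billed-days mismatch, at least, is easy to detect even if the spec has to decide what to do about it. A hypothetical check (names and tolerance are mine, not from the spec):

```python
from datetime import date

def flag_inconsistent_periods(reads, tolerance_days=1):
    """Flag readings whose billed day count disagrees with the gap
    between adjacent read dates.

    `reads` is a list of (read_date, billed_days, kwh) tuples sorted by
    read date. Returns indices into `reads` where billed_days differs
    from the read-date gap by more than `tolerance_days`.
    """
    flagged = []
    for i in range(1, len(reads)):
        gap = (reads[i][0] - reads[i - 1][0]).days
        if abs(gap - reads[i][1]) > tolerance_days:
            flagged.append(i)
    return flagged

# The third reading claims 17 billed days but sits 28 days after the
# previous read date, so it gets flagged.
reads = [(date(2017, 1, 12), 31, 100.0),
         (date(2017, 2, 12), 31, 200.0),
         (date(2017, 3, 12), 17, 100.0)]
bad = flag_inconsistent_periods(reads)
```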

tplagge commented 7 years ago

I agree, there are definitely subtleties; maybe "straightforward" was an overstatement. The biggest advantage in my mind is that it isolates "billing vs. AMI data" in a preprocessing step and lets the model fitting and prediction steps remain the same for all input data. That seems desirable to me, so long as it can be done without overcomplicating things or making unjustifiable assumptions.

houghb commented 7 years ago

Not sure I agree that we want the model fitting and prediction steps to be the same for all input data. We've already discussed that we plan significant revisions to the modeling steps for daily methods (using variable degree days, different data sufficiency checks for weather and usage, etc.).

Modeling with monthly billing data already involves a lot of approximations; making more approximations to force it to fit calendar months just because our current approach was never intended to use billing data doesn't seem to meet Matt's requirement to "at least meet industry minimum standards for basic monthly gross saving analysis".

If we go this route we will essentially be publishing three sets of specs when this process is done in the summer:

1. Usable specs for daily modeling with hourly AMI usage data
2. Proposed, but untested and likely incomplete, specs for monthly modeling with monthly billing data
3. Usable specs for monthly modeling with hourly AMI usage data (mentioned in passing through footnotes in (2))

matthewgee commented 7 years ago

Okay, so I just did a major overhaul of the monthly data preparation spec working to resolve the above issue. Here are a few things to note in the revision:

  1. Separated suggested data preparation for monthly billing data from tested data preparation of daily usage data. The first thing I tried to do was take Blake's and McGee's work separating suggestions from tested cleaning procedures to its logical conclusion. We are releasing a monthly spec first, so the data prep spec for v1 should be strictly focused on preparing data for monthly billing analysis. We should assume that anyone who needs to employ monthly billing analysis for the supported CalTRACK use cases is doing it because they don't have access to daily data. As a result, all the daily-data-specific requirements and suggestions that came out of our testing don't belong in the v1 data prep doc and can instead be moved to the daily methods data prep spec. This of course raises the issue @houghb brings up, which is that monthly data cleaning was never tested as part of the beta. I've added this as the leading caveat at the top of the page. That said, I think these are reasonable data preparation guidelines for monthly billing analysis that, even though untested in this setting, draw on our collective experience doing billing analysis and are largely noncontroversial, even if some are admittedly arbitrary.

  2. Provided an ordering. A key finding from the beta test was that data preparation steps are order-dependent, so the data prep spec should include a suggested ordering. This meant a fundamental restructuring of the document, but I think it actually flows much more logically now. Please double-check that this ordering makes sense.

  3. Resolved some unresolved issues. There were a few of these. Notably: the skipped-reading-month problem is dealt with in the cumulative billing period guidance; data sufficiency for the billing period deals only with daily average weather (no imputation); the DASMMD extreme value rules were removed because they no longer apply; and more detailed PV and EV exclusion rules were provided. Double-check that these resolutions make sense and are in keeping with general industry guidance on data preparation for monthly billing analysis and the added policy considerations of the CalTRACK use cases.
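As one concrete reading of the weather sufficiency rule, a period could be kept only if enough of its days have a daily average temperature on record, with no imputation for the rest. This is a sketch of that interpretation only; the 90% threshold and all names are illustrative, not from the spec:

```python
from datetime import date, timedelta

def period_has_sufficient_weather(start, end, daily_temps,
                                  min_fraction=0.9):
    """Return True if the billing period [start, end) has daily average
    temperature readings for at least `min_fraction` of its days.

    `daily_temps` maps date -> average temperature; missing days are
    simply absent (no imputation). The threshold is illustrative.
    """
    n_days = (end - start).days
    covered = sum(1 for i in range(n_days)
                  if start + timedelta(days=i) in daily_temps)
    return covered / n_days >= min_fraction

# A 31-day billing period with 30 days of weather: 30/31 coverage
# passes a 90% threshold but would fail a 99% one.
temps = {date(2017, 1, 12) + timedelta(days=i): 40.0 for i in range(30)}
ok = period_has_sufficient_weather(date(2017, 1, 12),
                                   date(2017, 2, 12), temps)
```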

houghb commented 7 years ago

@matthewgee I think the document is looking a lot better, thanks for your overhaul. Are you planning to do something similar for the monthly analysis document?