suggested improvement to daily roll-up method

impactlab / caltrack

Shared repository for documentation and testing of CalTRACK methods

http://docs.caltrack.org

suggested improvement to daily roll-up method #62

Closed houghb closed 7 years ago

houghb commented 7 years ago

Right now the data prep spec directs users to roll up AMI data (of any sub-daily frequency) to daily totals by summing every value provided for that day. We are not imposing any data sufficiency checks (to make sure there are a certain number of hours with data on a given day).

Since we aren't checking for data sufficiency, I suggest that we report the daily total as the average hourly usage for that day * 24. When no data are missing, this exactly equals the sum of the usage, but when some hours are missing, it gives more stable and realistic day-over-day values.
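
To illustrate the proposal, here is a minimal sketch of the roll-up, assuming hourly reads with `None` marking a missing hour. The function name `daily_total` and the sample data are hypothetical and not part of the spec.

```python
from statistics import mean


def daily_total(hourly_usage):
    """Estimate a day's total as (mean of the available hourly reads) * 24.

    `hourly_usage` holds one day's reads; None marks a missing hour.
    Returns None when the day has no reads at all.
    """
    observed = [v for v in hourly_usage if v is not None]
    if not observed:
        return None
    return mean(observed) * 24


full_day = [1.0] * 24                 # no missing hours
gappy_day = [1.0] * 18 + [None] * 6   # six hours missing

# With complete data, the estimate matches the plain sum.
print(daily_total(full_day), sum(full_day))

# With gaps, the plain sum undercounts while the estimate stays comparable.
plain_sum = sum(v for v in gappy_day if v is not None)
print(daily_total(gappy_day), plain_sum)
```

For sub-hourly data the same idea would apply, with the multiplier set to the number of intervals expected in a day instead of 24.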

matthewgee commented 7 years ago

@houghb I agree with this roll-up definition for daily totals from hourly or sub-hourly reads. +1

That said, I think we should also add a data sufficiency criterion: a minimum proportion of a day's intervals that must have data for that day's rolled-up total to be included in the usage dataset.

As a starting point, how about 50%?
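
A minimal sketch of what such a check could look like, assuming hourly reads with `None` for missing hours and assuming that exactly 50% coverage passes (the thread does not say whether the cutoff is inclusive). The name `is_sufficient` and its parameters are hypothetical.

```python
def is_sufficient(reads, expected_intervals=24, min_fraction=0.5):
    """Return True if enough of the day's expected intervals have data.

    `reads` holds one day's values; None marks a missing interval.
    Assumption: coverage equal to `min_fraction` counts as sufficient.
    """
    observed = sum(1 for v in reads if v is not None)
    return observed / expected_intervals >= min_fraction


twelve_hours = [1.0] * 12 + [None] * 12   # 50% coverage
eleven_hours = [1.0] * 11 + [None] * 13   # just under 50%

print(is_sufficient(twelve_hours))
print(is_sufficient(eleven_hours))
```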

This is a little tricky because, with a strict cutoff and no requirement on how the missing intervals are distributed, there are cases that make more or less sense to throw out or include. For example, I would be more inclined to say that a daily usage total computed from average hourly use is reasonable to include when every other hour is missing than when only the first 12 hours of the day are available and the second half is missing.

Perhaps we can do a quick check of "typical" missing hour patterns in the test data to know whether missing values are sequential or somewhat random.
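
One possible way to run that check, sketched under the assumption of hourly reads with `None` for missing hours: measure the longest contiguous run of missing hours in each day. The helper `longest_missing_run` is hypothetical. Two days with the same 50% coverage look very different by this measure.

```python
def longest_missing_run(reads):
    """Return the length of the longest run of consecutive missing reads."""
    longest = current = 0
    for value in reads:
        if value is None:
            current += 1
            longest = max(longest, current)
        else:
            current = 0
    return longest


# Every other hour missing: 12 reads, scattered gaps.
alternating = [1.0 if hour % 2 == 0 else None for hour in range(24)]

# Second half of the day missing: 12 reads, one long gap.
half_day = [1.0] * 12 + [None] * 12

print(longest_missing_run(alternating))
print(longest_missing_run(half_day))
```

Summarizing this value across the test data would show whether long blocks or scattered gaps dominate.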

houghb commented 7 years ago

I am all for adding data sufficiency criteria and 50% sounds like a fine starting point.

In an informal exploration of the CalTRACK data a couple of months ago (not the 1000 home subset), I noticed that for some weather stations there are contiguous blocks of missing hours (often from about midnight to sometime in the morning), but I don't know how often this is the case (versus hours missing more or less at random).

I think we should assess after the phone meeting tomorrow whether it is worth spending time on that check of missing hour patterns, since we are getting close to the deadline to finish these specs. Updating the roll-up definition and adding a 50% data sufficiency requirement is already a big improvement over our existing spec, so it might be acceptable to leave it there and move on with other necessary work.

houghb commented 7 years ago

As mentioned on the call today, when checking for data sufficiency as @matthewgee suggested, I propose we make it explicit that an electricity value of 0.0 should count as missing (and so count against data sufficiency). For gas, we should allow values of 0.0.
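
A minimal sketch of that rule, assuming reads are floats with `None` for missing intervals and that the fuel type is passed as the string `"electricity"` or `"gas"`. The function name `count_valid_reads` and the fuel labels are hypothetical.

```python
def count_valid_reads(reads, fuel):
    """Count the reads that should count toward data sufficiency.

    For electricity, 0.0 is treated as missing; for gas, 0.0 is a valid read.
    None is always treated as missing.
    """
    if fuel == "electricity":
        return sum(1 for v in reads if v is not None and v != 0.0)
    return sum(1 for v in reads if v is not None)


day = [0.0] * 6 + [1.2] * 16 + [None] * 2

print(count_valid_reads(day, "electricity"))
print(count_valid_reads(day, "gas"))
```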

houghb commented 7 years ago

I've updated the data prep spec with this improvement, so I am closing this issue.