Second pass at project summary stats

marcrecurve commented 7 years ago

I rearranged my implementation to try to get closer to Energy Savvy's data prep stats. This gets the "Number of unique project SAIDs" to match with Energy Savvy.

However, now all of the other stats are slightly different.

This makes me confident we're excluding the same projects. I believe the difference is that ES computed their stats before fully filtering the project dataset.

I included the pseudocode for my data prep process (data_clean_steps_oee.md).

houghb commented 7 years ago

We're computing all of our statistics on the final clean/filtered version of the projects table... I suspect the difference could be from this step in our data prep process (see data_cleaning_steps_energysavvy.md):

Check for duplicate SAIDs in xref tables:
- There are no duplicates in the gas table
- There are 10 duplicates in the electric table; 5 SAIDs, each of which corresponds to 2 SPIDs
- For each of these pairs of SPIDs, one of them doesn't correspond to 60 min usage data so remove that row (all 5 SAIDs remain in the table).

In that step we remove SPIDs from the cross reference table that correspond to 15 minute usage data instead of 60 minute usage data (this isn't indicated, we had to examine all the duplicates to see if they were in the 15 or 60 minute data sets). Those SPIDs are all for duplicate entries of SAIDs.

In your data cleaning procedure when you assign SPIDs to projects you chose to keep the first SPID for a duplicate SAID, which doesn't necessarily mean that we removed the same SPIDs from our cross reference tables. Perhaps this is the source of the different statistics?

marcrecurve commented 7 years ago

That could be possible @houghb I'll try to add that additional xref filtering step and see what shakes out.

Doesn't explain the difference in Total number of records. My count there is before any filtering at all. I did this one before filtering since it would be the same as # of unique SAIDs after filtering.

Pretty sure your zip codes aren't sorted either, which would explain why our Bottom 10 are different.

houghb commented 7 years ago

If I calculate the total number of records before doing any cleaning I get the same number as you (4358), but that number includes 466 duplicate electric SAIDs (as well as a couple hundred duplicate gas SAIDs), so it will not be the same as the number of unique SAIDs after filtering.

You're right on the zip code sorting, that was addressed in the issues after I had generated my statistics and I haven't updated it yet.

marcrecurve commented 7 years ago

I agree with what you just wrote, @houghb. I meant that on the fully cleaned project dataset the total number of records equals the number of unique SAIDs since duplicate IDs are merged as part of cleaning. Am I wrong somehow with that idea?

I'm pretty sure we're just defining "merged" and "cleaned" slightly differently wrt that particular summary stat. We end up at the same # of SAIDs eventually.

houghb commented 7 years ago

I see what you're saying - in my case the total number of records differs from the number of Electric SAIDs because my final project table includes both electric and gas accounts - the extra total number of records corresponds to the number of gas accounts for which there is no electric SAID. I generated our projects table this way (with both electric and gas) before Matt decided we could split them into two project tables, so you should not expect your statistic for that field to agree with ours.

I just updated our summary report with the zip codes sorted properly and they agree with yours.

marcrecurve commented 7 years ago

Nice, yeah, that could explain it. I ignore gas completely in these summary stats.

impactlab / caltrack

Second pass at project summary stats #21