openelections / openelections-core

Core repo for election results data acquisition, transformation and output.
MIT License

How to handle county-level absentee/provisional results in precinct-level file? #209

Closed dwillis closed 10 years ago

dwillis commented 10 years ago

Some of the North Carolina files have precinct-level results and then, in the same file, county-wide absentee or provisional totals for a candidate. In those cases, the "precinct" is "ABSENTEE" or "PROV". Leaving those in place seems wrong. I thought of creating a county-level RawResult object, but I'm struggling with where to indicate that it represents absentee or provisional ballots and not regular votes. Any thoughts?

ghing commented 10 years ago

Do they also provide county wide vote totals for the candidate? If so, creating a county-level result and putting the absentee and provisional votes in the vote_breakdowns field seems like the way to go. If not, I'll have to think about this one a bit more.

dwillis commented 10 years ago

They don't provide countywide totals, no.

ghing commented 10 years ago

Ugh. Can you point me at an example file so I can get a better sense of this?

dwillis commented 10 years ago

Sure thing - first line in the text file contained here: ftp://alt.ncsbe.gov/enrs/priprecinct11xx07xx2000.zip

ghing commented 10 years ago

@dwillis Got it. Thanks. I'll have to think about this a little bit. Lots of options, none of them great.

Off the top of my head:

dwillis commented 10 years ago

Yeah, I had considered the first option, but it seems like it obscures things.

zstumgoren commented 10 years ago

@ghing @dwillis What about deferring the loading of these data points until we're past the RawResult stage? Since we only have precinct-level data (and not county-level data except for these provis/absentee totals), it would make sense to only load the most granular level of results. Then, when working on the "normalized" or "clean" loader phase, you could simply add the provis/absentee totals to the vote total for rolled-up county-level results. This would likely be a transform step.
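A minimal sketch of that roll-up, assuming rows are plain dicts with `county`, `precinct`, `candidate` and `votes` keys. That shape, the `roll_up_to_county` name and the `COUNTYWIDE_LABELS` set are illustrative assumptions, not the actual RawResult model or an existing transform in this repo. The labels come from the NC example above, and `vote_breakdowns` is the field mentioned earlier in the thread.

```python
from collections import defaultdict

# Pseudo-precinct labels that actually carry county-wide totals (from the NC files).
COUNTYWIDE_LABELS = {"ABSENTEE", "PROV"}


def roll_up_to_county(rows):
    """Sum precinct rows into county totals per candidate.

    Absentee/provisional pseudo-precinct rows are added to the county total
    and also recorded separately under 'vote_breakdowns'.
    """
    totals = defaultdict(lambda: {"votes": 0, "vote_breakdowns": {}})
    for row in rows:
        key = (row["county"], row["candidate"])
        totals[key]["votes"] += row["votes"]
        label = row["precinct"].strip().upper()
        if label in COUNTYWIDE_LABELS:
            breakdowns = totals[key]["vote_breakdowns"]
            breakdowns[label] = breakdowns.get(label, 0) + row["votes"]
    return dict(totals)
```

The point of the sketch is only that the absentee/provisional rows can contribute to the county total without being lost as a separate figure. Where those rows live before the transform runs is the open question in this thread.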

dwillis commented 10 years ago

I could see that, but wouldn't that involve re-loading the data in the transform step? Or is there another way we could defer the loading? Maybe we could load them and then deal with them in the transform as @zstumgoren suggests. I think I'd be ok with having them in RawResult, as they do mirror what's in the data, but we'd probably need some way to alert people to their existence.

ghing commented 10 years ago

@zstumgoren I think the problem is that there's not a place to stash these values so that they'll be available to later stages.

For this particular value, it is the most granular data, albeit at a different reporting level than other rows in the source data.

My understanding of the transform (the clean load), and how I implemented it for MD, is that it only operates from RawResults. I don't think we should change this.

The way I think about this is the general case of a row containing one and only one vote value, and that value doesn't represent something other than total votes for the candidate in that jurisdiction.

zstumgoren commented 10 years ago

@ghing I think it's fine (and in fact desirable in this case) to use transforms in this way. It avoids us loading data points that might otherwise be confusing outside the context of rolled-up county-level results.

In terms of implementation, the transform would be straightforward:

zstumgoren commented 10 years ago

@ghing Also, with respect to transforms, don't forget that these are intended to be flexible and don't always have to operate directly on a RawResult. In fact, when I first implemented these, there were no RawResult records! Even now, it would be normal to first migrate RawResult into a normalized set of models, and then apply a series of transforms on those "clean" or "normalized" records in a step-wise fashion. In theory, only the first transform would operate on RawResult records. This distinction gets blurred when numerous (or all) transforms are applied in a single pass, but it's fine to decouple transforms so that they're applied in a step-wise fashion. Anyhow, point is, I think we can be flexible in how we apply transforms.

That'll give us a little more wiggle-room to solve problems like this. That said, I'm open to alternative solutions in the case at hand.

ghing commented 10 years ago

@zstumgoren I agree that we can be flexible in how we apply transforms, and with Maryland I definitely load from RawResult and then clean the Result models further.

However, I think loading a file feels like a really separate process from transforming data, and I don't know if it's any less confusing than expanding our data model a little bit to capture records like these.

zstumgoren commented 10 years ago

@ghing Sure, I'm not opposed to tweaking our models once again. While I have a clear sense of what the transform-based approach would look like, I'm less clear on the impacts of the model-update strategy.

It feels like it would be more involved and require back-porting previously loaded states, but perhaps we can minimize that effort. Also, we'd need to make sure that these records don't get baked out at the RawResult layer, since it would be confusing/inaccurate to offer partial county results at the raw stage.

You're more familiar with the bakery at this point, so let us know what complications, if any, you foresee with the ETL process and the potential impacts on other states.

ghing commented 10 years ago

@zstumgoren, I don't think there would be model backporting.

I had similar concerns about the bakery. I agree that we might not want to bake out these records. I'll look into what's involved with filtering things out of the baking process, though this is probably something that we want to support anyway as I imagine there will be other cases where we'll need to do this.
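As a rough illustration of what filtering could look like, here is a sketch that drops the county-wide pseudo-precinct rows before output. It assumes rows are plain dicts with a `precinct` key, and the function names are hypothetical. It is not the bakery's actual API.

```python
# Pseudo-precinct labels seen in the NC files discussed in this issue.
COUNTYWIDE_LABELS = {"ABSENTEE", "PROV"}


def is_countywide_subtotal(row):
    """Return True if the row's 'precinct' is really a county-wide subtotal label."""
    label = (row.get("precinct") or "").strip().upper()
    return label in COUNTYWIDE_LABELS


def rows_to_bake(rows):
    """Keep only the rows that should be written out at the raw stage."""
    return [row for row in rows if not is_countywide_subtotal(row)]
```

A predicate like this could be swapped per state, which would fit the expectation that other cases will need similar filtering.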

zstumgoren commented 10 years ago

@ghing Cool. One other point we should clarify: What would the Result level look like in terms of rolling-up county-level results from precinct? I'm guessing we'd:

But we wouldn't migrate the RawResult records for prov/absentee to the clean Result layer, correct?

dwillis commented 10 years ago

IIRC, Maryland precinct results don't include absentee/provisional results at all, but county-level totals do, so this is definitely something that will come up.

zstumgoren commented 10 years ago

@dwillis Yep, this inconsistency is what had me thinking that rather than creating brand new records to represent this data point, we'd want to note prov/absentee totals, when available, as a subtotal on a given jurisdiction's RawResult or Result record. So if we have them for precincts, we note them on precinct-level records; for county, on the county-level records (whether or not we've rolled those up), etc.

But perhaps this doesn't work cleanly across the board. For example, VA has provisional/absentee counts at precinct-level as separate rows. So if we're trying to map closely to the source data, it would make sense to load those as separate RawResult records and then aggregate them up downstream. Otherwise, the RawResult loader would be performing somewhat of a transform step in the sense that it would have to sprinkle the absentee totals onto the appropriate precincts.

Perhaps we need consensus first on how to handle this situation, before deciding on the NC stuff?

dwillis commented 10 years ago

Yeah, I think that's a good idea.

ghing commented 10 years ago

Just realized some Iowa results have this too. For example:

Attorney General
TOM MILLER OverVote UnderVote Scattering
ABSENTEE PRECINCT 524 0 0 0 0 0 0 216 0
PROVISIONAL PRECINCT 0 0 0 0 0 0 0 0 0
ADAIR COMMUNITY CENTRE 369 0 0 0 0 0 0 109 4
STUART RECREATIONAL CENTER 351 0 0 0 0 0 0 113 0
FONTANELLE COMMUNITY BUILDING 371 0 0 0 0 0 0 121 0
ORIENT UNITED METHODIST CHURCH 297 0 0 0 0 0 0 107 2
YMCA OF ADAIR COUNTY 386 0 0 0 0 0 0 131 1
Totals 2298 0 0 0 0 0 0 797 7

Although in this case, there is a county total, so it makes more sense to combine the first two and final row into one result.
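A sketch of that combination, using the numbers from the sample above and assuming the first numeric column is Tom Miller's vote count (the header has fewer labels than the rows have columns, so that mapping is a guess). The dict keys and the `combine_county_rows` name are illustrative, not the real model fields.

```python
# (label, votes) pairs taken from the first numeric column of the sample above.
IOWA_ROWS = [
    ("ABSENTEE PRECINCT", 524),
    ("PROVISIONAL PRECINCT", 0),
    ("ADAIR COMMUNITY CENTRE", 369),
    ("STUART RECREATIONAL CENTER", 351),
    ("FONTANELLE COMMUNITY BUILDING", 371),
    ("ORIENT UNITED METHODIST CHURCH", 297),
    ("YMCA OF ADAIR COUNTY", 386),
    ("Totals", 2298),
]


def combine_county_rows(rows, candidate):
    """Fold the absentee, provisional and totals rows into one county-level result."""
    by_label = dict(rows)
    return {
        "reporting_level": "county",
        "candidate": candidate,
        "votes": by_label["Totals"],
        "vote_breakdowns": {
            "absentee": by_label["ABSENTEE PRECINCT"],
            "provisional": by_label["PROVISIONAL PRECINCT"],
        },
    }


county_result = combine_county_rows(IOWA_ROWS, "TOM MILLER")
```

The remaining five rows would still load as ordinary precinct-level results. In this sample the "Totals" row equals the sum of all seven rows above it, absentee and provisional included.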

ghing commented 10 years ago

Closing this in favor of #211.