Closed dwillis closed 10 years ago
Do they also provide countywide vote totals for the candidate? If so, creating a county-level result and putting the absentee and provisional votes in the vote_breakdowns field seems like the way to go. If not, I'll have to think about this one a bit more.
They don't provide countywide totals, no.
Ugh. Can you point me at an example file so I can get a better sense of this?
Sure thing - first line in the text file contained here: ftp://alt.ncsbe.gov/enrs/priprecinct11xx07xx2000.zip
@dwillis Got it. Thanks. I'll have to think about this a little bit. Lots of options, none of them great.
Off the top of my head:
- vote_breakdowns
- vote_type (or better name) field that is a code (i.e. from a defined set of choices) for what the votes field represents. Add documentation that says a null vote_type value means total votes for the candidate in that jurisdiction. Probably a good idea to also put the value in vote_breakdowns to reinforce what's going on here.

Yeah, I had considered the first option, but it seems like it obscures things.
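The vote_type idea above could be sketched roughly as follows. This is only an illustration under stated assumptions: `ResultSketch`, `VOTE_TYPES`, and the sample values are hypothetical and are not the project's actual models. It shows a coded vote_type where a null value means total votes, with the value mirrored into vote_breakdowns.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional

# Hypothetical set of allowed codes. None (null) means "total votes".
VOTE_TYPES = {"absentee", "provisional"}


@dataclass
class ResultSketch:
    """Illustrative stand-in for a result record, not the real model."""
    jurisdiction: str
    reporting_level: str
    candidate: str
    votes: int
    vote_type: Optional[str] = None
    vote_breakdowns: Dict[str, int] = field(default_factory=dict)

    def __post_init__(self):
        if self.vote_type is None:
            return  # votes is the total for the candidate in the jurisdiction
        if self.vote_type not in VOTE_TYPES:
            raise ValueError(f"unknown vote_type: {self.vote_type!r}")
        # Mirror the value in vote_breakdowns to reinforce what votes means.
        self.vote_breakdowns.setdefault(self.vote_type, self.votes)


# Made-up sample values, for illustration only.
absentee = ResultSketch("Example County", "county", "Example Candidate", 120,
                        vote_type="absentee")
total = ResultSketch("Example County", "county", "Example Candidate", 500)
```

A reader of the data can then tell a partial county row from a true total by checking whether vote_type is null.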
@ghing @dwillis What about deferring the loading of these data points until we're past the RawResult stage? Since we only have precinct-level data (and not county-level data except for these provis/absentee totals), it would make sense to only load the most granular level of results. Then, when working on the "normalized" or "clean" loader phase, you could simply add the provis/absentee totals to the vote total for rolled-up county-level results. This would likely be a transform step.
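A minimal sketch of such a roll-up, assuming raw rows are plain dicts with `county`, `precinct`, `candidate`, and `votes` keys (a hypothetical shape, not the project's RawResult model). The "ABSENTEE" and "PROV" precinct labels are the ones described for the North Carolina files; the function name and sample numbers are made up.

```python
from collections import defaultdict

# Pseudo-precinct labels that actually carry countywide totals.
SPECIAL_PRECINCTS = {"ABSENTEE": "absentee", "PROV": "provisional"}


def roll_up_county(raw_rows):
    """Sum rows per (county, candidate), tracking absentee/provisional votes."""
    totals = defaultdict(lambda: {"votes": 0, "vote_breakdowns": {}})
    for row in raw_rows:
        entry = totals[(row["county"], row["candidate"])]
        # Every row, regular precinct or not, counts toward the county total.
        entry["votes"] += row["votes"]
        kind = SPECIAL_PRECINCTS.get(row["precinct"])
        if kind is not None:
            breakdowns = entry["vote_breakdowns"]
            breakdowns[kind] = breakdowns.get(kind, 0) + row["votes"]
    return dict(totals)


# Made-up rows, for illustration only.
rows = [
    {"county": "EXAMPLE", "precinct": "01", "candidate": "A", "votes": 100},
    {"county": "EXAMPLE", "precinct": "02", "candidate": "A", "votes": 50},
    {"county": "EXAMPLE", "precinct": "ABSENTEE", "candidate": "A", "votes": 20},
    {"county": "EXAMPLE", "precinct": "PROV", "candidate": "A", "votes": 5},
]
county_totals = roll_up_county(rows)
```

Whether the input comes from RawResult records or from a re-read of the source file is exactly the open question in this thread; the sketch takes no position on that.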
I could see that, but wouldn't that involve re-loading the data in the transform step? Or is there another way we could defer the loading? Maybe we could load them and then deal with them in the transform as @zstumgoren suggests. I think I'd be ok with having them in RawResult, as they do mirror what's in the data, but we'd probably need some way to alert people to their existence.
@zstumgoren I think the problem is that there's not a place to stash these values so that they'll be available to later stages.
For this particular value, it is the most granular data, albeit at a different reporting level than other rows in the source data.
My understanding of the transform (the clean load), and how I implemented it for MD, is that it only operates from RawResults. I don't think we should change this.
The way I think about this is the general case of a row containing one and only one vote value, and that value doesn't represent something other than total votes for the candidate in that jurisdiction.
@ghing I think it's fine (and in fact desirable in this case) to use transforms in this way. It avoids us loading data points that might otherwise be confusing outside the context of rolled-up county-level results.
In terms of implementation, the transform would be straightforward:
@ghing Also, with respect to transforms, don't forget that these are intended to be flexible and don't always have to operate directly on a RawResult. In fact, when I first implemented these, there were no RawResult records! Even now, it would be normal to first migrate RawResult into a normalized set of models, and then apply a series of transforms on those "clean" or "normalized" records in a step-wise fashion. In theory, only the first transform would operate on RawResult records. This distinction gets blurred when numerous (or all) transforms are applied in a single pass, but it's fine to decouple transforms so that they're applied in a step-wise fashion. Anyhow, point is, I think we can be flexible in how we apply transforms.
That'll give us a little more wiggle-room to solve problems like this. That said, I'm open to alternative solutions in the case at hand.
@zstumgoren I agree that we can be flexible in how we apply transforms, and with Maryland I definitely load from RawResult and then clean the Result models further.
However, I think loading a file feels like a really separate process from transforming data, and I don't know if it's any less confusing than expanding our data model a little bit to capture records like these.
@ghing Sure, I'm not opposed to tweaking our models once again. While I have a clear sense of what the transform-based approach would look like, I'm less clear on the impacts of the model-update strategy.
It feels like it would be more involved and require back-porting previously loaded states, but perhaps we can minimize that effort. Also, we'd need to make sure that these records don't get baked out at the RawResult layer, since it would be confusing/inaccurate to offer partial county results at the raw stage.
You're more familiar with the bakery at this point, so let us know what complications, if any, you foresee with the ETL process and the potential impacts on other states.
@zstumgoren, I don't think there would be model backporting.
I had similar concerns about the bakery. I agree that we might not want to bake out these records. I'll look into what's involved with filtering things out of the baking process, though this is probably something that we want to support anyway as I imagine there will be other cases where we'll need to do this.
@ghing Cool. One other point we should clarify: What would the Result level look like in terms of rolling up county-level results from precinct? I'm guessing we'd:
- RawResult records and add their vote totals to the rolled-up records for counties

But we wouldn't migrate the RawResult records for prov/absentee to the clean Result layer, correct?
IIRC, Maryland precinct results don't include absentee/provisional results at all, but county-level totals do, so this is definitely something that will come up.
@dwillis Yep, this inconsistency is what had me thinking that rather than creating brand new records to represent this data point, we'd want to note prov/absentee totals, when available, as a subtotal on a given jurisdiction's RawResult or Result record. So if we have them for precincts, we note them on precinct-level records; for county, on the county-level records (whether or not we've rolled those up), etc.
But perhaps this doesn't work cleanly across the board. For example, VA has provisional/absentee counts at precinct-level as separate rows. So if we're trying to map closely to the source data, it would make sense to load those as separate RawResult records and then aggregate them up downstream. Otherwise, the RawResult loader would be performing somewhat of a transform step in the sense that it would have to sprinkle the absentee totals onto the appropriate precincts.
Perhaps we need consensus first on how to handle this situation, before deciding on the NC stuff?
Yeah, I think that's a good idea.
Just realized some Iowa results have this too. For example:
Attorney General | |||||||||
---|---|---|---|---|---|---|---|---|---|
TOM MILLER | OverVote | UnderVote | Scattering | ||||||
ABSENTEE PRECINCT | 524 | 0 | 0 | 0 | 0 | 0 | 0 | 216 | 0 |
PROVISIONAL PRECINCT | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
ADAIR COMMUNITY CENTRE | 369 | 0 | 0 | 0 | 0 | 0 | 0 | 109 | 4 |
STUART RECREATIONAL CENTER | 351 | 0 | 0 | 0 | 0 | 0 | 0 | 113 | 0 |
FONTANELLE COMMUNITY BUILDING | 371 | 0 | 0 | 0 | 0 | 0 | 0 | 121 | 0 |
ORIENT UNITED METHODIST CHURCH | 297 | 0 | 0 | 0 | 0 | 0 | 0 | 107 | 2 |
YMCA OF ADAIR COUNTY | 386 | 0 | 0 | 0 | 0 | 0 | 0 | 131 | 1 |
Totals | 2298 | 0 | 0 | 0 | 0 | 0 | 0 | 797 | 7 |
Although in this case, there is a county total, so it makes more sense to combine the first two and final row into one result.
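As a rough sketch of that combination, the snippet below folds the absentee, provisional, and Totals rows into a single county-level result. It assumes the first numeric column of the table holds TOM MILLER's votes (the extracted header does not make the column mapping certain), and the `county_result` helper and output shape are hypothetical.

```python
# (row label, first numeric column) pairs copied from the table above.
# Assumption: that column is TOM MILLER's vote count.
ROWS = [
    ("ABSENTEE PRECINCT", 524),
    ("PROVISIONAL PRECINCT", 0),
    ("ADAIR COMMUNITY CENTRE", 369),
    ("STUART RECREATIONAL CENTER", 351),
    ("FONTANELLE COMMUNITY BUILDING", 371),
    ("ORIENT UNITED METHODIST CHURCH", 297),
    ("YMCA OF ADAIR COUNTY", 386),
    ("Totals", 2298),
]

BREAKDOWN_LABELS = {
    "ABSENTEE PRECINCT": "absentee",
    "PROVISIONAL PRECINCT": "provisional",
}


def county_result(rows):
    """Build one county-level result from the Totals and pseudo-precinct rows."""
    by_label = dict(rows)
    total = by_label["Totals"]
    # Sanity check: the non-Totals rows should add up to the Totals row.
    if sum(votes for label, votes in rows if label != "Totals") != total:
        raise ValueError("rows do not sum to the Totals row")
    return {
        "reporting_level": "county",
        "votes": total,
        "vote_breakdowns": {
            name: by_label[label] for label, name in BREAKDOWN_LABELS.items()
        },
    }


combined = county_result(ROWS)
```

The regular precinct rows would still be loaded as precinct-level results; only the three special rows collapse into the county record.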
Closing this in favor of #211.
Some of the North Carolina files have precinct-level results and then, in the same file, county-wide absentee or provisional totals for a candidate. In those cases, the "precinct" is "ABSENTEE" or "PROV". Leaving those in place seems wrong. I thought of creating a county-level RawResult object, but I'm struggling with where to indicate that it represents absentee or provisional ballots and not regular votes. Any thoughts?