Precinct names - Githubissues

karpet commented 7 years ago

Precinct names are a mess.

There's official census.gov names from 2012 via https://github.com/nvkelso/election-geodata/blob/master/data/20-kansas/statewide/2012/kansas-state-voting-precincts-2012.geojson but those do not correlate for names as reported via county clerk data in openelections.

I think we need a combination of:

automatic mapping based on some common heuristics
manual mapping, maybe with a precinct_aliases lookup table.

karpet commented 7 years ago

Census names are in https://github.com/statedemocrats/ks-votes/blob/master/db/kansas-2012-precinct-names.csv.gz

karpet commented 7 years ago

Here's an example for Sheridan county.

ksvotes=# select name from precincts where county_id =90 order by name;
         name         
----------------------
 Adell
 Adell Township
 Bloomfield
 Bloomfield Township
 Bow Creek Township
 Bowcreek
 East Kenneth
 East Saline
 Hoxie - East
 Hoxie - West
 Kenneth - East
 Kenneth - West
 Logan
 Logan Township
 Parnell
 Parnell Township
 Prairie Dog
 Prairie Dog Township
 Saline, East
 Saline, West
 Sheridan
 Sheridan Township
 Solomon
 Solomon Township
 Springbrook
 Springbrook Township
 Union
 Union Township
 Valley
 Valley Township
 West Kenneth
 West Saline
(32 rows)

ksvotes=# select name from census_precincts where county_id = 90 order by name;
         name         
----------------------
 Adell Township
 Bloomfield Township
 Bow Creek Township
 East Kenneth
 East Saline
 Logan Township
 Parnell Township
 Prairie Dog Township
 Sheridan Township
 Solomon Township
 Springbrook Township
 Union Township
 Valley Township
 West Kenneth
 West Saline
(15 rows)

mikegaughan commented 7 years ago

Okay I think I'm tracking. Sheridan county data in General Election Tally 11-08-16 has "Adell" but the Census precincts list with FIPS code as "Adell Township" and your solution is to build a table that makes those equal in the eyes of the database.

mikegaughan commented 7 years ago

Combined Precinct Data.xlsx

This is a file we did last cycle to get Votebuilder precincts matched up with 2012 election results, and votebuilder precincts are derived from the state voter file precinct fields.

mikegaughan commented 7 years ago

Sheridan County is even more cloudy in the voter file: Precinct AD3, Precinct BL4, etc

mikegaughan commented 7 years ago

Kenneth Township, for example, is both the Township and the city of Hoxie within. Votes are only reported at township level, but people are registered in city of Hoxie for, presumabley, local elections which those who live outside of the city limits (but still in Kenneth Twp) cannot participate in. And then it's divided between East and West for further fun.

karpet commented 7 years ago

Yes, exactly.

I called and talked to the Sheridan county clerk this afternoon, just to clarify for myself what I was seeing in the spreadsheet they distribute vs what's in the SOS spreadsheet(s). Hoxie (East|West) is local elections only, but gets reported in OpenElections for state leg races as well. It depends on where/how OpenElections is parsing its data.

I have constructed the data model now to accommodate this by having a CensusPrecinct map to one or more Precinct records, so that votes are tabulated (summed) according to census tracts ("official" SOS precincts) even if they are reported at "unofficial" levels like Hoxie. This gives us a more apples-to-apples comparison with census and voter file data.

Thanks for the combined precinct data xls - I will use that to help build a mapping from the voter file as well.

I want to standardize on a single authoritative name for each census tract, so I'm going to keep using the census tract name and mapping everything to that from voter file, election results, etc.

karpet commented 7 years ago

I've started manually mapping names that require it, here: https://github.com/statedemocrats/ks-votes/blob/master/db/douglas-county-precincts-2016.csv

Generally, I'm using the 2010 Census name as authoritative, since the Census names work for 4 out of 10 years, and then in the intervening years counties may re-parcel based on their own logic.

karpet commented 7 years ago

I've spent several days on this now and want to document the working solution.

When the database is bootstrapped, before any results are entered, the 2012 (2010) Census Tracts are used to seed the Precincts table. So the baseline is one CensusTract == one Precinct.

Then we immediately start to add aliases, and precinct-to-censustract mappings, based on the changes counties made after 2012.

There are 2 tables for allowing for the mess that is precinct names:

The PrecinctAliases table is used to map alternate names for the same precinct. So for example, Wyandotte County Kansas City Ward 1 Precinct 01 might be reported in an election result as KC 1-1. The KC 1-1 is added as a PrecinctAlias. Another example is where a Kansas precinct was subdivided to allow for the wonky Legislative boundaries. So, we alias KC 9-16 to Kansas City Ward 11 Precinct 13 S5 (via CensusTract 120080).
The CensusPrecincts table is used to create many-to-many relationship between precincts and census tracts. For example, if in 2016 a single precinct was carved into multiple precincts, so that each new precinct relates to a single census tract, CensusPrecinctrecords are created for each relationship. A more complex example would be if 2 (or more) CensusTract records were split into more than 2 Precincts so that a single Precinct might be split across multiple tracts, then CensusPrecincts can be created to reflect the multiple relationships.

While PrecinctAlias records help normalize results to a single precinct known by multiple names, the CensusPrecinct mostly just flags a set of results as "weird" so that we know that comparing the results across multiple years/elections is probably not an apples-for-apples comparison. Particularly for urban areas with rapidly changing development/population, comparison across multiple years will require manual inspection and some finesse.

I'm currently focused on Shawnee, Sedgwick, Douglas, Johnson and Wyandotte counties for mapping tracts to precincts, since those are the most populous counties and have had the most boundary churn.

karpet commented 7 years ago

Sedgwick county has proven the most troublesome in terms of name re-use and correlation. I have made several different approaches to detecting matches and still have about 80 or so from 2012 that I cannot determine w/o more manual work. The master file is here: https://github.com/statedemocrats/ks-votes/blob/master/db/sedgwick-county-precincts-2016.csv and the rows that need attention are those with empty 2nd and 3rd columns.

karpet commented 7 years ago

Having looked primarily at 2012 and 2016 before this week, I started parsing 2014 results this week. These are the most fraught because we do not have quality SOS sources for 2014, but are instead relying on what openelections has (which seems to be mostly from counties directly).

So 2014 is a mess across the board.

I have started using a couple of Ruby gems (libraries) called fuzzy_match and fuzzy_tools to start applying some natural language processing techniques to the precinct names. Between the 2 of them I have managed to cut the number of "orphans" (precinct names with no known corresponding census tract) by 50% for 2014. That still leaves hundreds of names to manually curate though, so I may need to figure out some way to crowdsource that, given how few hours I have to spend on this. As before, I am starting with the 5 most populous counties first.

Here's an example from Sedgwick county: W 15 P 02 & W15 P04 -> Topeka Ward 15 Precinct 02

karpet commented 7 years ago

I was able to find federal results for 2014 from the SOS site and use those to at least map VTD and precinct names for that year, which helps somewhat.

Shawnee and Sedgwick counties continue to be the most problematic in terms of name re-use since the census. I've managed to zap most of the Shawnee issues. Will turn my attention to Sedgwick et al next week.

Also this week started on some reporting for all this data, so we can start to visualize some of the results on the map. This will take some design thinking since there is so much data.

karpet commented 7 years ago

screen shot 2017-10-30 at 5 30 17 pm An example of the results visualization I got done today.

statedemocrats / ks-votes

Precinct names #5