Standardize precinct identifier format

nvkelso commented 6 years ago

Should generally be state FIPS (AA) & county FIPS (AAA) & precinct ID (AAAAAAA*).

Sometimes there is both a precinct name and an ID; perhaps we should include both variants? (Though extra columns inflate the DBF.)

migurski commented 6 years ago

Those precinct IDs come from the Census, but only in cases where a state participated in the 2010 VTD program, right?

nvkelso commented 6 years ago

We invent them for state and local sources. We should be more consistent there...

And there should be a crosswalk with other precinct data providers / sources.

migurski commented 6 years ago

For PlanScore, I’ve been assigning them artisanal integers. Works really well internally but not something I’ve exposed generally.

nvkelso commented 6 years ago

Please make them public!

nvkelso commented 6 years ago

In https://github.com/nvkelso/election-geodata/pull/146, all state IDs are now FIPS codes, and there's a common field format: 2 chars for state; 32 chars for county (which should be ssCCC, but some data comes as longer name strings and that's not normalized yet); and 255 chars for precinct (should be normalized, but it's in the same state as county).
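
A minimal sketch of checking one record against those widths and the ssCCC expectation. The field names `state`, `county`, and `precinct` are placeholders chosen for illustration and may not match the column names used in the PR:

```python
import re

# Maximum widths as described in the comment above; field names are assumed.
FIELD_WIDTHS = {"state": 2, "county": 32, "precinct": 255}
COUNTY_SSCCC = re.compile(r"\d{5}")  # two state digits + three county digits


def check_record(record: dict) -> list:
    """Return a list of human-readable problems found in one record."""
    problems = []
    for field, width in FIELD_WIDTHS.items():
        value = record.get(field) or ""
        if len(value) > width:
            problems.append(f"{field} longer than {width} chars")
    state = record.get("state") or ""
    county = record.get("county") or ""
    if not COUNTY_SSCCC.fullmatch(county):
        problems.append("county is not in ssCCC form")
    elif not county.startswith(state):
        problems.append("county does not start with state FIPS")
    return problems


print(check_record({"state": "06", "county": "06075", "precinct": "1101"}))
print(check_record({"state": "06", "county": "San Francisco", "precinct": "1101"}))
```

The second call is the "longer name string" case: it fits the 32-character width but is flagged as not yet normalized.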

sigpwned commented 6 years ago

First, let me say how thrilled I was when I came across this project. Because it contains precinct-level geodata for the whole country, I think it can be the hub for any GIS or map election data project. I know it's a great starting point for some work I plan to do!

Regarding precincts, I reviewed precinct labels from a number of states and I was disappointed to find that there is little shared rhyme or reason among them. Some use numeric codes; some use physical location names, like "city hall"; some use a combination of the two; others seem not to include labels at all. If the goal is to standardize precinct labels in a way more general than "uppercase, split on non-alphanumeric and join with single whitespace," this project will have to come up with its own novel naming scheme. I'm not sure there is a "right" answer, on its face.
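
For reference, the baseline "uppercase, split on non-alphanumeric and join with single whitespace" rule could look roughly like this. It is a sketch: the function name `normalize_label` is made up here, and treating only ASCII letters and digits as alphanumeric is an assumption:

```python
import re


def normalize_label(label: str) -> str:
    """Uppercase, split on runs of non-alphanumeric characters, join with single spaces."""
    tokens = re.split(r"[^0-9A-Za-z]+", label.upper())
    # Drop the empty tokens produced by leading or trailing separators.
    return " ".join(token for token in tokens if token)


print(normalize_label("  city-hall / Pct. 12 "))
```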

However, I think we can optimize the labeling for some common use cases. The work I plan to do involves joining this data set to other precinct-level data sets, e.g. data sets from here. I think a good way of standardizing the labels would be:

Look for other precinct-level data sets that are available
Study how they label their precincts
Choose a method that makes joining to as many different data sets as easy as possible

For example, let's say we find 10 such data sets. It's likely they'll all be at least a little different. But if we find that they all use place names to identify precincts, then we'd want to make sure to preserve place names when they're available in this data set. Because all the data sets will be different we won't find any scheme that's perfect, but we can at least find some objective measure for "better."
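
One possible objective measure is the share of our labels that find a partner in another data set under a given labeling scheme. The sketch below uses made-up labels and three hypothetical schemes purely for illustration; nothing here comes from real precinct data:

```python
def match_rate(ours, theirs, normalize):
    """Fraction of our distinct labels that also appear in the other data set."""
    ours_norm = {normalize(label) for label in ours}
    theirs_norm = {normalize(label) for label in theirs}
    if not ours_norm:
        return 0.0
    return len(ours_norm & theirs_norm) / len(ours_norm)


# Invented example labels, not taken from any real source.
ours = ["City Hall", "Ward 3 - Pct 2", "0014"]
theirs = ["CITY HALL", "WARD 3 PCT 2", "14"]

# Candidate schemes to compare (all hypothetical).
schemes = {
    "as-is": lambda s: s,
    "uppercase": str.upper,
    "strip-leading-zeros": lambda s: s.upper().lstrip("0") or "0",
}

for name, scheme in schemes.items():
    print(name, round(match_rate(ours, theirs, scheme), 2))
```

Running the same comparison against each external data set would give a per-scheme score, and the scheme with the highest overall match rate would be the "better" one in this sense.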

I also think that what @nvkelso said about data crosswalks is really important. If this data set is going to become a hub, then it needs to be as easy as possible for other people to pick up and use for their own purposes. To that point, I think that encouraging people to publish any crosswalks they create would be A Good Thing. (For example, when I do the join to the data sets linked above, I'll be happy to share a "join table" that maps this data set to those data sets.) Those joins make this data more useful; the joined data more useful; and any data that joins to either can now be mapped to both.
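
A "join table" of this kind could be as simple as a CSV keyed on (state_fips, county_fips, precinct_id). The sketch below assumes a hypothetical `other_id` column and invented rows; it only shows the shape of the idea:

```python
import csv
import io

# Hypothetical crosswalk file contents; the rows are invented for illustration.
CROSSWALK_CSV = """state_fips,county_fips,precinct_id,other_id
06,075,1101,SF-1101
06,075,1102,SF-1102
"""


def load_crosswalk(text):
    """Map (state_fips, county_fips, precinct_id) to the other data set's ID."""
    reader = csv.DictReader(io.StringIO(text))
    return {
        (row["state_fips"], row["county_fips"], row["precinct_id"]): row["other_id"]
        for row in reader
    }


def attach_other_id(records, crosswalk):
    """Return copies of the records with `other_id` added (None when unmatched)."""
    joined = []
    for rec in records:
        key = (rec["state_fips"], rec["county_fips"], rec["precinct_id"])
        joined.append({**rec, "other_id": crosswalk.get(key)})
    return joined


records = [
    {"state_fips": "06", "county_fips": "075", "precinct_id": "1101"},
    {"state_fips": "06", "county_fips": "075", "precinct_id": "9999"},
]
crosswalk = load_crosswalk(CROSSWALK_CSV)
for rec in attach_other_id(records, crosswalk):
    print(rec["precinct_id"], rec["other_id"])
```

Unmatched records keep `None`, which makes gaps in a crosswalk easy to count.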

Here are some data sets that I think it could be useful to review when trying to decide on a standard. I'm sure there are others, but hopefully these are a good start:

Just a couple of thoughts I had while working elsewhere in the data set, for whatever they're worth. Hopefully they make sense.

nvkelso commented 6 years ago

Hi @sigpwned, thanks for your kind words and thoughtful comments. I really like the idea of x-walk concordance "join" tables with other precinct datasets.

I've been wondering if this project should allow both precinct "identifier" and precinct "name" columns when both those are available in the upstream sources to make this a little easier.

sigpwned commented 6 years ago

> I've been wondering if this project should allow both precinct "identifier" and precinct "name" columns when both those are available in the upstream sources to make this a little easier.

That's an interesting idea! And it's knocked some ideas loose for me. Let me try to dump my brain while the thoughts are fresh.

Based on my understanding, the goal of this project is:

Every US precinct is represented by one record in the dataset with a unique (state_fips, county_fips, precinct_id) key.

Here are a few thoughts on getting there:

1. All records should now have state FIPS codes.
2. All records with counties should now have county FIPS codes, or will soon, per #135.
3. All records without counties should receive county labels soon, per #135.
4. There are duplicate (state_fips, county_fips, precinct_id) keys in the data set.
5. Some records have no precinct_id.

4 and 5 above are potentially significant issues.

Regarding 4, it's difficult to know if these "duplicate" rows represent one precinct with the region split into multiple geometries, or if the rows are actually mislabeled. The only way I can think of to make that determination is to compare this data to other precinct-level data. Once we know that:

If the rows represent one precinct, then I recommend we merge the duplicates into one row having the ST_Union of their respective geometries.
If the rows are mislabeled, then I recommend we change the precinct_id labels to make them unique, e.g. by appending A, B, C, and so on.
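
The relabeling option can be sketched as below. This covers only the "mislabeled" case; the merge case needs a geometry library or PostGIS and is not shown. The record layout and the function name `relabel_duplicates` are assumptions made for illustration:

```python
import string
from collections import defaultdict


def relabel_duplicates(records):
    """Append A, B, C, ... to precinct_id wherever a key occurs more than once."""
    groups = defaultdict(list)
    for rec in records:
        key = (rec["state_fips"], rec["county_fips"], rec["precinct_id"])
        groups[key].append(rec)

    relabeled = []
    for group in groups.values():
        if len(group) == 1:
            relabeled.append(dict(group[0]))
            continue
        if len(group) > len(string.ascii_uppercase):
            raise ValueError("more than 26 duplicates; a single letter is not enough")
        for suffix, rec in zip(string.ascii_uppercase, group):
            relabeled.append({**rec, "precinct_id": rec["precinct_id"] + suffix})
    return relabeled


rows = [
    {"state_fips": "06", "county_fips": "075", "precinct_id": "12"},
    {"state_fips": "06", "county_fips": "075", "precinct_id": "12"},
    {"state_fips": "06", "county_fips": "075", "precinct_id": "13"},
]
print([r["precinct_id"] for r in relabel_duplicates(rows)])
```

One caveat: a suffixed ID such as "12A" could collide with an ID that already exists upstream, so a real pass would need to check for that.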

Regarding 5, it's much like 4, except that all precincts should be treated as having the same label. Teasing these apart into "real" labels is going to be fairly manual work, unfortunately. We probably can't cheat by comparing to a "known good" precinct data set because if that data set existed, presumably we'd be using that instead of the data we have. At the very least, we should be able to use this map or one like it to do the assignments.

We're free to assign any IDs to updated rows we like. I think it would be wise to make those IDs look as much like other precinct ID labels as possible, but the reality is that new IDs are completely at our discretion. Any crosswalks we publish are essentially a relabel anyway, so users can substitute new labels if they wish.

Regarding keeping two precinct_id columns, I think it's a fine idea, but ultimately users will have to pick one column for any work they do. Fundamentally, it would be our first crosswalk, so we can publish that separately if we want to, or leave it integrated into the data set as a separate column. They're basically the same thing.

In any case, I think the plan of attack here should be to finish out #135 since we're close, and then generate a report sizing up 4 and 5 above, per state. We won't really know how much work this step will be until we have that report.
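
Such a report might be generated along these lines. This is a sketch under the assumption that records are available as dictionaries with `state_fips`, `county_fips`, and `precinct_id` keys; the sample rows are invented:

```python
from collections import Counter, defaultdict


def size_up(records):
    """Per state: count records with no precinct_id and records sharing a key."""
    key_counts = Counter(
        (r["state_fips"], r["county_fips"], r["precinct_id"])
        for r in records
        if r.get("precinct_id")
    )
    report = defaultdict(lambda: {"missing_id": 0, "in_duplicate_key": 0})
    for r in records:
        entry = report[r["state_fips"]]  # ensures every state appears in the report
        precinct_id = r.get("precinct_id")
        if not precinct_id:
            entry["missing_id"] += 1
        elif key_counts[(r["state_fips"], r["county_fips"], precinct_id)] > 1:
            entry["in_duplicate_key"] += 1
    return dict(report)


sample = [
    {"state_fips": "06", "county_fips": "075", "precinct_id": "1101"},
    {"state_fips": "06", "county_fips": "075", "precinct_id": "1101"},
    {"state_fips": "06", "county_fips": "075", "precinct_id": ""},
    {"state_fips": "36", "county_fips": "061", "precinct_id": "A1"},
]
for state, counts in sorted(size_up(sample).items()):
    print(state, counts)
```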

Just my two cents. How does that seem to everyone else?

sigpwned commented 5 years ago

Once we have #135 closed and the $state_fips$county_fips vs $county_fips format standardized, I see this issue as the next "big thing." Any thoughts on the above? With the benefit of more thought, I'm more confident that trying to standardize the precinct values is probably not useful, because they're so different.
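
For the $state_fips$county_fips vs $county_fips question, converting either form to the five-digit ssCCC form mentioned earlier in the thread could be sketched like this. The function name `to_ssccc` and the error handling are assumptions, not project code:

```python
def to_ssccc(state_fips: str, county_fips: str) -> str:
    """Return the 5-digit ssCCC form from either a 3-digit or a 5-digit county code."""
    state = state_fips.strip().zfill(2)
    county = county_fips.strip()
    if not (state.isdigit() and len(state) == 2):
        raise ValueError(f"bad state FIPS: {state_fips!r}")
    if not county.isdigit():
        raise ValueError(f"county FIPS is not numeric: {county_fips!r}")
    if len(county) == 5:
        # Already in $state_fips$county_fips form; just confirm the prefix agrees.
        if not county.startswith(state):
            raise ValueError(f"county {county!r} does not belong to state {state!r}")
        return county
    if len(county) <= 3:
        # Bare $county_fips form; pad to three digits and prepend the state.
        return state + county.zfill(3)
    raise ValueError(f"unexpected county FIPS length: {county_fips!r}")


print(to_ssccc("06", "075"))
print(to_ssccc("06", "06075"))
```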

Here's where I think we are:

[x] All records should now have state FIPS codes.
[x] All records with counties should now have county FIPS codes, or will soon, per #135.
[x] All records without counties should receive county labels soon, per #135.
[ ] There are duplicate (state_fips, county_fips, precinct_id) keys in the data set.
[ ] Some records have no precinct_id.

I think fixing the last two above is top priority. I'm not sure what the best way to approach that is, per the above, but I haven't thought too hard about it yet either.

Once this is handled in whatever way is deemed best, I think the next priority would be building crosswalks everywhere. How easy that is will probably depend on how we do this work.

Again, just my two cents. What does everyone else think?

nvkelso / election-geodata

Standardize precinct identifier format #144