Follow-Up After 11/23 MTV Meeting

jgy4 commented 2 years ago

Hi @nickeubank @adrianefresh

We had a discussion after our meeting with you all and Vaughan to sort through priorities, tasks, and our final report. We'd love your feedback on these!

Tasks From Today (i.e. requests from Vaughan)

[x] Run the early voting data through the Google API to obtain centroid distances and travel times @PranavM98
[x] Send Vaughan a list of schools in AZ, TX, GA, NC, FL with polling places over a mile away (MSI, Public) @jgy4
[x] Add on/off campus differentiation and school counts to Tableau, send screenshots @dapoade

Tasks for Nov. 30th Meeting & Dec. 3rd Final Report

[ ] 2020 Election Day Data comparison (sjoin nearest) Safegraph vs. CPI vs. Ballot Ready @jgy4
[ ] 2020 Early Voting VIP vs. Ballot Ready comparison @jgy4
[x] Look into null/missing distances for 2020 Election Day and 2020 EV @PranavM98
[x] Look into null values for Institution Type @dapoade
[x] Schedule Meeting with +1 Team @dapoade
[x] Adding region labels to '30_campuses_w_dist_to_nearest_pp.geojson' @PranavM98
[x] Finding data/drafting an urban/rural differentiation @jgy4

Proposed Final Report Outline

Introduction
1. Context of our project
2. Purpose of our project
Data - For each section we will discuss the source, cleaning, issues/challenges, resolutions, and next steps.
1. Campus Polygons
2. Polling Places
3. Early Voting
Analysis - For each section we will discuss cuts by MSI Type, Public/Private, 2 Year/4 Year, State/Region, and Urban/Rural.
1. 2020 Early Voting
2. 2020 Election Day
Conclusion

nickeubank commented 2 years ago

GREAT! Comments in the morning.

nickeubank commented 2 years ago

As for report outline, I like it a lot.

As for centroid times, I wouldn't fixate on those. It's one of those things where we know we're gonna update our data, so there's no reason to invest time in running all that code until we've made the changes we know we're going to make to the input data.

I think the MOST important thing is pinning down the polling data we want to use. I was thinking about this last night, and I think we should follow up with Vaughan on just getting on the phone with the Ballot Ready data person to ask her take on which data to trust. My HOPE is we can just use theirs, but I think hearing it from the horse's mouth is going to be the best strategy.

Basically I don't trust Safegraph much -- they're in the business of generating really good estimates of things like foot traffic, but small errors aren't really a problem for that type of analysis in the way missing a couple polling places can be a problem for this type of analysis. Maybe they did do a good job, but we aren't sure, and our validation analyses haven't gotten us an answer yet.

I DO trust CPI, but I also know that they only have one person working on this and they don't have all 50 states.

Ballot Ready and VIP are organizations that are basically dedicated to collating and providing this data to people (e.g. I think VIP runs the Google and Facebook "where's your polling place" Election Day tools). Adriane and I have tried to approach them in the past to get their data and they've always said no, so getting it is really exciting, and we may just be able to use their data and not worry about anything else.

While we wait on that, I'd prioritize:

figure out the NAs and the bad states
integrate and merge the Ballot Ready data (pretty easy, given we have coordinates already)
get a feel for, and MAYBE integrate, the VIP data
write some analysis code -- e.g. code that generates the tables you had in your PowerPoint -- but SYSTEMATICALLY (e.g. code that creates a data frame that looks like that table, which we can always rerun to update results without human intervention).

That make sense?

nickeubank commented 2 years ago

Also re: urban rural differences:

As you do analysis, always write your code bearing in mind that we're gonna keep changing the data going into the analyses, so never "just analyze" the data -- write code to generate tables so we can just update the data being read in, run it, and get updated results!
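
To make the "write code that generates the table" idea concrete, here is a minimal sketch using only the standard library. The record fields (`institution_type`, `dist_m`) and the sample rows are made up for illustration and are not the project's actual column names or data; in the real pipeline a pandas groupby would likely play the same role.

```python
from collections import defaultdict
from statistics import mean, median

MILE_IN_METERS = 1609.344


def summarize_by(records, group_key, distance_key="dist_m"):
    """Return one summary row per group, sorted by group name."""
    groups = defaultdict(list)
    for rec in records:
        dist = rec.get(distance_key)
        if dist is None:  # leave missing distances out rather than guessing
            continue
        groups[rec[group_key]].append(dist)

    rows = []
    for name in sorted(groups):
        dists = groups[name]
        over_mile = sum(d > MILE_IN_METERS for d in dists)
        rows.append({
            group_key: name,
            "n_campuses": len(dists),
            "mean_dist_m": round(mean(dists), 1),
            "median_dist_m": round(median(dists), 1),
            "share_over_1_mile": round(over_mile / len(dists), 3),
        })
    return rows


# Hypothetical input; swap in the latest data and rerun to refresh the table.
campuses = [
    {"name": "A", "institution_type": "Public", "dist_m": 0.0},
    {"name": "B", "institution_type": "Public", "dist_m": 2000.0},
    {"name": "C", "institution_type": "Private", "dist_m": 500.0},
    {"name": "D", "institution_type": "Private", "dist_m": None},
]

table = summarize_by(campuses, "institution_type")
for row in table:
    print(row)
```

The same function can be called with a different `group_key` (for example a region or urban/rural field, if present) so every cut in the report comes from one rerunnable code path.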

PranavM98 commented 2 years ago

Regarding the missing values in the 2020 Early Voting distances: interestingly, the Google API was able to compute the distances between the college and polling places for the values that were missing. However, the distances by walking, car, and transit differ slightly. Wondering if we can impute the missing distances with the distances calculated by the Google API?
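
One way to do that kind of imputation while keeping track of where each value came from is sketched below. The field names (`dist_primary`, `dist_google`) and the rows are hypothetical, and the choice of which travel mode to use as the fallback is left open.

```python
def fill_missing_distance(primary, fallback):
    """Return (distance, source), preferring the primary measurement."""
    if primary is not None:
        return primary, "primary"
    if fallback is not None:
        return fallback, "fallback"
    return None, "missing"


# Hypothetical rows: dist_primary is the original distance,
# dist_google is a distance returned by the Google API.
rows = [
    {"campus": "A", "dist_primary": 1200.0, "dist_google": 1350.0},
    {"campus": "B", "dist_primary": None, "dist_google": 800.0},
    {"campus": "C", "dist_primary": None, "dist_google": None},
]

for row in rows:
    row["dist"], row["dist_source"] = fill_missing_distance(
        row["dist_primary"], row["dist_google"]
    )
    print(row["campus"], row["dist"], row["dist_source"])
```

Keeping a `dist_source` flag means the imputed rows can be dropped or reported separately later, which matters here because the two measurements are not the same quantity.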

PranavM98 commented 2 years ago

The Google distance API also has 56 missing values, and here is the list of missing values (by state):

The good part is that, except for NJ, the other states are in the "Other" region. We do not require these states/territories for our analysis.

dapoade commented 2 years ago

I'd been trying to do some digging with Pranav about the null values for early voting distances, but as you can see, Pranav was able to address most of it using the Google API. @nickeubank I know you prefer GeoPandas, so do you think it'd be worth continuing that analysis, and do you have any additional documentation for sjoin_nearest?

nickeubank commented 2 years ago

https://geopandas.org/en/stable/docs/reference/api/geopandas.GeoDataFrame.sjoin_nearest.html#geopandas.GeoDataFrame.sjoin_nearest

The walking distances are distances to the nearest polling place, where "nearest" was determined using sjoin_nearest, right? sjoin_nearest is identifying the nearest polling place; the Google API is just getting us a distance by travel modality rather than straight-line distance, right?
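
As a rough illustration of what "nearest by straight-line distance" means, here is a brute-force sketch in plain Python. It is not how sjoin_nearest is implemented (that operates on geometries and is documented at the link above); the coordinates, ids, and helper names below are made up, and it treats each campus as a single point.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_M = 6_371_000  # mean Earth radius in meters


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two lat/lon points."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = phi2 - phi1
    dlmb = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * asin(sqrt(a))


def nearest_polling_place(campus, polling_places):
    """Return (id, distance_m) of the closest polling place to a campus point."""
    def dist(pp):
        return haversine_m(campus["lat"], campus["lon"], pp["lat"], pp["lon"])

    best = min(polling_places, key=dist)
    return best["id"], dist(best)


# Hypothetical points for illustration only.
campus = {"name": "Example Campus", "lat": 36.00, "lon": -78.94}
polling_places = [
    {"id": "pp1", "lat": 36.01, "lon": -78.94},
    {"id": "pp2", "lat": 36.10, "lon": -78.94},
]

nearest_id, nearest_dist_m = nearest_polling_place(campus, polling_places)
print(nearest_id, round(nearest_dist_m))
```

The nearest-place match and the travel-mode distance are separate steps: the first picks the polling place, the second re-measures the distance to that same place.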

dapoade commented 2 years ago

Correct. But isn't the travel modality more digestible / of greater interest, because ultimately we are concerned with level of access, and straight-line distance might not be as representative?

nickeubank commented 2 years ago

If we believed that it was really precise, then yes. My concern is that it's giving an illusion of precision that is unrealistic -- measuring distances from the centroid of a campus / edge of a polygon is an inherently imprecise endeavor meant to estimate an approximate distance for the average student; pretending we're measuring walking times to the minute seems... a little contrived? Don't get me wrong, we should include them, but I think straight-line distance is a little more transparent, and crucially it's also a lot easier to deal with computationally, which makes life easier as we iterate on our data.

Put differently: I'm not sure that the difference between straight-line-distance and travel-modality distance is greater than the inherent uncertainty of what we're measuring.

nickeubank commented 2 years ago

(I also like "distance from polygon edge" as a metric more than from centroid -- hard to argue election administration officials should be doing more than putting a PP on every campus, which gets a "0" in distance from polygon, but non-zeros with a walking distance metric)
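
A small sketch of why the two metrics disagree for an on-campus polling place, using a made-up square campus in planar coordinates (meters). The helper functions are illustrative stand-ins written for this example, not the project's code.

```python
from math import hypot


def point_in_polygon(pt, poly):
    """Ray-casting test; poly is a list of (x, y) vertices."""
    x, y = pt
    inside = False
    n = len(poly)
    for i in range(n):
        x1, y1 = poly[i]
        x2, y2 = poly[(i + 1) % n]
        if (y1 > y) != (y2 > y):  # edge crosses the horizontal line through pt
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def dist_to_segment(pt, a, b):
    """Shortest distance from pt to the segment a-b."""
    px, py = pt
    ax, ay = a
    dx, dy = b[0] - ax, b[1] - ay
    seg_len_sq = dx * dx + dy * dy
    if seg_len_sq == 0:
        return hypot(px - ax, py - ay)
    t = max(0.0, min(1.0, ((px - ax) * dx + (py - ay) * dy) / seg_len_sq))
    return hypot(px - (ax + t * dx), py - (ay + t * dy))


def dist_to_polygon(pt, poly):
    """0 if pt is inside the polygon, else distance to the nearest edge."""
    if point_in_polygon(pt, poly):
        return 0.0
    n = len(poly)
    return min(dist_to_segment(pt, poly[i], poly[(i + 1) % n]) for i in range(n))


# Hypothetical 1 km x 1 km campus; its centroid is (500, 500).
campus = [(0, 0), (1000, 0), (1000, 1000), (0, 1000)]
centroid = (500, 500)
on_campus_pp = (900, 900)
off_campus_pp = (1300, 500)

for pp in (on_campus_pp, off_campus_pp):
    edge_d = dist_to_polygon(pp, campus)
    centroid_d = hypot(pp[0] - centroid[0], pp[1] - centroid[1])
    print(pp, round(edge_d, 1), round(centroid_d, 1))
```

Under these assumptions the on-campus polling place scores 0 from the polygon but several hundred meters from the centroid, which is the gap described above.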

nickeubank / mtv_viacom_capstone

Follow-Up After 11/23 MTV Meeting #65