nickeubank / mtv_viacom_capstone


Merging SafeGraph and CPI 2020 Polling Data #27

Closed jgy4 closed 2 years ago

jgy4 commented 3 years ago

Will need to:

jgy4 commented 3 years ago

Looking into merging these, there were a few key takeaways:

The table below, though not ideal, is a snapshot of how I've been thinking about things. For each state it shows the count of inner joined polling places, the count of polling places from the CPI data, the count of polling places from the SafeGraph data, and what percentage of polling places for each data source are represented by the inner joined polling places.

In other words, the SafeGraph % column is (Inner Count / SG Count) and the CPI % column is (Inner Count / CPI Count).
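To make the two percentage columns concrete, here is a minimal sketch in plain Python. The state label and counts are made-up placeholders, not values from the table, and `coverage` is a hypothetical helper name.

```python
def coverage(inner_count, sg_count, cpi_count):
    """Return (SafeGraph %, CPI %) for one state.

    Each percentage is the share of that source's polling places
    that survived the inner join.
    """
    sg_pct = 100 * inner_count / sg_count if sg_count else float("nan")
    cpi_pct = 100 * inner_count / cpi_count if cpi_count else float("nan")
    return sg_pct, cpi_pct


# Placeholder counts for illustration only (not real data).
counts = {"XX": {"inner": 50, "sg": 100, "cpi": 200}}

for state, c in counts.items():
    sg_pct, cpi_pct = coverage(c["inner"], c["sg"], c["cpi"])
    print(f"{state}: SafeGraph {sg_pct:.1f}%, CPI {cpi_pct:.1f}%")
```

A low value in one column and a high value in the other would suggest that one source is close to a subset of the other for that state.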

I'm very open to thoughts on this! And look forward to discussing more at our next meeting.

[Image: mytable]

nickeubank commented 2 years ago

This is great!

Merging on lat/long tends to be problematic as (a) different geocoding services may return different latitude and longitude for the same address, and (b) latitudes and longitudes are floats, so even if geocoding services were giving BASICALLY the same location, the floating point representations may differ in small ways that don't matter to humans but do matter to computers.

Let me think more on best merging strategy...
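A small illustration of point (b), using only the standard library. The coordinates are invented for the example; the point is just that an exact-key lookup (which is what a direct merge does) misses a value that differs far below any meaningful distance.

```python
import math

# Two representations of "the same" latitude, differing by 1e-12 degrees
# (far smaller than any real-world distance that matters).
lat_a = 35.9940329
lat_b = lat_a + 1e-12
lon = -78.898619

# A direct merge behaves like an exact dictionary lookup on the key.
sg_index = {(lat_a, lon): "SG polling place"}
exact_hit = (lat_b, lon) in sg_index

# A tolerance-based comparison treats them as the same location.
near = math.isclose(lat_a, lat_b, abs_tol=1e-9)

print(exact_hit)  # exact equality fails
print(near)       # tolerance comparison succeeds
```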

adrianefresh commented 2 years ago

Just want to say that this is terrific. Exactly what we needed to see/do in order to get a better sense of the data. Excellent work.

@nickeubank : do you think adding a buffer to the spatial join would give us a sense of whether we're dealing with small errors in diff lat/lon (and float precision)?


nickeubank commented 2 years ago

I think she's doing a direct merge, not spatial.

@jgy4 You could round the lat-long to, like, the first or second decimal and see how you do merging them that way. Will still be imperfect, but should give us a slightly better sense of overlap.
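A rough sketch of the rounding idea in plain Python, assuming each dataset is a list of dicts with `lat`, `lon`, and `name` fields (hypothetical field names and invented coordinates). With pandas, the equivalent would be rounding the two coordinate columns and merging on them.

```python
def rounded_key(lat, lon, decimals):
    """Join key: coordinates rounded to a fixed number of decimals."""
    return (round(lat, decimals), round(lon, decimals))


def merge_on_rounded(sg_rows, cpi_rows, decimals):
    """Inner join of two lists of dicts on rounded (lat, lon)."""
    index = {}
    for row in sg_rows:
        key = rounded_key(row["lat"], row["lon"], decimals)
        index.setdefault(key, []).append(row)

    matches = []
    for row in cpi_rows:
        key = rounded_key(row["lat"], row["lon"], decimals)
        for sg_row in index.get(key, []):
            matches.append((sg_row["name"], row["name"]))
    return matches


# Invented coordinates for illustration.
sg = [{"name": "SG A", "lat": 35.99401, "lon": -78.89862}]
cpi = [
    {"name": "CPI A", "lat": 35.99403, "lon": -78.89859},
    {"name": "CPI B", "lat": 36.10412, "lon": -79.00231},
]

coarse = merge_on_rounded(sg, cpi, decimals=2)  # nearby points share a key
fine = merge_on_rounded(sg, cpi, decimals=5)    # keys differ, no match
```

One reason this stays imperfect: two points that are very close but fall on opposite sides of a rounding boundary (say 35.9949 and 35.9951 at two decimals) still get different keys.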

jgy4 commented 2 years ago

Thank you both! I'll definitely try the merge with some rounding.

It is a direct merge. I did come across some documentation about spatial merges, but it seemed tailored to finding "points within a spatial region" or other overlapping areas. Happy to look into that more as well!

nickeubank commented 2 years ago

The way you'd do what @adrianefresh is suggesting -- which, come to think of it, may be a better choice -- is to first buffer one set of points to convert them from points to circles (polygons). https://geopandas.org/docs/reference/api/geopandas.GeoSeries.buffer.html?highlight=buffer#geopandas.GeoSeries.buffer

Then you end up with one set of polygons (centered on your original points, but now larger) you can spatial merge with the other set of points. If two points were close (even if not identical), then the buffered circle will intersect the point.
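Buffering a point by a radius and asking whether another point falls inside the circle is the same question as asking whether the two points are within that radius of each other. The sketch below uses that equivalence as a dependency-free stand-in for the GeoPandas approach (`GeoSeries.buffer` followed by a spatial join); it is not the GeoPandas code itself. The field names, coordinates, and the haversine distance on raw lat/long are assumptions made for illustration.

```python
import math

EARTH_RADIUS_M = 6_371_000  # mean Earth radius, an approximation


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two lat/long points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * EARTH_RADIUS_M * math.asin(min(1.0, math.sqrt(a)))


def buffer_match(sg_rows, cpi_rows, radius_m):
    """Pair each CPI point with every SG point within radius_m of it."""
    pairs = []
    for c in cpi_rows:
        for s in sg_rows:
            d = haversine_m(s["lat"], s["lon"], c["lat"], c["lon"])
            if d <= radius_m:
                pairs.append((s["name"], c["name"]))
    return pairs


# Invented coordinates for illustration.
sg = [{"name": "SG A", "lat": 35.9940, "lon": -78.8986}]
cpi = [
    {"name": "CPI A", "lat": 35.9941, "lon": -78.8987},  # roughly 14 m away
    {"name": "CPI B", "lat": 36.1000, "lon": -79.0000},  # many km away
]

within_50m = buffer_match(sg, cpi, radius_m=50)
within_1m = buffer_match(sg, cpi, radius_m=1)
```

The nested loop is quadratic in the number of rows, so for full national datasets a spatial join backed by a spatial index is the more practical route; this version is only meant to show the logic.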

jgy4 commented 2 years ago

So I implemented both and got similar results!

For the decimal place rounding - I varied the number of decimal places from 2 to 20. For the buffering I varied the input from 0.001 to 0.2.

With decimal place at the optimal point there were 40318 matches, 991 CPI overlaps (non-unique polling places), and 2848 SG overlaps.

With the smallest buffer input (0.001) there were 40215 matches, 990 CPI overlaps, and 2846 SG overlaps.

I think this means ~40,000 is fairly reliable matching - an improvement from 17,000! I wasn't able to get anything that prevented overlaps altogether. Just as a reminder, SafeGraph has 48,519 polling places and CPI has 66,831 polling places total.

I'm thinking about going with the buffer merge, and an outer join to catch as many polling places as possible? Also open to thoughts! Thanks again!

nickeubank commented 2 years ago

Great work @jgy4 !

> For the buffering I varied the input from 0.001 to 0.2.

What were the units? 0.2... meters? kilometers? Or were you working in lat-long at the time?

jgy4 commented 2 years ago

I believe the units are the same as the units of the geometry, which in this case is the Equidistant Conic Projection of lat/long. I went smaller and started at 1e-10 and was finally able to see some variation in the number of matches!
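Since the buffer distance is interpreted in the units of the geometry's coordinate system, the same number can mean very different things. As a rough sketch of what a value would mean if the coordinates were still in degrees (an assumption for illustration only, since the geometry here is described as projected), using the common approximation of about 111.32 km per degree of latitude:

```python
import math

METERS_PER_DEGREE_LAT = 111_320  # approximate; varies slightly with latitude


def degrees_to_meters(deg, latitude):
    """Approximate ground size of `deg` degrees at a given latitude.

    Returns (north_south_m, east_west_m). East-west distance shrinks
    with the cosine of the latitude.
    """
    north_south = deg * METERS_PER_DEGREE_LAT
    east_west = north_south * math.cos(math.radians(latitude))
    return north_south, east_west


for value in (0.001, 0.2):
    ns, ew = degrees_to_meters(value, latitude=36.0)
    print(f"{value} deg is about {ns:,.0f} m north-south, {ew:,.0f} m east-west")
```

In a projected coordinate system whose unit is the meter, by contrast, 0.001 would be a millimeter, so it is worth confirming the units before interpreting the buffer sweep.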

[Image: Screen Shot 2021-10-26 at 9 05 26 PM]

nickeubank commented 2 years ago

Great! Could you point me to the file with the buffer merge? Would love to poke around.

jgy4 commented 2 years ago

Yes! Just uploaded it here: #41

jgy4 commented 2 years ago

I updated the merge file so that it tracks the number of matches, the number of rows left unmatched for the CPI data, and the number of rows left unmatched for the SG data here: #41
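For readers following along, the bookkeeping described here (matches plus unmatched rows on each side) can be sketched in plain Python as below. This is not the code in #41; `match_report` and the join keys are hypothetical. In pandas, an outer merge with `indicator=True` gives the same three-way split through the `_merge` column.

```python
from collections import Counter


def match_report(sg_keys, cpi_keys):
    """Count matched pairs and unmatched rows for two lists of join keys.

    Keys may repeat; a key appearing m times in SG and n times in CPI
    contributes m * n matched pairs, as in a regular join.
    """
    sg_counts, cpi_counts = Counter(sg_keys), Counter(cpi_keys)
    shared = sg_counts.keys() & cpi_counts.keys()
    return {
        "matched_pairs": sum(sg_counts[k] * cpi_counts[k] for k in shared),
        "sg_unmatched_rows": sum(
            n for k, n in sg_counts.items() if k not in shared
        ),
        "cpi_unmatched_rows": sum(
            n for k, n in cpi_counts.items() if k not in shared
        ),
    }


# Toy keys standing in for rounded coordinates or buffer-join results.
report = match_report(["a", "b", "c"], ["b", "c", "d", "e"])
print(report)
```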

I varied the buffer value from 0 to 200 meters this time, and kept the decimal values the same. The .py file returns six images still, 3 for each matching method. I include two from the buffer method below.

It's looking like CPI is closer to fully containing the SG data. CPI retains over 20,000 polling places that aren't merged, but SG only retains a few thousand. @nickeubank Do you think this means going with CPI as the ground truth might be best?

[Image: Screen Shot 2021-10-28 at 9 00 44 PM]

nickeubank commented 2 years ago

This is great, thanks @jgy4 ! This is exactly what I wanted. Now it's my job to play more with these potential mismatches to understand them better. I think the next question is whether these are all in the same states (e.g. SG has Ohio, CPI does not), or whether we're failing to merge within the same localities. The latter is obviously much more problematic...
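One way to separate those two cases, sketched in plain Python under the assumption that each row carries an `id` and a `state` field (hypothetical names, toy data): first list states present in one source but absent from the other, then tally the remaining unmatched rows by state.

```python
from collections import Counter


def states_missing_from(rows_a, rows_b):
    """States that appear in rows_a but never in rows_b."""
    return sorted({r["state"] for r in rows_a} - {r["state"] for r in rows_b})


def unmatched_by_state(rows, matched_ids):
    """Tally rows whose id is not in matched_ids, grouped by state."""
    return Counter(r["state"] for r in rows if r["id"] not in matched_ids)


# Toy rows for illustration only.
sg = [
    {"id": 1, "state": "NC"},
    {"id": 2, "state": "NC"},
    {"id": 3, "state": "OH"},
]
cpi = [{"id": "a", "state": "NC"}, {"id": "b", "state": "NC"}]

missing = states_missing_from(sg, cpi)       # whole-state gaps
tally = unmatched_by_state(sg, matched_ids={1})  # per-state unmatched rows
```

Unmatched rows in a state that both sources cover point to the within-locality problem rather than a coverage gap.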

nickeubank commented 2 years ago

Oh, I guess what you did above is still relatively valid. So these are "within locality" differences. That definitely makes me uncomfortable.

So I think the next thing to do is literally pick a case and look at the two datasets. Start with NC, and in particular Durham County, since we know it. See how they compare by literally looking at the polling places they list. @jgy4 would you like to take a swing at that next? I need to do a detailed review of your other code before we worry too much about working on closest distance stuff, and Pranav and Dapo are working on the demographics/polygon merge.

nickeubank commented 2 years ago

Think about this as detective work, not "Data Science" -- literally pull out some rows. See where they differ. Can you understand how they differ? If they differ for, say, Duke campus, how do they differ? Can you figure out which one is right?

jgy4 commented 2 years ago

I can definitely take a swing at this, looking for differences in a familiar county. Thanks, Nick!