nickeubank / mtv_viacom_capstone


New implementation of "closest" #29

Closed nickeubank closed 2 years ago

nickeubank commented 2 years ago

Closes #26

OK, first, to reiterate, @jgy4, your solution is really solid! It definitely does the job.

Also, y'all discovered that you can store shapefile geometries in csv files, which... I had no idea geopandas could pull polygons back out of a csv. In the future, though, I think it makes more sense to use GeoJSON or .shp files, since those also store the projection of the data, which you generally want embedded with the data so the two never get separated.

In this, I first converted your .ipynb to a .py (in VS Code, I exported to a Python script, then trimmed some of the cruft it leaves in). I think moving to a .py makes it easier to track changes.

Then I did some digging around. Your solution is great, but likely inefficient because it's not using spatial indices or making full use of vectorization, and it just seemed like there had to be a ready-built solution to this. And turns out, there is!

sjoin_nearest from geopandas! Looks like it's pretty new to the library, but does exactly what we want. (Geopandas used to be based on a library called shapely, but is migrating to pygeos. This came from that move). It does the "distance to closest" calculation for all colleges in... 1.1 seconds. :) Yeah... anything you have to write by hand like this in Python will always be crushingly slow compared to something a CS major wrote in C...

(Note you'll be told you have to install pygeos for it to run if you don't have it yet.)
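For anyone who hasn't used it, here is a minimal sketch of what a `sjoin_nearest` call looks like. This is not the code from the commit: the two layers, their column names, and the coordinates are hypothetical, and the projected CRS is only there so the distances come out in planar units.

```python
import geopandas as gpd
from shapely.geometry import Point

# Hypothetical "colleges" layer (made-up coordinates in a projected CRS).
colleges = gpd.GeoDataFrame(
    {"college": ["A", "B"]},
    geometry=[Point(0, 0), Point(10, 0)],
    crs="EPSG:3857",
)

# Hypothetical layer of candidate sites to measure distance to.
sites = gpd.GeoDataFrame(
    {"site": ["x", "y", "z"]},
    geometry=[Point(1, 0), Point(9, 1), Point(50, 50)],
    crs="EPSG:3857",
)

# For every college, attach the nearest site and record the distance to it.
# distance_col names the new column that holds the computed distance.
nearest = gpd.sjoin_nearest(colleges, sites, how="left", distance_col="distance")

print(nearest[["college", "site", "distance"]])
```

Both layers need to share a CRS, and distances are computed in that CRS's units, so a projected CRS is the safer choice than raw latitude/longitude.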

nickeubank commented 2 years ago

To see just how I modified the version of your file I exported to .py, you can just click on my second commit, or this link: https://github.com/nickeubank/mtv_viacom_capstone/pull/29/commits/d2f918fb9618bc1986c42c22b80e04731d4eeb9b

jgy4 commented 2 years ago

Thank you so much for looking into this Nick! This is fantastic!