jeremysze / LPIS

Repository of code used to analyze LPIs in NYC
0 stars 0 forks source link

Nearest neighbor distance and ID #1

Closed jhconning closed 5 years ago

jhconning commented 5 years ago

Apropo your Nearest Neighbor intersection to LPIS.ipynb

1) small observation: To run correctly the notebook needs to have:

from shapely.ops import nearest_points

but that's missing..

2) Comparing nearest neighbor methods.

As you pointed out this looks slow via python. From a quick read it seems that it's because we're nto using a spatial index.

Partly, just to leave a note for my own purposes, here are some notes and comparisons on a simple way using geopandas' distance. First we create a helpfer function closest that measures the distance from an input point to all the points in another geodataframe `gdf' and it returns a tuple (a list of two values) one wit the distance in feet and another with the index position of the nearest point.

import geopandas as gpd

def closest(point, gdf):
    howfar = gdf.geometry.distance(point)
    return  howfar.min().astype('int'), howfar.idxmin().astype('int')

mdistances = lpis.geometry.apply(lambda x: closest(x, schools))
lpis[['distance_school','ID_nearest_school']] = mdistances.apply(pd.Series)

In the above the closest function returns the minimum distance between an (LPIS) point and all points in the schools geodataframe. It also returns the index position of that nearest school (which could be adapted to pick out an ID from a column).

We then loop over all (LPIS) points to generate a series mdistances list of tuples which we finally with thee the index is also the school ID we can 'unpack' the list of returned tuples to a dataframe to add relevant columns to the original lpis geodraframe.

Performance Method above: 140 ms ± 4.46 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

P.S. -- We can improve the simpler method on two fronts. (A) could speed up closest by avoiding searching the howfar list of distances twice, once for the index position of the minimum index and then again for the actual minimum distance... it suffices to find the index and then directly read howfar at just that index location; and (B) select the minimum distance 'ID' from an 'ID column in the target dataframe (rather than assume it's the same as the index).

Here is an adaptation that does both:

def closest(point, gdf, idcol):
    howfar = gdf.geometry.distance(point)
    ix = howfar.idxmin()
    id = gdf.loc[howfar.idxmin(), idcol]

    return  id, howfar[ix].astype('int')

mdistances = lpis.geometry.apply(lambda x: closest(x, schools))
lpis[['distance_school','ID_nearest_school']] = mdistances.apply(pd.Series)

This method:
133 ms ± 1.05 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

Not a huge speed up on this small neighborhood sized dataframe, but possibly useful if running on something bigger.

jhconning commented 5 years ago

Just updated the comment above... I removed text comparing the distance method to an adaptation of the nearest_neighbors method in your notebook... but I was getting nonsensical matches with the latter.. so something was wrong.

jeremysze commented 5 years ago

Thank you for commenting. I am trying to learn the python syntax, while making it work. Very frustrating exercise. Beginner python skills insufficient to figure out most things.

Let me see if I can understand the new closest function you wrote.

It takes in a point, the geodataframe and name of col of the geodataframe, It calculated a variable howfar, which contains the distance from a row of the geodataframe to the point input. Next, it calculates a variable ix, which looks at all the howfar distances, to find the minimum. Finally, with the minimum captured, it goes into geodataframe to find the name of col of geodataframe(id).

But I dont understand lpis[['distance_school','ID_nearest_school']] = mdistances.apply(pd.Series)

jhconning commented 5 years ago

It takes in a point, the geodataframe and name of col of the geodataframe, It calculated a variable howfar, which contains the distance from a row of the geodataframe to the point input.

yes, howfar is returned as a pandas series with distances from the initial point (e.g. a LPIS location) to all the points in the gdf (e.g. here all the schools in the city).

Next, it calculates a variable ix, which looks at all the howfar distances, to find the minimum.

Right, ix = howfar.idxmin() finds the index position of the minimum in the series. howfar.min() would have returned the actual minimum value but to not search howfar again I just retrieve it as howfar[ix] (and cast it as an integer since we don't care about fractions of a foot).

Finally, with the minimum captured, it goes into geodataframe to find the name of col of geodataframe(id).

But I dont understand lpis[['distance_school','ID_nearest_school']] = mdistances.apply(pd.Series)

That's a little pandas weirdness. What's going on is that each time we call closest on a single point we are returned a 'tuple' like this.

(9745, 1436)

That means that when we

mdistances = lpis.geometry.apply(lambda x: closest(x, schools))
lpis[['distance_school','ID_nearest_school']] = mdistances.apply(pd.Series)

mdistances is a pandas series of these tuples that starts by looking like this:

261 (654, 1433) 262 (699, 1440) 265 (403, 1752) ...

So we need to 'unpack' this list of tuples into a new dataframe with two new columns that we can insert back into the lpis dataframe. The line mdistances.apply(pd.Series) is in effect looping over each tuple row in mdistances and converting tuples with distance and school 'ID' like (654, 1433) into rows of a new dataframe that starts like.

0 1
654 1433
699 1440
403 1752

...

I am trying to learn the python syntax, while making it work. Very frustrating exercise. Beginner python skills insufficient to figure out most things.

python is easy but pandas has lots of quirks that take time to figure out. The truth is that I was stumped but searched stackoverflow to figure out that last line! https://stackoverflow.com/questions/22799300/how-to-unpack-a-series-of-tuples-in-pandas

mbaker21231 commented 5 years ago

Jonathan --

Nicely done! This is some useful code. Do you think it could be adapted to find which polygon in a series of polygons a point lies?

Matt

jeremysze commented 5 years ago

finally figured out how to use the function you wrote! Its great! So much better than clicking around in QGIS!

jeremysze commented 5 years ago

Nicely done! This is some useful code. Do you think it could be adapted to find which polygon in a series of polygons a point lies?

I turned a bunch of points into polygons, and use intersect to find points that lie within those polygons. But I'm not sure if this would help, since it isnt a series of polygons.

jhconning commented 5 years ago

@mbaker21231 I think yes to your question, but need a bit of adapting.

The notebook in fact shows that with spatial-indexing you get much faster tests for points within a polygon.

Let me adapt the example to something similar to what you're asking and using Jeremy's datasets. Suppose we start with a specific location intersection which I call lpispoint. lpispoint.buffer(3000) would create a polygon as a 3000-foot buffer around that point.

Suppose we want to know which of the 1822 schools fall in that particular polygon.

Without a spatial index the command would be: schools.intersects(lpispoint.buffer(3000))

Let's time it:

%%timeit
schools.intersects(lpispoint.buffer(3000))

7.61 ms ± 532 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

That may seem relatively fast but let's now do it with a spatial index.

With spatial index: I'll explain the code below. To just find the index positions of schools inside the bu the

%%timeit
list(sindex.intersection(lpispoint.buffer(3000).bounds))

131 µs ± 2.95 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

That's much faster... though to be fair the no-index method above did the extra work of also retrieving the dataframe records (not just the index positions). So lets do the same here:

%%timeit
schools.loc[sindex.intersection(lpispoint.buffer(3000).bounds)]

1.05 ms ± 188 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

That slows things down but things are still 7X faster than the method without an index.

Code explained: to use the spatial index we need to give it a rectangular box polygon, apparently because the spatial index is based on recursively nested rectangular shapes. So I've fed it the bounding box of the buffer. This means it's searching over a somewhat larger space than the method without an index (similar to the earlier notebook).

jhconning commented 5 years ago

@mbaker21231 Do your 'series of polygons' overlap? Or is the problem simply like determining to which of many non-overlapping counties a particular point in geography belongs?

jhconning commented 5 years ago

@mbaker21231

Do you think it could be adapted to find which polygon in a series of polygons a point lies?

Does a geopandas spatial join do what you're looking for?

Suppose we have a cities gdf with points and another world gdf with country polygons. A 'spatial join' will merge based on which points lie within (or intesect with) polygons.

cities_with_country = gpd.sjoin(cities, world, how="inner", op='intersects')

Notebook example here

jhconning commented 5 years ago

I dug deeper into the geospatial weeds and figured out how to do nearest neighbor searches using a spatial index. It gives BIG speed-ups.

Geopandas use objects and methods from shapely, RTree (for spatial indexes) and other geospatial libraries so we can use RTree's spatial index nearest, intersect and other methods.

Suppose we have two geodaframes (gdfs) each with a geometry column (storing shapely Point, Polygon or other shapes). One gdf is schools with 1822 NYC school locations. We want to find the nearest k schools to each of 3019 intersection locations in the lpis intersections gdf.

Preliminaries: To illustrate let's pick out one lpis locations and call it

In  [18]:   lpispoint = lpis.loc[1428, 'geometry']

In  [19]:  print(lpispoint)
Out [19]:  POINT (1016305.39486663 212446.9878853441)

This used pandas .loc indexing (indexing by row and column names) here choosing the shapely Point at row index name (position) 1428 and column name 'geometry'.

Next we ask geopandas for a bounding box of that point. This may seem silly as we get a box of base and height zero but the RTree nearest method wants the 'search from' location in this format:

In  [20]:  lpispoint.bounds
Out [19]:  
(1016305.3948666304,
 212446.98788534413,
 1016305.3948666304,
 212446.98788534413) 

Finding the nearest points: Geopandas can create an RTree spatial index:

In  [20]:  sindex = schools.sindex

Now we can search sindex to find the nearest school to lpispoint:

In  [21]:  sindex.nearest(lpispoint.bounds)
Out [21]:  <generator object Index._get_ids at 0x00000175030CF7D8>

Not useful yet, so let's turn the generator object into a list to see what's there:

In  [22]:  list(sindex.nearest(lpispoint.bounds))
Out [22]: [1751]

The school nearest to lpispoint is at index position 1751 in sindex which is also its position in the schools gdf on which it is based. Using pandas .iloc or index positioning to get the dataframe row at that position:

In  [23]: schools.iloc[list(sindex.nearest(lpispoint.bounds))]
Out [23]: 

image


In  [24]:  list(sindex.nearest(lpispoint.bounds, 3))
Out [24]: [1751, 1410, 1432]

Let's write a rtnearest function to find the closest school using this RT spatial index method and return the distance to that school plus any idcol columns we want returned to be inserted back into the lpis gdf:

def rtnearest(point, idcol):
    schoolidx  = list(sindex.nearest(point.bounds))[0]
    return schools.iloc[schoolidx][idcol], int(point.distance(schools.iloc[schoolidx]['geometry']))

mdistances = lpis.geometry.apply(lambda x: rtnearest(x,'ID') ) 
lpis[['nearest_school_id','distance_school']] = mdistances.apply(pd.Series)

lpis[['LPID', 'nearest_school_name', 'distance_school']].head(2)

image

Speed comparison:

This method returns fills out the info for 3019 nearest neighbors (searching against an sindex with 1822 schools) at an average speed of abot 1.68 seconds. Let's compare to the other methods

1. RTree's nearest method on a spatial index. 1.68 s ± 12.9 ms per loop

2. gpd.distance method (~ 32seconds) The brute force method at the top of this thread

3. Shapely's nearest_neighbors method without a spatial index: (~ 7 seconds). The method Jeremy adapted in his notebook .

So on account of the fact that the rtree nearest method use a spatial index and the other methods do not, it is about 4 times faster than shapely's own nearest_points and about 60 times faster than the brute force method. I suspect that for larger datasets the speed up is faster.

P.S. -- Yeah, doing this on a Sunday.. But I do have a life... going to yankee stadium in a few minutes.. I just got a bit obsessed with improving geopandas skill set this week.

jeremysze commented 5 years ago

I would be doing this on sunday too! But mid terms on Tuesday! The RTree method is super efficient. pandas .iloc is the function I was looking for all this while. I will try to implement this, because I use need to calculate nearest neighbor for a lot of different variables.

jhconning commented 5 years ago

today @mbaker21231 raised an eyebrow at my RTree efforts but then remarked with a smirk that Scipy's cKDTree might be a little faster.

So I checked it out and in fact cKDTree crushes the rtree method.

As an example recall that to find nearest schools (out of 1822) for 3019 separate LPIS locations it took the RTree method about 1.68 seconds (vs. about 32 seconds for the brute force method)...

cKDTree does essentially the same job in 3.52 miliseconds

%%timeit
dist, idx = tree.query(lpis,k=1)
3.49 ms ± 34.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

10/30 UPDATE: The notebook example now includes code for a ckdnearest function to find nearest neighbors from locations in gdfA to locations in gdfB and returns results as a two column dataframe that can be inserted back into gdfA.

With actual schools and lpis data and dataframe insertion the whole operation times at ~100ms which is very fast.

mbaker21231 commented 5 years ago

@jhconning --

I wouldn't say I raised an eyebrow! I just asked a question a few years back on Stack and someone helped me quite a bit. The answer which that answer pointed me to can be found here..

Actually, good to revisit this because I had been spinning my wheels on another project where I was trying to do something similar.

jhconning commented 5 years ago

ha, I didn't mean 'raise an eyebrow' as in disapproval... just to suggest this was a topic you already knew had a better solution ( as is often the case, and proved again here). ;)

That stackexchange post is useful and made me want to know more about hex binning.

On the topic of geospatial visualization, have you seen geoplot? It seems built on top of geopandas for easy visualization. @jeremysze might want to check it out the visuals for inspiration if only because many examples use NYC collision data. Check out for instance his Voronoi and KDE plot maps, and Matt you might enjoy this recreation of a classic miltary campaign sankey map

jeremysze commented 5 years ago

@jhconning I was checking the nearest neighbor(ckd tree method) and I found that a few intersections did not have a nearest neighbor (LPIS/Truck route).https://github.com/jeremysze/LPIS/blob/master/Signal%20intersection.ipynb

I'm not sure why is that the case. I'm thinking it could be due to the issue raised when we use loc. But that shouldnt be an issue because there are no missing IDS. "Passing list-likes to .loc or [] with any missing label will raise KeyError in the future, you can use .reindex() as an alternative."

jeremysze commented 5 years ago

@jhconning the issue might be from the signal intersection data set that I received from DOT. It seems there were duplicate locations in that shapefile. Still working on it.

jeremysze commented 5 years ago

@jhconning i removed those problematic intersections, and there are no more missing nearest neighbor issues.