Modifying the `project` method to allow for points with known Route IDs

phildias commented 8 months ago

There have been several times where I've needed to project a GeoDataFrame with points onto a reference EventsCollection where I also happen to know which Route ID each point belongs to.

Currently, I need to make a workaround of looping over all the unique Route IDs and filtering both sets (the input GeoDataFrame and the reference EventsCollection), run the projection and then concatenate all the results.

However, it probably wouldn't be too difficult for us to change the project method's API to account for this. Specifically, we could add a keys parameter (either one column or set of columns) and the spatial search process gets filtered to only consider cases in the reference EventsCollection where the Route ID match.

Here's an example and my current workaround:

import pandas as pd
import geopandas as gpd
import linref as lr
from shapely.geometry import Point, LineString

# Reference GDF with known routes and DFOs
ref_gdf = gpd.GeoDataFrame({'Route_ID':['A','B','A','B'],
                            'BEG':[0,1,3,6],
                            'END':[2,3,5,8],
                            'geometry':[LineString(((0,0),
                                                    (0,1),
                                                    (0,2))),
                                        LineString(((0,1),
                                                    (0,2),
                                                    (0,3))),
                                        LineString(((3,1),
                                                    (3,2),
                                                    (3,3))),
                                        LineString(((6,1),
                                                    (6,2),
                                                    (6,3)))
                                        ]},
                           crs='epsg:4326')
# Note that Routes A and B physically overlap between (0,1) and (0,2)

# Input GDF which we will project onto the Reference GDF
in_gdf = gpd.GeoDataFrame({'Point_ID':[1,2,3,4],
                           'Known_Route_ID':['A','A','B','B'],
                           'geometry':[Point(0,0),
                                       Point(0,1.25),
                                       Point(0,1.75),
                                       Point(0,3)]},
                          crs='epsg:4326')
# Note that in this case, I already know that points 1 and 2 belong to Route A
# and that points 3 and 4 belong to Route B.

# First trying to do it all at once:
ref_ec = lr.EventsCollection(ref_gdf, 
                             keys=['Route_ID'],
                             beg='BEG',
                             end='END')

in_ec = ref_ec.project(in_gdf, buffer=0.5)

print(in_ec.df)
#    Point_ID Known_Route_ID                 geometry  index_right Route_ID   LOC
# 0         1              A  POINT (0.00000 0.00000)            0        A  0.00
# 1         2              A  POINT (0.00000 1.25000)            0        A  1.25
# 2         3              B  POINT (0.00000 1.75000)            0        A  1.75
# 3         4              B  POINT (0.00000 3.00000)            1        B  3.00

# Even though we know that Point 3 is supposed to be on Route B, it gets joined 
# to Route A due to the fact that the spatial join was done without taking into 
# consideration its known Route ID.

As I mention in the comment above, note that Point 3 is supposed to be on Route B, but it gets joined to Route A due to the fact that the spatial join from the project method was done without taking into consideration its known Route ID.

Here is how I typically get around this limitation:

# In this workaround, we loop over all the Route_IDs:
results = []
for i,this_route_id in enumerate(in_gdf['Known_Route_ID'].unique()):
    # Keeping only the relevant Route IDs for the input and 
    # reference GeoDataFrames
    this_in_gdf = (in_gdf
                   .query(f'Known_Route_ID=="{this_route_id}"')
                   .reset_index(drop=True))

    this_ref_gdf = (ref_gdf
                    .query(f'Route_ID=="{this_route_id}"')
                    .reset_index(drop=True))

    this_ref_ec = lr.EventsCollection(this_ref_gdf, 
                                      keys=['Route_ID'],
                                      beg='BEG',
                                      end='END')

    # Performing the projection only on the subset GeoDataFrame
    this_result_ec = this_ref_ec.project(this_in_gdf, buffer=0.5)

    results.append(this_result_ec.df.copy())

results_df = pd.concat(results, ignore_index=True)

print(results_df)
#    Point_ID Known_Route_ID                 geometry  index_right Route_ID   LOC
# 0         1              A  POINT (0.00000 0.00000)            0        A  0.00
# 1         2              A  POINT (0.00000 1.25000)            0        A  1.25
# 2         3              B  POINT (0.00000 1.75000)            0        B  1.75
# 3         4              B  POINT (0.00000 3.00000)            0        B  3.00

Do you think we can modify the project method to allow for known Route IDs?

tariqshihadah commented 8 months ago

This is an enhancement that's come up before and is probably worth implementing. I expect it shouldn't be too difficult to add this feature.

Should this be limited to the keys of the events collection being projected onto, or should this be a general matching parameter that can match any/any number of arbitrary fields between the events collection and the projected data frame, similar to pandas' merge feature?

An alternative workaround that I use involves setting nearest=False to get all projection hits within the buffer and then performing the filtering after the projection by dropping records where, e.g., Route_ID != Known_Route_ID. I expect this should produce an equivalent result to your approach but may be faster. This is probably approximately the approach I'd suggest for bringing this feature into the project method.

phildias commented 8 months ago

I really like the nearest=False solution! Much cleaner than the cumbersome workaround.

Regarding the question of sticking to just RouteID or multiple keys, I'm a bit conflicted... I think the rest of linref is built around just using the keys as the route identifier, and I can't think of all that many cases where we would need to use any other columns to narrow down the geographic search.

Having said so, I do see your point about how Pandas in general does joins & merges... Hmmm...

I think my vote would still go towards only using the RouteID keys to narrow the search. I feel doing so will keep things more aligned with the rest of how the library works.

tariqshihadah / linref

Modifying the `project` method to allow for points with known Route IDs #8