parallelize clique expansion

knaaptime commented 3 months ago

on relatively large graphs (n > ~10k) with colocated points, the clique expansion function can get really slow. After cutting it short a few times, I think this for-loop is where things are getting stuck. If that's the bottleneck, I think it should be pretty straightforward to parallelize using something like joblib

ljwolf commented 3 months ago

My original implementation was pure pandas, based on join and explode with no loops. Would it make sense to revisit that?

knaaptime commented 3 months ago

yeah i was wondering about explode there. I presume that would be even faster?

ljwolf commented 3 months ago

I think so.

We moved away from it because (a) hashing the points to establish unique locations was expensive and (b) our policy on sorting was different, but I think we can still use pandas join and explode given the new function signature.

The earlier implementation, conceptually:

Built a series indexed on unique locations with values being a list of all the original indices at that location. This used a groupby on the geometries and agg(list) on the ids
Explode the aggregated series
join the exploded series back to the adjacency table, allowing join to handle the one to many expansion for edges entering/leaving cliques.
separately, built the all-pairs links within each clique using a join.
stack the within clique and beyond clique tables.

pysal / libpysal

parallelize clique expansion #764