rgeo / rgeo-activerecord

RGeo ActiveRecord extensions and tools for spatial connection adapters
Other
89 stars 64 forks source link

Best way to group by distance with a minimum number of ocurrances? #73

Closed Kutomore closed 1 year ago

Kutomore commented 1 year ago

I'm dipping my toes into spatial queries for a personal project, wondering what's the best way to group locations that are within a certain distance of eachother with a minimum amount of locations.

So if there are 20 locations within 1000m of eachother I'd want the result to be an array containing those locations for example.

I'm thinking of using ST_ClusterWithin or ST_ClusterDBSCAN and been scouting the Rails resources for doing such a thing. What would be the recomended way of approaching this?

Edit:

Using something like Location.select("*, ST_ClusterDBSCAN(coordinates, eps := 1000, minpoints := 20) over () AS cid").to_a.group_by(&:cid) works, but I'm wondering if I can do it in a cleaner manner.

Curiously enough ChatGPT suggests both cluster and .cluster_by_distance methods, I however could find no information about either online. So I'm thinking it mght be just wrong.

keithdoggett commented 1 year ago

@Kutomore If you want to use a more composable version of what you wrote, you can use the st_function method that is provided to all the ARel spatial columns.

For example:

window = Arel::Nodes::SqlLiteral.new("()")
cluster_query = Location.arel_table[:coordinates].st_function("ST_ClusterDBSCAN", [1000, 20], [false, false, false]).over(window).as('cid')
res = Location.select(Arel.star, cluster_query)
p res.to_sql
#=> SELECT *, ST_ClusterDBSCAN(\"locations\".\"coordinates\", 1000, 20) OVER () AS cid FROM \"locations\"

The way this works is that Location.arel_table[:coordinates] gets the column we want (i.e. locations.coordinates), then the st_function method states we're going to write a generic spatial function.

The first arg is the name of the spatial function (ST_ClusterDBSCAN), the next argument is an array of the inputs into the function after the first input (which must be the column we're writing the function on, locations.coordinates in this case).

Finally, the last argument is an array stating whether any of the other inputs or return values are geometries. In this case both eps and minpoints are not geometries so the last 2 entries are false, and ST_ClusterDBSCAN returns an integer, so the first value is false as well.

I'm not sure if this method is easier than what you wrote, but you could pretty easily wrap what I have above in a scope or another class method and easily call it with different inputs.

I'm going to close this since it's not a bug, but feel free to continue asking questions if something's unclear.