Support property aggregation on clustered features

vibze commented 8 years ago

Hello, I've started exploring this lib and have a feature request: Provide reduce function(s) to clustering configuration, so it's possible to filter clusters by aggregated properties rather than just point_count.

mourner commented 8 years ago

Interesting feature. Can you describe your use case in detail? How should it aggregate the properties? E.g. suppose we cluster a million points where each has different property keys and values — how should the top cluster properties look like then?

vibze commented 8 years ago

User must provide an aggregation function, so in a case of points belonging to a cluster have different property keys and values or value types everything should be handled in that function. Mapbox-gl just applies it to the dataset.

Let's say I want to build a population heatmap using population data for every human settlement in area.

Feature example:

{
    "type": "Feature", 
    "properties": { 
        "osm_id": "26544289", 
        "name": "Алматы", 
        "type": "city", 
        "population": 1475400 
    }, 
    "geometry": { 
        "type": "Point", 
        "coordinates": [ 76.9458522, 43.2355947 ] 
    } 
}

Now, I want every cluster to have a sum population of all included points as a property. I configure the source like this:

map.addSource("points", { 
    "type": "geojson", 
    "data": "data/points.geojson",
    "cluster": true, 
    "clusterMaxZoom": 14, 
    "clusterRadius": 20,
    "clusterAggregations": {
        "population_sum": function(prevValue, feature, i, arr) {
            if (!('population' in feature) || isNaN(parseFloat(feature.population))) {
                return prevValue;  // handle the case of missing keys or wrong type
            }

            return prevValue + parseFloat(feature.population);
        }
    }
});

... and every cluster from this source receives the population_sum property to work with.

mourner commented 8 years ago

I believe we won't be able to accept a function, since the mapbox style specification only allows valid JSON. Would it be enough to have something like this?

clusterAggregate: ["population"]

Then it would just do a sum for all given properties.

vibze commented 8 years ago

It would be enough for certain cases, but definitely not as flexible. Do you think something like this http://bl.ocks.org/gisminister/10001728 could be possible with Mapbox GL API?

lavishmantri1 commented 8 years ago

Can also have some common mathematical functions clusterAggregateFunction: "SUM" or clusterAggregateFunction: "AVG"?

That would be helpful.

ryanbaumann commented 8 years ago

+1 for this feature request. Specifically: 1) Ability to specify one or multiple property columns to aggregate 2) Ability to select the aggregation caclulation type.

Requested aggregate calculations for circle-cluster property data to support would be min, max, average, and sum.

As an example, here is the map visual I want to create. To achieve this today, I aggregate the geojson data on the server to return one point with the aggregated "property" value I want.

I'd like to completely port my heatmap calculations over to the new circle-cluster and circle data-driven style. The ask would be to be able to return the raw geojson point data from my data set, and cluster it using the new circle-cluster / geojson property aggregation feature.

adv_dds_example

One challenge may be enforcing data types - geojson property data could be strings, floats, ints, etc. Some aggregate calcs may apply to certain data types but not others.

idlikesometea commented 8 years ago

Hi, is there any news on this feature? It would be awesome to have it on GL.

davecranwell commented 8 years ago

I'd love this functionality too. I'm attempting to display several million points on a map, retrieved by an Elasticsearch geohash aggregate query (https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations-bucket-geohashgrid-aggregation.html). This returns data where each row already includes a sum of the points within a certain area.

As it stands the mapbox gl clustering code would only sum the number of results returned not the already-summed values within each result, which totals several million.

davecranwell commented 8 years ago

I note this is already an issue on supercluster: https://github.com/mapbox/supercluster/pull/12

redbmk commented 8 years ago

I've been playing around with this and have something working with supercluster to pass in an arbitrary function that will let you aggregate properties.

Then I was thinking for mapbox you could pass in something like clusterAggregates: { population: 'sum', minimum_wage: 'min' }, then mapbox would pass a function to supercluster based on that. In this case the cluster would get a population property with the sum of all the population properties, and a minimum_wage property with the minimum of all minimum_wage properties in the cluster.

It would be simple enough to have sum, max, and min functions to start, which could expand from there.

I have a feeling there will be a lot of red tape around this though - I pretty much have it coded already but it requires changes to supercluster, mapbox-gl, and mapbox-gl-style-spec.

lucaswoj commented 8 years ago

pass in an arbitrary function that will let you aggregate properties

As stated above we cannot accept arbitrary functions for technical reasons.

I have a feeling there will be a lot of red tape around this though - I pretty much have it coded already but it requires changes to supercluster, mapbox-gl, and mapbox-gl-style-spec.

If you're serious about working on this project (:heart:), please open a ticket in the mapbox/mapbox-gl repo and we'll walk you through the changes step-by-step! It's really quite straightforward.

davecranwell commented 8 years ago

I had a stab at this too and am using this fork in production: https://github.com/davecranwell/mapbox-gl-js

It uses a PR made by a kind person on supercluster (https://github.com/mapbox/supercluster/pull/12) and my own fork of mapbox-gl-style-spec (https://github.com/davecranwell/mapbox-gl-style-spec) which adds support for the field added in the PR.

Was unsure whether to submit this PRs since the one on supercluster has been lurking for a while now.

redbmk commented 8 years ago

As stated above we cannot accept arbitrary functions for technical reasons.

@lucaswoj My understanding is that we can't pass a function into the mapbox styles because they only take JSON - I'd imagine part of that reasoning is because data passed to a worker needs to be serializable. However, supercluster could take any function and mapbox-gl could be the one to pass in the function (from the worker) based on the JSON you give it. Please correct me if I'm understanding something wrong.

I'd be happy to work on the project - I'll get everything together soon and open a ticket.

@davecranwell The mapbox/supercluster#12 pr is really similar to what I have, but the problem I have with it is you can only work with one property (metric). I think it would be handy to be able to aggregate multiple properties, so for example you can get the sum of one property, and the max of another, for example. I guess in theory you could have a property that is a hash and use that as the metric, but then with just adding clusterMetric you're stuck with the default reducer, which is a summing function.

lucaswoj commented 8 years ago

However, supercluster could take any function and mapbox-gl could be the one to pass in the function (from the worker) based on the JSON you give it.

Can you provide a concrete example (pretend Javascript) of how this would work?

I think it would be handy to be able to aggregate multiple properties, so for example you can get the sum of one property, and the max of another, for example.

Agreed. Whatever solution we end up merging will most likely work the way you describe.

redbmk commented 8 years ago

Sure,

I still need to write tests and make changes to the spec. Also I figure it will need to be tested and code reviewed for style, browser support, error checking, safety/security, speed, etc. Regardless here's what I have so far:

supercluster changes mapbox-gl-js changes

Then in your mapbox style you could provide an option like this:

clusterAggregates: {
  property1: 'sum',
  property2: 'min',
  property3: 'min',
  property4: 'max'
}

For now there's no error checking, so if you don't have those properties you'll just get some concatted strings or some NaN's or something else weird probably. It just assumes those properties will be on each point, and carries the accumulated values over to the clusters. So your clusters will end up with the same properties, but with aggregated values. You could still filter the clusters and style them differently if you want by using point_count like normal.

lamuertepeluda commented 7 years ago

Any news on this topic?

I'd like to have a behavior for MGL similar to https://github.com/SINTEF-9012/PruneCluster which works great with Leaflet.

lucaswoj commented 7 years ago

@lamuertepeluda per https://www.mapbox.com/mapbox-gl-js/roadmap/ this feature is under active development right now 😄

lamuertepeluda commented 7 years ago

@lucaswoj that's great news. I started contributing to the clustering engine, "supercluster". See https://github.com/redbmk/supercluster/pull/2 and https://github.com/mapbox/supercluster/pull/12#issuecomment-275718245 I hope they get merged soon.

I'll also try to fork mapbox-gl and add the accumulator parameter to the geojson source. I think it should have good enough performances.

redbmk commented 7 years ago

Hey, I've been pretty busy but I'll look into that pull request hopefully tonight. I'd actually been working off the cluster-agg-hash-array branch after some discussion about the style spec in mapbox/DEPRECATED-mapbox-gl#26. I'll merge that into master (in redbmk/supercluster) as well.

@lucaswoj are we still depending on mapbox/mapbox-gl-style-spec#47 (it looks like the discussion for this has moved around a bit, but I think that's the latest). I don't see it on the roadmap.

redbmk commented 7 years ago

Alright I updated the master and cluster-agg-hash-array branches of redbmk/supercluster.

We still need to come to some consensus on the syntax and I feel like the style spec needs to be fleshed out a bit. Right now it just accepts any object and has a comment suggesting how to use it, but it seems like there should be a way to specify exactly what options are allowed.

lamuertepeluda commented 7 years ago

I think the aggregator function should just be a parameter of the geojson source (worker options), see the workerOptions configuration rather than the style object.

Once you do this, you should get the custom properties defined in the aggregator function together with _{pointcount} for each cluster like in the cluster example.

I tried upgrading supercluster to the current version, and this makes _{pointcount} and _{clusterid} available for each cluster. So I think it would be enough to add an aggregator function parameter to the addSource that get passed down to the supercluster constructor as modified by @redbmk and me to get the custom aggregations for each point.

This wouldn't require any changes to the current style specs but just to add some other layers, that like the "cluster-count" layer of the example, only represent the cluster population information. It would be easy to get a representation like in the left picture of PruneCluster categories example, while to get the representation on the right, there should be some additional border specs parameter.

I already managed to do this but I was missing the information about the cluster population, that's why I switched to the mod of supercluster.

redbmk commented 7 years ago

That's how I have it coded here. The code is over a month behind from master now, but it's the same idea you mentioned, going through workerOptions.

Ideally we would be able to add an aggregator function via the style spec or via the addSource function. That way you could add it to the original style and/or via addSource.

I believe it's technically possible to add arbitrary functions to a web worker by creating a Blob. If that's true, maybe we could pass in a function as a string that would be parsed in the worker. Maybe the string could be either a url to a function (which could be a blob url or an external script), or a string representation of a function (e.g. "" + function () { return 'hello world' }) which would be converted to a blob before being passed into the worker. That might also help solve mapbox/mapbox-gl-style-spec#47 by just letting the user pass in any function as a string.

lamuertepeluda commented 7 years ago

Indeed. I would do that either by using Blob or some library such as http://www.eslinstructor.net/vkthread/ that seems to allow passing functions defined in the main script to the workers

lamuertepeluda commented 7 years ago

Alright @redbmk I updated the demo of your supercluster fork by adding the Blob approach you mentioned here.

I had a look to the mapbox-gl-js code and I have to figure out how this could blend into the more complex worker instantiation mechanism, but I definitely think it is technically possible to do so.

lucaswoj commented 7 years ago

That might also help solve mapbox/mapbox-gl-style-spec#47 by just letting the user pass in any function as a string.

This is a non-starter for us because we need to have parity with GL Native which does not necessarily have a Javascript interpreter available.

redbmk commented 7 years ago

Oh, whoops yeah that's a major oversight on my part!

If we end up going with @mourner's changes in mapbox/supercluster#36 then the style spec could have properties like clusterInit, clusterMap and clusterReduce that get parsed and passed in. Maybe you could do something like clusterInit: { min: Infinity, sum: 0 }. The map and reduce functions could then use the same syntax we land on for mapbox/mapbox-gl-style-spec#47, which I'd imagine should be able to do simple math and have access to property values.

mollymerp commented 7 years ago

@mourner the supercluster level changes have shipped right? All we need to do now is adapt the gl-js api to take advantage of the new functionality? Are you up for taking this on or would you like someone else/me to take this?

mourner commented 7 years ago

@mollymerp for property aggregations, we would need arbitrary expressions support in the style spec, which is quite a big task. I know @kkaefer approached it with some experiments on the native side at some point. It would be great to start looking at this again.

mollymerp commented 7 years ago

cool – ticket tracking that feature request is at #4077

johnlaur commented 7 years ago

It is disheartening that this has been stalled so long after the supercluster changes. Even without arbex and all of the style spec changes and the like it would be great to at least have the production release of mapbox-gl-js built with the current release of supercluster 2.3.0 instead of 2.2.0. This would expose cluster_id in the feature properties so that an independently initialized supercluster object can be cross referenced by cluster_id. Right now correlating the two requires a query by bounding box which is neither reliable nor efficient.

map.addSource("data", { type: "geojson", data: "data.geojson", cluster: true });

// We can do whatever we want to our own supercluster, feature parity be damned
var cluster = supercluster({});
cluster.load(map.getSource('data').serialize().data.features);

...

// Get a cluster feature on click or whatever
var features = map.queryRenderedFeatures(e.point, { layers: ["data"] });
// Requires mabbox-gl-js built with supercluster ^2.3.0
var cluster_id = features[0].properties.cluster_id;
// Get every feature within the cluster
var all_features = cluster.getLeaves(cluster_id, Math.floor(map.getZoom()), limit = Infinity);

jfirebaugh commented 7 years ago

@johnlaur Want to submit a PR that updates the supercluster version?

ryanbaumann commented 7 years ago

Here is an interim solution to property aggregation using Supercluster in GL JS.

https://bl.ocks.org/ryanbaumann/01b2c7fc0ddb7b27f6a72217bd1461ad

ezgif com-gif-maker

benderlidze commented 7 years ago

Any updates on this?

redbmk commented 7 years ago

@benderlidze there is a lot of activity happening at #4777 to allow for arbitrary expressions. Once that is complete, it will be possible to use those expressions to add support for property aggregation.

kkaefer commented 7 years ago

In addition to that v0.39, which was released a few weeks ago, ships with an updated version of supercluster that supports property clustering.

mourner commented 7 years ago

@kkaefer there's no way to take advantage of that supercluster feature (without forking GL JS) until we expose it through the style spec.

joeyramoni commented 6 years ago

Any news on this feature? Alternative workaround?

lamuertepeluda commented 6 years ago

Any e.t.a for this feature? It's a long time since this issue (and other similar ones) has been opened, and the problem has been already resolved on the superclusterjs side...

lamuertepeluda commented 6 years ago

I attempted a new workaround approach (which is a bit complex and I'd like to publish as an example on dedicated github repo as soon as I have time) but mainly works like this:

have a worker thread which fetches the data (xhr) and clusterizes them with supercluster@3.x and custom map/reduce
- the worker threads stores the clusterized GeoJSON feature collection data in a blob URL (URL.createObjectURL)
- when it receives a getClusters message with bbox and zoom, it invokes the index.getClusters returns the new blob URL of the clusterized data
the main thread instantiates a source (cluster: false) which receives the blob URL as data option, and on each moveend (debounced) it posts a getClusters message with bbox and zoom
- when it receives the getClusters message from the worker, it invokes source.setData('clusteredFeaturesUrl')

Each cluster gets the custom stats defined in the worker. It seems to be fluid enough and waste less memory even with ~300 point features.

Of course it is suboptimal and I want to verify some other details, but it basically works and it's maybe better than having a copy of supercluster in the main thread, which ~sucks~ I don't like at all.

JoJ123 commented 6 years ago

@redbmk

@benderlidze there is a lot of activity happening at #4777 to allow for arbitrary expressions. Once that is complete, it will be possible to use those expressions to add support for property aggregation.

Is there an update on this? We would also like to add this feature in our project.

tvler commented 6 years ago

This would be a great feature for something we're working on at Open Listings

Previously we've just been displaying homes on our map, which has been working fine with clustering. But we're currently working on coalescing homes that are all in the same building into a singular geojson feature, so there'll be less markers on our maps in densely populated areas.

With the way clusters are working now, the point_count won't be correct if building's are clustered in –– it'll just count as 1, rather than the total number of homes in the building.

I think if we store the building home count in our geojson metadata, then there should be a way to render the cluster value we want if a feature like this gets added!

redbmk commented 6 years ago

I've been really busy but I would love to work on this again now that Arbex is ready. Maybe I can make some time this weekend to take a look.

senoj commented 6 years ago

This is something I have been patiently waiting to be implemented so its good to see this topic looked at again.

Im working on a project that needs this feature pretty badly. It is intended to monitor stores that are online/offline/mobile backup. Clusters would need to follow 3 rules:

1) If all stores are online in a cluster it will be green 2) If one store is offline in a cluster it will be red 3) If one store is on mobile backup it will be orange

Right now it looks like this, unclustered points are working as expected:

foundryspatial-duncan commented 6 years ago

@senoj you can do that right now with the external supercluster library, see ryanbaumann's post above. I successfully used that technique to render pie charts showing the ratios of different properties in the clusters. It'll definitely be simpler when that's part of the mapbox-api, but if your business needs are urgent, you can do it now :+1:

senoj commented 6 years ago

@foundryspatial-duncan its not urgent, if there is no traction on this over the next month ill look at this. Id hate to put the work in and shortly after have it added to the mapbox-api :)

JoJ123 commented 6 years ago

@senoj @foundryspatial-duncan I also started to implement it with supercluster. This is working quite well. But I would also prefer, when it's integrated in the mapbox-api.

redbmk commented 6 years ago

@kkaefer @qwervo @lamuertepeluda @JoJ123 @tvler @senoj

I created a PR for this (#7004). I'm sure the syntax will change slightly after it's reviewed by the maintainers, but if you want to test it out now, that would help find any bugs or issues that may arise. Any feedback is welcome. Thanks!

Also, if anybody is interested in helping out, I still need to do some benchmarking, make sure the documentation is up to date everywhere, etc. Any help would be much appreciated!

mourner commented 5 years ago

Picking this up again, here's a PR proposing a different syntax/implementation for this feature https://github.com/mapbox/mapbox-gl-js/pull/7584 — check it out!

mourner commented 5 years ago

Last call for comments on #7584! It's ready for review and likely to be merged soon.

mapbox / mapbox-gl-js

Support property aggregation on clustered features #2412