rorystephenson / supercluster_dart

A port of Mapbox's JavaScript supercluster library for fast marker clustering.
ISC License

SuperclusterMutable.insert() and SuperclusterMutable.load() produce different results #5

Closed Robbendebiene closed 1 year ago

Robbendebiene commented 1 year ago

Hi, it's me again :)

Previously I was using SuperclusterImmutable.load() (which was working well for me), but recently I wanted to switch to SuperclusterMutable since my underlying data can change. I noticed that it produces different results when I use the insert method; the result produced by load looks "more correct" to me. To verify the differences I wrote a little test:

import 'package:flutter_test/flutter_test.dart';
import 'package:latlong2/latlong.dart';
import 'package:supercluster/supercluster.dart';

void main() async {
  final points = <LatLng>[
    LatLng(50.813443, 12.931275),
    LatLng(50.813362, 12.931059),
    LatLng(50.814048, 12.931202),
    LatLng(50.814171, 12.931484),
    LatLng(50.814273, 12.931415),
    LatLng(50.813756, 12.93058),
    LatLng(50.813559, 12.930912),
    LatLng(50.813859, 12.930643),
    LatLng(50.813898, 12.930761),
    LatLng(50.813623, 12.931012),
  ];

  final mutableSuperCluster = SuperclusterMutable<LatLng>(
    getX: (p) => p.longitude,
    getY: (p) => p.latitude,
    minZoom: 0,
    maxZoom: 20,
    radius: 120,
    extent: 512,
    nodeSize: 64,
  );

  void testCluster(int zoom) {
    for (final p in points) {
      mutableSuperCluster.insert(p);
    }
    final mutableClustersWithInsert = mutableSuperCluster.search(-180, -85, 180, 85, zoom);

    mutableSuperCluster.load(points);
    final mutableClustersWithLoad = mutableSuperCluster.search(-180, -85, 180, 85, zoom);

    expect(mutableClustersWithInsert.length, equals(mutableClustersWithLoad.length));

    for (var i = 0; i < mutableClustersWithInsert.length; i++) {
      expect(mutableClustersWithInsert[i].numPoints, equals(mutableClustersWithLoad[i].numPoints));
      expect(mutableClustersWithInsert[i].x, equals(mutableClustersWithLoad[i].x));
      expect(mutableClustersWithInsert[i].y, equals(mutableClustersWithLoad[i].y));
    }
  }

  test('test clusters', () {
    testCluster(17); // throws: Expected: <2> Actual: <3>
  });

  test('test clusters', () {
    testCluster(0); // throws: Expected: <10> Actual: <20>
  });
}

I also noticed that SuperclusterImmutable.load() and SuperclusterMutable.load() produce different results:

See test:

```dart
import 'package:flutter_test/flutter_test.dart';
import 'package:latlong2/latlong.dart';
import 'package:supercluster/supercluster.dart';

void main() async {
  final points = <LatLng>[
    LatLng(50.813443, 12.931275),
    LatLng(50.813362, 12.931059),
    LatLng(50.814048, 12.931202),
    LatLng(50.814171, 12.931484),
    LatLng(50.814273, 12.931415),
    LatLng(50.813756, 12.93058),
    LatLng(50.813559, 12.930912),
    LatLng(50.813859, 12.930643),
    LatLng(50.813898, 12.930761),
    LatLng(50.813623, 12.931012),
  ];

  final mutableSuperCluster = SuperclusterMutable<LatLng>(
    getX: (p) => p.longitude,
    getY: (p) => p.latitude,
    minZoom: 0,
    maxZoom: 20,
    radius: 120,
    extent: 512,
    nodeSize: 64,
  );

  final immutableSuperCluster = SuperclusterImmutable<LatLng>(
    getX: (p) => p.longitude,
    getY: (p) => p.latitude,
    minZoom: 0,
    maxZoom: 20,
    radius: 120,
    extent: 512,
    nodeSize: 64,
  );

  void testCluster(int zoom) {
    mutableSuperCluster.load(points);
    immutableSuperCluster.load(points);

    final mutableClusters = mutableSuperCluster.search(-180, -85, 180, 85, zoom);
    final immutableClusters = immutableSuperCluster.search(-180, -85, 180, 85, zoom);

    expect(mutableClusters.length, equals(immutableClusters.length));

    for (var i = 0; i < mutableClusters.length; i++) {
      expect(mutableClusters[i].numPoints, equals(immutableClusters[i].numPoints));
      expect(mutableClusters[i].x, equals(immutableClusters[i].x));
      expect(mutableClusters[i].y, equals(immutableClusters[i].y));
    }
  }

  test('test clusters', () {
    testCluster(17); // throws: Expected: <0.5359196236111111> Actual: <0.5359202083333333>
  });

  test('test clusters', () {
    testCluster(0); // throws: Expected: <0.5359195397222223> Actual: <0.5359202083333333>
  });
}
```

As always thanks for your work!

rorystephenson commented 1 year ago

Hi @Robbendebiene thanks for the super detailed issue description!

It is expected that load/insert do not produce the same clusters due to how the underlying algorithm works. That said, I've never really compared the clusters resulting from loading/inserting, so I can't say for sure that something isn't amiss.

I was surprised by the testCluster(0) result, since 20 points is twice the number of points there should be. I ran it locally, and moving the mutableSuperCluster declaration inside the testCluster method resolved that (it failed on the cluster x/y comparison instead), so I think that may be a bug in the test. That still doesn't mean nothing is amiss, so I will try to have a further look.

As far as mutable/immutable producing different cluster positions that's unexpected and I need to have a look why that is.

I hope to have more info soon!

rorystephenson commented 1 year ago

@Robbendebiene I've just released a new version with some fixes and a couple of tests which make sure that mutable/immutable produce the same clusters. The cluster positions can have extremely small differences due to adding floating point numbers in different orders, but given that the cluster positions match to about the 15th decimal place, and that making them exactly the same would require sorting cluster points before calculating a cluster's weighted x/y, I don't intend to change this.

I'm yet to dig in to load/insert differences for mutable clusters although I have some other work to tackle before I can look at that.

Robbendebiene commented 1 year ago

Hi, thanks for your reply.

I was surprised by the testCluster(0) result, since 20 points is twice the number of points there should be. I ran it locally, and moving the mutableSuperCluster declaration inside the testCluster method resolved that

Ah my bad, sorry for that.

It is expected that load/insert do not produce the same clusters due to how the underlying algorithm works. That said, I've never really compared the clusters resulting from loading/inserting, so I can't say for sure that something isn't amiss.

What differences would you expect? Just a different sorting order of the points? Or some real differences in clusters, like different sizes etc.?

The cluster positions can have extremely small differences due to adding floating point numbers in different orders, but given that the cluster positions match to about the 15th decimal place, and that making them exactly the same would require sorting cluster points before calculating a cluster's weighted x/y, I don't intend to change this.

I also think this isn't necessary.

I've just released a new version with some fixes and a couple of tests which make sure that mutable/immutable produce the same clusters.

Cool, I'll give it a try!

rorystephenson commented 1 year ago

It is expected that load/insert do not produce the same clusters due to how the underlying algorithm works. That said, I've never really compared the clusters resulting from loading/inserting, so I can't say for sure that something isn't amiss.

What differences would you expect? Just a different sorting order of the points? Or some real differences in clusters, like different sizes etc.?

So for some background, this package is a port of https://github.com/mapbox/supercluster which is not mutable. When I added the mutable supercluster implementation I initially tried to make an insertion add the point and then iterate up the zoom levels from the highest zoom, modifying only the affected clusters (e.g. updating their cluster data, lowest/highest zoom values etc.). In practice this turned out to be horrendously messy, and I ended up opting for a much simpler but slightly less efficient implementation (which in my experience is still plenty fast). So the way insertion works is:

  1. Starting from the highest zoom, add the point to the index. If it clusters, go to 2; otherwise repeat with the next lowest zoom level until minZoom is reached.
  2. Remove the point/cluster with which the new point clusters, and remove all of its ancestors (all of the matching/parent clusters at lower zoom levels).
  3. Add the new cluster formed by combining the point with the matching point/cluster from step 1.
  4. Iterate upwards (towards minZoom) and re-cluster all of the removed elements.

When clustering a set of points/clusters, the order in which they are passed to the clustering logic affects the output. When looking for matching points in step 1 and rebuilding clusters in step 4, the order of the points/clusters is determined by the underlying index, which gives no order guarantee. This means that the clusters may change when adding a point.

If you wanted to make insertion more stable it would be possible to store an order value (ascending int) on the points/clusters which could be used to order the points/clusters before passing them to the clustering logic. This would have a performance penalty depending on how densely the points are clustered.
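The order-value idea could look something like this minimal sketch. All names here (`OrderedElement`, `OrderedStore`) are made up for illustration and are not part of the package:

```dart
class OrderedElement {
  final double x, y;
  final int order; // ascending insertion id, assigned once when inserted
  OrderedElement(this.x, this.y, this.order);
}

class OrderedStore {
  int _nextOrder = 0;
  final elements = <OrderedElement>[];

  /// Tags each inserted element with the next order value.
  OrderedElement insert(double x, double y) {
    final e = OrderedElement(x, y, _nextOrder++);
    elements.add(e);
    return e;
  }

  /// A spatial index returns candidates in no guaranteed order; sorting by
  /// the stored order value before clustering makes re-clustering
  /// deterministic, at the cost of a sort per cluster rebuild.
  List<OrderedElement> stableOrder(Iterable<OrderedElement> fromIndex) =>
      fromIndex.toList()..sort((a, b) => a.order.compareTo(b.order));
}
```

The sort is what incurs the performance penalty mentioned above: the denser the clusters, the larger the candidate lists being sorted.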

I'm very open to suggestions/PRs to improve this!

Robbendebiene commented 1 year ago

Thanks for the detailed explanation. It makes sense that the input order affects the resulting clusters.

So I'm sort of abusing the clustering approach to get a kind of "marker collision" behaviour, like e.g. Google does when you set it to OPTIONAL_AND_HIDES_LOWER_PRIORITY. What I basically do is create clusters at roughly double the size of a marker, then loop over all clusters and just display one marker per cluster.

This used to work pretty well using SuperClusterImmutable:

Using SuperClusterMutable:

I guess the reason this doesn't lead to the same results with SuperClusterMutable is that it uses rbush while SuperClusterImmutable uses kdbush, which can lead to different cluster results even with the same markers and insertion order?
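The approach described at the top of this comment can be sketched with a stand-in Cluster type (illustrative only, not the supercluster API): cluster at roughly twice the marker radius, then draw one representative per cluster.

```dart
class Cluster<T> {
  final double x, y; // cluster centre
  final List<T> members; // the markers that fell within the cluster radius
  Cluster(this.x, this.y, this.members);
}

/// Picks one marker to draw per cluster. Choosing `first` versus `last` is
/// exactly the kind of decision that exposes element-order differences
/// between clustering implementations.
List<T> representatives<T>(List<Cluster<T>> clusters) =>
    [for (final c in clusters) c.members.first];
```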

rorystephenson commented 1 year ago

Thanks for the detailed explanation. It makes sense that the input order affects the resulting clusters.

So I'm sort of abusing the clustering approach to get a kind of "marker collision" behaviour, like e.g. Google does when you set it to OPTIONAL_AND_HIDES_LOWER_PRIORITY. What I basically do is create clusters at roughly double the size of a marker, then loop over all clusters and just display one marker per cluster.

This used to work pretty well using SuperClusterImmutable:

Using SuperClusterMutable:

I guess the reason this doesn't lead to the same results with SuperClusterMutable is that it uses rbush while SuperClusterImmutable uses kdbush, which can lead to different cluster results even with the same markers and insertion order?

Hey @Robbendebiene thanks for sharing those images. I'm surprised they are so different, was the second one created by inserting one-by-one?

Robbendebiene commented 1 year ago

Yes the second one was created by inserting one by one.

Using load, Immutable and Mutable clusters only differ when showing the last element of a cluster as the representative element. If using the first one, they look identical. So I guess the clusters are identical but have a different element order.

I tried to create a minimal example to better visualize the differences between those 3 methods. However, the insertion method now fails completely with `Failed to find point (f6d36afd-92eb-4a10-97c7-74507b6b1669 - null) to remove from zoom (14)`

I attached the example below.

cluster_example.zip

rorystephenson commented 1 year ago

I will try to get to this soon @Robbendebiene , thanks for the example.

rorystephenson commented 1 year ago

@Robbendebiene If you're able to create a minimal example which reproduces this error it'll be easier for me to get to the bottom of the issue.

Robbendebiene commented 1 year ago

@rorystephenson Have you tried the example I attached above?

rorystephenson commented 1 year ago

@rorystephenson Have you tried the example I attached above?

Yes and it errored as you mentioned, but I'm not familiar with the app and it will take me some time to understand exactly what's going on. If you could reduce it to a test case or a minimal dart-only code excerpt it will be much easier for me to debug. In any case I'm moving house over the next two days, so I won't be able to look at this for some days.

Robbendebiene commented 1 year ago

So, to reproduce the `E/flutter ( 6555): [ERROR:flutter/runtime/dart_vm_initializer.cc(41)] Unhandled Exception: Failed to find point (bedb261d-bcbc-44ba-9a97-e8eb5ab54d93 - null) to remove from zoom (13)` error, this is as minimal as I could get:

import 'package:latlong2/latlong.dart';
import 'package:supercluster/supercluster.dart';

final _points = [
  // if one of the points is removed the error doesn't occur, probably because the clustering changes
  LatLng(45.44088, 12.328761),
  LatLng(45.440444, 12.337212),
  LatLng(45.440539, 12.337038),
  LatLng(45.438299, 12.323898),
  LatLng(45.437132, 12.32746),
];

void main() {
  clusterByInsert(_points);
}

SuperclusterMutable<LatLng> clusterByInsert(List<LatLng> points) {
  final cl = SuperclusterMutable<LatLng>(
    getX: (p) => p.longitude,
    getY: (p) => p.latitude,
    minZoom: 0,
    maxZoom: 20,
    radius: 120,
    extent: 512,
    nodeSize: 64,
  );

  for (final point in points) {
    cl.insert(point);
  }
  return cl;
}

Just a reminder that this is not my original problem, which is best seen visually in the example I attached 🙈

rorystephenson commented 1 year ago

Hey @Robbendebiene a little update.

Firstly, I didn't realise the app you shared was custom-made to show the difference in clustering, sorry about that. It was very useful for visualising the difference once I fixed (locally) the bug that was causing it to throw an exception.

I've actually spent a LOT of time looking at this and it's quite complicated. The original insert implementation was more or less created for my own needs in an app that bulk loaded 99% of its markers and then had small numbers of inserts/removals. I just wanted a fast, working implementation and I didn't notice the big difference in clustering when inserting. Your demo app does a great job of demonstrating that when inserting, the clustering is not aggressive enough, which results in more clusters/points at low zooms.

The tricky part is that the supercluster algorithm is fast because it is greedy, and replicating that speed when inserting is hard. I have a version working locally which is much more aggressive in its clustering, but it is also way slower. For 10,000 randomly generated points, calling load() finishes in 0.3 seconds whilst inserting one by one takes around 30 seconds... which is obviously terrible. That can probably be improved a lot by adding an insertAll() method which allows bulk inserting; that's the next thing I want to try. I presume that would be useful for your use-case, i.e. I'm guessing you need insert because you already have some points loaded and want to insert more?

I can go in to more details as to why it's slow if you're interested.

rorystephenson commented 1 year ago

@Robbendebiene another update.

So I've added an insertAll method locally and the performance is significantly better whilst still generating a similar amount of clusters to loading. The clusters are not the same ones but that's expected due to how the algorithm works.

Here are the benchmarks:

Starting SuperclusterImmutable load 10,000 benchmark... took: 0:00:00.220195
Starting SuperclusterMutable load 10,000 benchmark... took: 0:00:00.281985
Starting SuperclusterMutable load 9000, insert 1000 benchmark... took: 0:00:25.870553
Starting SuperclusterMutable insertAll 10,000 benchmark... took: 0:00:00.402941
Starting SuperclusterMutable insertAll 10 x 1000 benchmark... took: 0:00:01.506750

So insertAll with 10,000 takes only 200ms longer than load. A more realistic comparison is insertAll doing insertions of 1000 points at a time and that takes 1500ms which is slower but still waaaaaaay better than the performance of inserting one-by-one (25 seconds).
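Assuming a hypothetical insertAll(List<T>) method as described above, the "1000 points at a time" pattern could be wrapped in a small batching helper like this sketch (the helper name is made up):

```dart
/// Splits [points] into chunks of at most [batchSize] and hands each chunk
/// to [insertAll] in a single call, so the re-clustering cost is paid per
/// batch rather than per point.
void insertInBatches<T>(
    List<T> points, int batchSize, void Function(List<T>) insertAll) {
  for (var i = 0; i < points.length; i += batchSize) {
    final end = i + batchSize < points.length ? i + batchSize : points.length;
    insertAll(points.sublist(i, end));
  }
}
```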

I need to do some more testing locally to be 100% sure that everything is correct and I want to change the removal implementation as I think it suffers from the same issue that the old insert suffered from (not creating enough clusters).

Will keep you updated.

Robbendebiene commented 1 year ago

Hi @rorystephenson, thanks for the extensive reply and all the time you put into it.

That can probably be improved a lot by adding an insertAll() method which allows bulk inserting, that's the next thing I want to try. I presume that would be useful for your use-case, i.e. I'm guessing you need to use insert because you already have some points loaded and want to insert more?

Yes, that's the exact reason.

I can go in to more details as to why it's slow if you're interested.

Unfortunately I only have a very basic understanding of R-trees and k-d trees, so I feel like I'm currently not able to grasp any deeper explanation. My naive approach would have been that there is a tree for each zoom level that contains the elements/clusters. Now when adding a new element one could go from top to bottom or vice versa. Something like: for each zoom level, check if the new element fits a cluster that's already present. If yes, put it into the cluster and go deeper; if no, create a new cluster / insert the element. But I guess there are a lot of problems I currently can't see due to my lack of deeper understanding. Especially the last part ("new cluster / insert the element") probably causes problems.

So insertAll with 10,000 takes only 200ms longer than load. A more realistic comparison is insertAll doing insertions of 1000 points at a time and that takes 1500ms which is slower but still waaaaaaay better than the performance of inserting one-by-one (25 seconds).

Sounds great, let me know if I can assist you in any way.

rorystephenson commented 1 year ago

Unfortunately I only have a very basic understanding of R-trees and k-d trees, so I feel like I'm currently not able to grasp any deeper explanation. My naive approach would have been that there is a tree for each zoom level that contains the elements/clusters. Now when adding a new element one could go from top to bottom or vice versa. Something like: for each zoom level, check if the new element fits a cluster that's already present. If yes, put it into the cluster and go deeper; if no, create a new cluster / insert the element. But I guess there are a lot of problems I currently can't see due to my lack of deeper understanding. Especially the last part ("new cluster / insert the element") probably causes problems.

Right, this was essentially what the old insertion did, but the problem is that inserting can actually impact clusters surprisingly far from the inserted point. I will try to explain it here, partly for myself when I forget and come back to this in the future :).

The supercluster algorithm is greedy; at each zoom it iterates through the elements as follows:

  1. Take the first point/cluster (element) and search for other nearby elements.
  2. If there are not enough nearby elements to form a cluster, just add the element at this zoom. If there are enough, form a cluster with a weighted center calculated as the average position of the elements, weighted by how many points each contains.
  3. Add that new cluster at this zoom.

Since the cluster is formed by searching around the element being iterated, it is possible that some of the other points in the cluster are within clustering range of other elements, but those elements are not added to the cluster due to the greedy nature of the algorithm (which is what makes it fast). The important thing to understand here is that some elements in a cluster may be within clustering range of other elements at their zoom.
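The greedy pass described above can be sketched in plain Dart. This is a 1-D toy for illustration only (the real library works on a 2-D spatial index), and all names in it are made up:

```dart
class Element {
  final double x; // 1-D position, standing in for projected coordinates
  final int numPoints; // how many original points this element represents
  Element(this.x, this.numPoints);
}

/// Greedily clusters [elements] as in the steps above: take each
/// not-yet-clustered element in iteration order, gather every unclustered
/// neighbour within [range], and emit either the lone element or a cluster
/// centred on the weighted average position.
List<Element> greedyCluster(List<Element> elements, double range) {
  final clustered = <Element>{};
  final result = <Element>[];
  for (final e in elements) {
    if (clustered.contains(e)) continue;
    // Step 1: search for nearby unclustered elements.
    final nearby = elements
        .where((o) =>
            o != e && !clustered.contains(o) && (o.x - e.x).abs() <= range)
        .toList();
    if (nearby.isEmpty) {
      // Not enough neighbours: keep the element as-is at this zoom.
      result.add(e);
    } else {
      // Steps 2/3: form a cluster at the weighted average position.
      final members = [e, ...nearby];
      clustered.addAll(members);
      final total = members.fold<int>(0, (s, m) => s + m.numPoints);
      final weightedX =
          members.fold<double>(0, (s, m) => s + m.x * m.numPoints) / total;
      result.add(Element(weightedX, total));
    }
  }
  return result;
}

void main() {
  // C (200m) iterated first grabs B (100m); A (0m) stays alone, even though
  // A and B are within the 110m range of each other.
  final clusters =
      greedyCluster([Element(200, 1), Element(100, 1), Element(0, 1)], 110);
  for (final c in clusters) {
    print('element at ${c.x} holding ${c.numPoints} point(s)');
  }
}
```

Feeding the same elements in the order A, B, C instead clusters A with B and leaves C alone, which is exactly the order sensitivity discussed in this thread.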

This is important when considering that clusters can actually split when an insertion is made. Imagine three points at zoom 10, A, B and C. A is on the left, B in the middle and C on the right. The points are 100m apart and at the given zoom the clustering range is 110m. When building the clusters at this zoom C is iterated first and forms a cluster with B, A remains on its own. Now you insert a new point, D, 100m to the right of C. D is iterated first and forms a cluster with C, leaving A and B on their own. The old insertion algorithm stopped here but we know that A and B are close enough to form a cluster, they didn't previously because B was in a cluster with C but that is no longer the case. The new algorithm handles this case and forms a new cluster with A and B.

In an unlikely scenario this can propagate further than just one cluster: imagine points A to Y, each 100m apart, where adding Z next to Y causes all of the clusters to change.

Sounds great, let me know if I can assist you in any way.

I have released 3.0.0-dev.1 if you want to give that a try!

Robbendebiene commented 1 year ago

Excellent explanation! Thanks for this.

I tested the new supercluster 3.0.0-dev.2 version and it seems to work as expected 🥳

rorystephenson commented 1 year ago

Awesome, I've now promoted it to 3.0.0. Thanks for the help!