pelias / api

HTTP API for Pelias Geocoder
http://pelias.io
MIT License
219 stars 162 forks source link

improving inter-layer deduplication #1557

Open missinglink opened 3 years ago

missinglink commented 3 years ago

The existing helper/diffPlaces::isLayerDifferent() function provides the ability to deduplicate between disparate layers in certain conditions.

It provides an Array called $layerPreferences which indicates which layers can be considered synonymous for deduplication purposes.

I think this this array may be causing a lot of the confusion when working with the deduplication code and potentially creating errors, the reason for this is that the list contains 9 terms and considers them all to synonymous with each other, so neighbourhood and empire can be deduped, for example, which seems wrong to me.

The dedupe.js logic is pretty complicated, once it finds a set of 'potential matches' then it selects a 'master' record and, considering the others as dupes, adds them to a skip list which excludes them from results.

Since the layerPreferences matching is quite broad, many places are added to the skip list which shouldn't really be there!

I had a quick play around with the code but didn't settle on a solution yet, I think the main improvement would be to have some sort of layer equivalency matrix which allows us to be more specific about exactly which layers on the layerPreferences lists can be combined which which other ones.

Eg:

diff --git a/helper/diffPlaces.js b/helper/diffPlaces.js
index 0094f212..b9804063 100644
--- a/helper/diffPlaces.js
+++ b/helper/diffPlaces.js
@@ -20,6 +20,25 @@ const layerPreferences = [
   'empire'
 ];

+const layerEquivalencyMatrix = [
+  [0,1,1,1,1,1,1,1,1],
+  [1,0,1,1,1,1,1,1,1],
+  [1,1,0,1,1,1,1,1,1],
+  [1,1,1,0,1,1,1,1,1],
+  [1,1,1,1,0,1,1,1,1],
+  [1,1,1,1,1,0,1,1,1],
+  [1,1,1,1,1,1,0,1,1],
+  [1,1,1,1,1,1,1,0,1],
+  [1,1,1,1,1,1,1,1,0]
+];
+
+const layersAreEquivalent = (layer1, layer2) => {
+  return _.get(layerEquivalencyMatrix, [
+    _.indexOf(layerPreferences, layer1),
+    _.indexOf(layerPreferences, layer2)
+  ]) === 1
+}
+
 /**
  * Compare the layer properties if they exist.
  * Returns false if the objects are the same, else true.
@@ -31,8 +50,8 @@ function isLayerDifferent(item1, item2){
         ( item2.layer === 'venue' || !_.includes( canonicalLayers, item2.layer ) ) ){
       return false;
     }
-    // consider some layers to be synonymous
-    if( _.includes( layerPreferences, item1.layer ) && _.includes( layerPreferences, item2.layer ) ){
+    // consider some layers to be equivalent
+    if (layersAreEquivalent(item1.layer, item2.layer)){
       return false;
     }
     return true;
orangejulius commented 3 years ago

Agreed, this would definitely be helpful. I think the list of "equivalent" layers has slowly grown well beyond what's reasonable.

I know there are lots of inter-layer deduplication cases to consider, but hopefully we could considerably reduce the scope of inter-layer deduping with a change like you proposed. The "widest" required deduping I can recall is deduping Luxembourg across locality, localadmin, region, and country, which is considerably more constrained than what we technically now allow for.

In terms of implementation, I'd just request we don't actually have a literal matrix of 0's and 1's (hopefully that doesn't need to be said 😆).

Maybe we keep things simple with an array of layer pairs that are equivalent, so something like this:

const equivalent_layers = [
  [ 'neighbourhood', 'locality'],
  [ 'locality', 'localadmin' ],
  // etc
];

We'd have to ensure our code handles things in either order (A locality is equivalent to a localadmin, and a localadmin is equivalent to a locality), I imagine, unless there's a reason we don't want that (I can't think of one).