pelias / api

HTTP API for Pelias Geocoder
http://pelias.io
MIT License
221 stars 162 forks source link

deduplication - add isEquivalentIdentity() method #1580

Open missinglink opened 2 years ago

missinglink commented 2 years ago

This PR adds a new method to helper/diffPlaces.js called isEquivalentIdentity().

Detecting Equal Identity is trivial, if two records have the same GID they are the same thing... but there is also a case where record A has an equivalent identity to the parent of record B at the same layer.

An example is something like this, these two records are distinct:

A = { "source": "geonames", "source_id": 1, layer: "region" }
B = { "source": "whosonfirst", "source_id": 999, layer: "region" }

... however with this PR the following two records are considered equivalent:

A = { "source": "geonames", "source_id": 1, layer: "region",
  parent: {
    "region_id": ["999"],
    "region_source": ["whosonfirst"]
  }
}
B = { "source": "whosonfirst", "source_id": 999, layer: "region" }

This is particularly relevant for inter-source deduplication as, for example, a geonames 'locality' will have a WOF 'locality' in its parent hierarchy (from PIP). In that case we can dedupe the WOF record with the GN record containing the WOF record as a parent.

note: this is backwards compatible with indices prior to https://github.com/pelias/schema/pull/459 since an empty or absent *_source will be considered to have the value 'whosonfirst'.

missinglink commented 2 years ago

this PR replaces https://github.com/pelias/api/pull/1554 which was much more complicated, this method is both easier to implement/test and also more accurate than string checking the various name fields.

orangejulius commented 2 years ago

Cool, this looks pretty easy to read and understand. And it definitely looks like a sound approach.

We might want to add a test with some real world data to our deduplication tests to make sure we handle it, right? Also, do you have any compare links handy that show off queries that should be improved by this PR?

orangejulius commented 2 years ago

Ah, bad news. The acceptance tests show a bunch of regressions from this PR, looks like we need to restrict the comparison to records at the same layer.

Otherwise, for example, a city called Asia in the continent of Asia can replace the record for the continent:

/v1/search?text=Asia

Screen Shot 2021-11-12 at 10 19 50 AM
missinglink commented 2 years ago

hmm.. it should already be doing that, I'll have to pick this up again next week 🤔

https://github.com/pelias/api/pull/1580/files#diff-19d27b5d3dcbc72dfec62360afe0dbc154cc7ff03a30be725d3a8a6dee3df5afR181-R185

missinglink commented 2 years ago

added via rebase

diff --git a/helper/diffPlaces.js b/helper/diffPlaces.js
index 840f0203..d11ef0ba 100644
--- a/helper/diffPlaces.js
+++ b/helper/diffPlaces.js
@@ -168,6 +168,9 @@ const parentGID = (item, layer) => {
  */
 function isEquivalentIdentity(item1, item2) {

+  // Both items must be on the same layer
+  if (item1.layer !== item2.layer) { return false; }
+
   // Generate a GID value for each item
   const gid1 = GID(item1);
   const gid2 = GID(item2);