pelias / api

HTTP API for Pelias Geocoder
http://pelias.io
MIT License
221 stars 162 forks source link

simplify deduplication logic #1607

Closed missinglink closed 2 years ago

missinglink commented 2 years ago

this PR greatly simplifies the deduplication logic.

The main difference is that previously we were focussing on keeping the superior rows and now we're focussing on removing the inferior rows, a subtle but important distinction!

In particular the new logic makes one positive functional change:

in the case where A dedupes B and B dedupes C, prior to this PR the C record may or may not remain in the results, depending on the order, due to B being present in the 'skip list'.

the unit test included demonstrates this ^ as it fails prior to this PR

missinglink commented 2 years ago

This PR improves the deduplication of https://pelias.github.io/compare/#/v1/autocomplete?text=berlin

missinglink commented 2 years ago

Good-to-go! I think this is a positive change which should help reduce some of the noisier deduplication cases that have been plaguing us for some time now 🎉

missinglink commented 2 years ago

housekeeping: the dedupe-without-skiplist branch can probably be deleted after merging this.

orangejulius commented 2 years ago

Cool, I really love the simplicity of this logic. Even if it's technically more computationally expensive (though I would imagine the effect is not measurable), it's much much easier to understand.

I do think the case you mentioned with transitive deduplication has prevented a lot of our deduplication checks from performing as we expect.

My guess is this is good to merge but let me try it out with some of the other recent changes I've been experimenting with, like #1609 and https://github.com/pelias/api/pull/1606.

orangejulius commented 2 years ago

Okay, after some fairly complex testing I can confirm this PR improves deduplication, or more precisely it allows some of the newer deduplication strategies we are experimenting with to be more effective.

Here's a summary of some results from testing with https://github.com/pelias/acceptance-tests/pull/557

branch Passing test count
master 2
A: simplified dedupe (this PR) 2
B: dedupe-placetype-in-name (https://github.com/pelias/api/pull/1371) 3
C: concordance-dedupe (https://github.com/pelias/api/pull/1606) 3
A + B 3
A + C 3
B + C 4
A + B + C 5

The interesting thing to note is that while this PR by itself does not change the test results, nor does it change the results when combined with either of our two other deduplication PRs, when all three PRs are combined, there is an improvement over not using this PR.

So that seems to confirm to me the idea that transitive deduplication is important.

One last thing to test is that this PR (as well as a combination of this and the other PRs) does not introduce regressions. That info will be coming soon, then I think we are good to go.