pelias / schema

elasticsearch schema files and tooling
MIT License

british <-> american english synonyms #467

Closed blackmad closed 4 years ago

blackmad commented 4 years ago

This was motivated by someone searching for "Marina Theater" when the correct name of the POI is "Marina Theatre" - by adding aliases at index time this should fix such searches in both autocomplete & search.
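To illustrate the idea, index-time aliasing can be sketched as a token expansion step: every token with a known alternative spelling is indexed together with that alternative, so a query using either form matches. This is only a toy model under stated assumptions, not the PR's implementation, which relies on Elasticsearch analysis. The function names and the two-entry synonym list are hypothetical; the comma-separated line format mirrors the pairs quoted in this thread.

```python
def build_alias_map(synonym_lines):
    """Map every spelling to the full set of equivalent spellings."""
    alias_map = {}
    for line in synonym_lines:
        group = {word.strip().lower() for word in line.split(",") if word.strip()}
        for word in group:
            alias_map.setdefault(word, set()).update(group)
    return alias_map


def expand_tokens(tokens, alias_map):
    """Return, per token position, the sorted spellings to index at that position."""
    return [sorted(alias_map.get(t.lower(), {t.lower()})) for t in tokens]


# Hypothetical two-entry list, for illustration only.
ALIASES = build_alias_map(["theatre,theater", "centre,center"])

indexed = expand_tokens(["Marina", "Theatre"], ALIASES)
print(indexed)  # [['marina'], ['theater', 'theatre']]
```

Because both spellings are stored at the same position, a query for "Marina Theater" and one for "Marina Theatre" would hit the same document without any query-time rewriting.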

missinglink commented 4 years ago

The only real negative I can see with this is that it might have a performance penalty at index-time, so it's relatively safe to merge, although I can't see half of these words being used in place names?

Some thoughts:

  1. The list is very general purpose; many of the words are verbs and adjectives, which are less common in place names than nouns, e.g.:
"cannibalise,cannibalize",
"cannibalised,cannibalized",
"cannibalises,cannibalizes",
"cannibalising,cannibalizing",
  2. From what I could tell, most of the differences are a Levenshtein distance of one apart, usually due to a different vowel, two vowels (mostly in British English) in place of one in the US English equivalent, or interchangeable use of 'z' and 's' as in the example above.

If this is the case, then spelling correction would to some degree solve the same problem.

  3. There are some words which are totally different depending on the culture, such as "Garbage/Trash/Rubbish", "Sweets/Candies/Lollies" etc. I don't think we really want to have these as pure synonyms? I'm on the fence whether "Pub" is the same as a "Bar"? "Post Office/Mail Office"... not sure about all these?

[edit] actually I had another look at the list and none of my examples in 3. were actually listed!?
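The edit-distance observation can be checked with a small sketch. The function below is a standard Levenshtein implementation written for this illustration; the pairs other than the "cannibalise" one are hypothetical examples, not entries confirmed to be in the list.

```python
def levenshtein(a, b):
    """Classic edit distance: insertions, deletions and substitutions cost 1."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # delete ca
                current[j - 1] + 1,            # insert cb
                previous[j - 1] + (ca != cb),  # substitute (free if equal)
            ))
        previous = current
    return previous[-1]


for british, american in [
    ("cannibalise", "cannibalize"),
    ("colour", "color"),
    ("theatre", "theater"),
]:
    print(british, american, levenshtein(british, american))
```

The first two pairs come out at distance 1, but "theatre"/"theater" is distance 2 under plain Levenshtein because it is a transposition of two adjacent letters. As far as I know, Elasticsearch fuzzy queries count an adjacent transposition as a single edit by default, so spelling correction could plausibly still cover that case.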

missinglink commented 4 years ago

I like the idea but I'm not in love with the list generated from this npm module TBH

blackmad commented 4 years ago

I'll try to pare this list down by limiting it to words that are only small edit distances apart + actually appear in OSM
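A rough sketch of that filtering step, under assumptions: the distance used here also counts an adjacent transposition as one edit (so "theatre"/"theater" passes a threshold of 1), and a pair is kept when either spelling appears in the OSM word set. The function names, the threshold and the sample data are all hypothetical, not the script actually used for this PR.

```python
def osa_distance(a, b):
    """Edit distance where an adjacent transposition also costs 1."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i
    for j in range(cols):
        d[0][j] = j
    for i in range(1, rows):
        for j in range(1, cols):
            cost = a[i - 1] != b[j - 1]
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1, d[i - 1][j - 1] + cost)
            # Adjacent swap, e.g. "re" <-> "er".
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]


def pare_down(pairs, osm_words, max_distance=1):
    """Keep pairs that are close in spelling and seen in place names."""
    kept = []
    for left, right in pairs:
        close = osa_distance(left, right) <= max_distance
        seen = left in osm_words or right in osm_words  # assumption: either form counts
        if close and seen:
            kept.append((left, right))
    return kept


# Hypothetical inputs: a few candidate pairs and words seen in OSM names.
candidates = [("theatre", "theater"), ("cannibalise", "cannibalize"), ("centre", "center")]
osm_words = {"marina", "theatre", "shopping", "centre"}

print(pare_down(candidates, osm_words))
# [('theatre', 'theater'), ('centre', 'center')]
```

Here the "cannibalise" pair is dropped because neither spelling occurs in the sample place-name vocabulary, even though the two spellings are only one edit apart.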


orangejulius commented 4 years ago

Perfect, I agree with Peter that this list is almost certainly far too long. I'd be happier with a list of at most 500 words, and if we cross reference with OSM data then that's even better.

blackmad commented 4 years ago
missinglink commented 4 years ago

Ready to merge, check up to date first

orangejulius commented 4 years ago

We also need a Conventional Commits commit message so that this generates proper releases and changelog notes (maybe there is a GitHub Action to test for these, since we're not very good at remembering?)
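A check of this kind could look roughly like the sketch below. It only validates the header line against the common `type(scope): description` shape. The list of allowed types is an assumption based on widespread convention (the Conventional Commits spec itself only defines `feat` and `fix`), and the sample messages are hypothetical, not the actual commit on this PR.

```python
import re

# Assumption: a commonly used type list; projects are free to choose their own.
TYPES = ("feat", "fix", "docs", "chore", "refactor", "test",
         "perf", "build", "ci", "style", "revert")

# type, optional (scope), optional "!" for breaking changes, then ": description"
HEADER = re.compile(r"^(?:%s)(?:\([^()\r\n]+\))?!?: \S.*$" % "|".join(TYPES))


def is_conventional(message):
    """Return True if the first line of the commit message looks conventional."""
    first_line = message.splitlines()[0] if message.strip() else ""
    return HEADER.match(first_line) is not None


print(is_conventional("feat(synonyms): add british/american english synonyms"))  # True
print(is_conventional("british <-> american english synonyms"))                  # False
```

Run in CI against the PR title or the squashed commit message, a check like this would flag a missing prefix before merge rather than after a release fails to pick up the change.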

missinglink commented 4 years ago

I've added a feat() commit message via squash