pelias / schema

elasticsearch schema files and tooling
MIT License
40 stars 76 forks source link

feat(iso2): add a country_a2 field for iso2 codes #469

Closed blackmad closed 3 years ago

blackmad commented 4 years ago

currently we only index iso3 codes in country_a, which is a problem in autocomplete for countries where the iso2 code isn't a prefix of the iso3, like China (CHN/CN) or Mexico (MEX/MX) meaning the query "beijing cn" fails in autocomplete.

It's not a super common real world case for users (since they should hit the result by earlier) but does come up a fair bit in testing and looks embarasssing.

iso2 queries seem to work fine in geocoding, I think because slop lets us ignore them?

This will be paired with a change to pelias-model to automatically add iso2s.

Open to other approaches.

missinglink commented 4 years ago

I'd prefer to avoid adding another field if possible since it adds complexity to the schema & query logic.

The disadvantage of adding more tokens to the existing country_a field is that it will affect the scoring, for example USA/US or NZD/NZ will match twice on the alpha2 tokens since they share a common prefix.

[edit] the above is not correct for keyword type fields.

Since that field doesn't really need, or greatly benefit from, the TF/IDF type scoring we can probably just disable scoring completely for that one field using either a custom similarity or more simply, using index-options?

orangejulius commented 4 years ago

Yeah, I didn't realize when I threw out the "two fields" idea that we were already using a keyword field for country_a. It should have the behavior we want even with multiple ISO code variations indexed into a single field. 🤞

missinglink commented 4 years ago

On the topic of using multiple fields to mitigate 'double scoring' (for future discussions), I'm not convinced that adding a second field can actually mitigate the issue.

We can use a multi_match query to search against both fields for the same term but I think cross_fields and also potentially best_fields will still incur a term frequency scoring penalty.

We can wrap a subquery in a constant_score but that would work equally well for both a single field as it would for two fields.

blackmad commented 4 years ago

Okay I can yank all these PRs down and go back to trying to add synonyms to the field in the index

On Mon, Nov 2, 2020 at 9:26 PM Peter Johnson notifications@github.com wrote:

@missinglink commented on this pull request.

In synonyms/custom_admin.txt https://github.com/pelias/schema/pull/469#discussion_r516396744:

@@ -23,3 +23,4 @@

foo => foo bar, baz

#

=============================================================================

+GB => UK

This should be "GB,UK" (with a comma) as per synonyms/country_a2/country_a2.txt.

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/pelias/schema/pull/469#pullrequestreview-522156274, or unsubscribe https://github.com/notifications/unsubscribe-auth/AADMZMG4TJJU5OPWUY7P4LTSN5S4NANCNFSM4S7PAR7A .

-- David Blackman creative technologist & wandering help me find my purpose http://purpose.blackmad.com

blackmad commented 4 years ago

@missinglink why do you say it's indexed as type keyword?

In our index at least, it looks like type=text

country_a: { type: "text", similarity: "peliasDefaultSimilarity", fields: { ngram: { type: "text", similarity: "peliasDefaultSimilarity", analyzer: "peliasIndexOneEdgeGram", search_analyzer: "peliasAdmin" } }, analyzer: "peliasAdmin" },

orangejulius commented 4 years ago

@blackmad looks like you're right:

https://github.com/pelias/schema/blob/384e3843082409781b6d8de0653e49dfc24e7cc8/mappings/document.js#L90

https://github.com/pelias/schema/blob/384e3843082409781b6d8de0653e49dfc24e7cc8/mappings/partial/admin.json#L2-L5

@missinglink maybe we should look at changing that? I don't think we've changed the mappings for the *_a fields in quite some time. Maybe we can do better now?

missinglink commented 4 years ago

That is so weird, I remember looking through the code and now looking through it again you're right, maybe I'm losing my marbles!

blackmad commented 4 years ago

we are all. covidbrain.

On Wed, Nov 18, 2020 at 8:07 PM Peter Johnson notifications@github.com wrote:

That is so weird, I remember looking through the code and now looking through it again you're right, maybe I'm losing my marbles!

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/pelias/schema/pull/469#issuecomment-730056635, or unsubscribe https://github.com/notifications/unsubscribe-auth/AADMZMAQMOZV7N2SY6XGBQTSQRVWRANCNFSM4S7PAR7A .

-- David Blackman creative technologist & wandering help me find my purpose http://purpose.blackmad.com