Closed missinglink closed 3 years ago
So, looking at the index_options documentation, this is a pretty simple change :)
Code-wise, it looks good; the admin_abbreviation definition being built on the admin one is especially nice.
What do you think next steps should be to test this?
As I recall, we care most about how this affects autocomplete queries for addresses, right? In that case there's probably no getting around a full planet build.
Alternatively, if we care mostly about queries like city, country_code (Amsterdam, NL vs Amsterdam, NLD for example), then maybe we can get away with a WOF-only build?
It should only have an effect on documents which have multiple terms indexed in an abbreviation field. AFAIK that doesn't happen anywhere at the moment, so I'm expecting it to be a no-op.
In the case where we do actually have multiple terms indexed in an abbreviation field it would change the scoring slightly, in a positive way.
Essentially it should be a NOOP
Ok, good news is that the changes in #472 do fix the country code issues we've been going after.
Bad news is they break the structured geocoding endpoint, since its queries run phrase matches across fields that include the abbreviation fields, like this:
{
  "multi_match": {
    "query": "United States",
    "type": "phrase",
    "fields": [
      "parent.dependency",
      "parent.dependency_a"
    ]
  }
}
So we have to find either some query or schema changes to handle that.
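To make the failure mode concrete, here is a toy sketch in Python. It assumes the change sets index_options on the abbreviation fields to a level that omits positions (such as docs). The ToyField class and its method names are made up for illustration and are not how Elasticsearch is implemented; the point is only that a phrase match needs token positions, while a plain term match does not.

```python
from collections import defaultdict


class ToyField:
    """Toy inverted index for a single field (illustrative only)."""

    def __init__(self, index_options="positions"):
        # "docs" keeps only document ids; "positions" also keeps token positions.
        self.index_options = index_options
        self.postings = defaultdict(dict)  # term -> {doc_id: [positions] or None}

    def index(self, doc_id, text):
        for position, term in enumerate(text.lower().split()):
            if self.index_options == "positions":
                self.postings[term].setdefault(doc_id, []).append(position)
            else:
                self.postings[term][doc_id] = None  # position data is dropped

    def match_term(self, term):
        # A term match only needs to know which documents contain the term.
        return set(self.postings.get(term.lower(), {}))

    def match_phrase(self, phrase):
        # A phrase match has to check that the terms are adjacent and in order.
        if self.index_options != "positions":
            raise ValueError("field was indexed without position data")
        terms = phrase.lower().split()
        if not terms:
            return set()
        hits = set()
        for doc_id, starts in self.postings.get(terms[0], {}).items():
            for start in starts:
                if all(
                    start + offset in self.postings.get(term, {}).get(doc_id, [])
                    for offset, term in enumerate(terms)
                ):
                    hits.add(doc_id)
        return hits


full = ToyField("positions")
full.index(1, "United States")

docs_only = ToyField("docs")
docs_only.index(1, "USA")
```

In this model full.match_phrase("United States") finds document 1 and docs_only.match_term("USA") still works, but docs_only.match_phrase(...) raises. That is the same shape of problem a phrase-type multi_match runs into when one of its fields has no position data.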
Here's a question: if we are using synonyms to handle the 2/3 letter country codes, do we need to disable the field length at all? My recollection is that if we tell elasticsearch to index the data "MEX", for example, and then there is a synonym "MEX,MX", the field length will still be 1, whereas if we indexed the data "MEX MX", that wouldn't be the case.
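A small Python sketch of that recollection, under the assumption that the synonym is applied at index time and that the similarity ignores overlap tokens (tokens with a position increment of 0) when computing field length, which is what the discount_overlaps similarity setting does by default. The analyze and field_length helpers are hypothetical stand-ins, not Elasticsearch or Lucene APIs.

```python
def analyze(text, synonyms=None):
    """Whitespace tokenizer with a toy index-time synonym expansion.

    Returns (term, position_increment) pairs.
    """
    synonyms = synonyms or {}
    tokens = []
    for term in text.split():
        tokens.append((term, 1))
        for alias in synonyms.get(term, []):
            tokens.append((alias, 0))  # stacked on the same position as the original
    return tokens


def field_length(tokens, discount_overlaps=True):
    """Count the tokens that contribute to length normalisation."""
    if discount_overlaps:
        # Overlap tokens (increment 0) are not counted.
        return sum(1 for _, increment in tokens if increment > 0)
    return len(tokens)


with_synonym = analyze("MEX", synonyms={"MEX": ["MX"]})
two_terms = analyze("MEX MX")
```

Here field_length(with_synonym) is 1 while field_length(two_terms) is 2, which matches the intuition above. It would still be worth confirming against a real index before relying on it.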
I'm going to abandon this PR. It seems that some queries I hadn't considered were using phrase type queries against these fields; as long as that functionality is required, this solution is unviable.
This PR changes the way parent.*_a fields are indexed so that the term frequencies are not stored. The effect of this change is that indexing multiple terms in the field (i.e. what we call 'aliases') will have no adverse effect on scoring.
Merging this work will allow us to follow up with an alpha3<>alpha2 country code mapping (for parent.country_a) without worrying about the additional tokens adversely affecting scoring.
related: