disable term frequencies for admin abbreviation fields

pelias / schema

elasticsearch schema files and tooling

MIT License


disable term frequencies for admin abbreviation fields #471

Closed missinglink closed 3 years ago

missinglink commented 3 years ago

This PR changes the way parent.*_a fields are indexed so that term frequencies are not stored. The effect of this change is that indexing multiple terms in the field (i.e. what we call 'aliases') will have no adverse effect on scoring.

Merging this work will allow us to follow up with an alpha3<>alpha2 country code mapping (for parent.country_a) without worrying about the additional tokens adversely affecting scoring.
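For readers unfamiliar with the setting, here is a minimal Python sketch of the kind of mapping change described above. The base field options and the analyzer name are placeholders, not the actual Pelias schema; only the `index_options` parameter and its `docs` value are real Elasticsearch mapping settings.

```python
import json

# Hypothetical base definition for an admin field. The real Pelias
# mapping uses its own analyzers and options; these are placeholders.
admin = {
    "type": "text",
    "analyzer": "example_admin_analyzer",  # assumed name, for illustration only
}

# The abbreviation definition reuses the base one and overrides a single
# option. "docs" indexes document numbers only: no term frequencies,
# no positions, no offsets.
admin_abbreviation = {**admin, "index_options": "docs"}

print(json.dumps(admin_abbreviation, indent=2))
```

Building one definition on top of the other keeps the two in sync: any later change to the base definition carries over to the abbreviation fields automatically.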

orangejulius commented 3 years ago

So, looking at the index_options documentation, this is a pretty simple change :)

Code-wise, it looks good; the admin_abbreviation definition being built on the admin one is especially nice.

What do you think next steps should be to test this?

As I recall, we care most about how this affects autocomplete queries for addresses, right? In that case there's probably no getting around a full planet build.

Alternatively if we care mostly about queries like city, country_code (Amsterdam, NL vs Amsterdam, NLD for example), then maybe we can get away with a WOF only build?

missinglink commented 3 years ago

It should only have an effect on documents which have multiple terms indexed in an abbreviation field. AFAIK that doesn't happen anywhere at the moment, so I'm expecting it to be a no-op.

In the case where we do actually have multiple terms indexed in an abbreviation field it would change the scoring slightly, in a positive way.

Essentially it should be a no-op.

orangejulius commented 3 years ago

Ok, good news is that the changes in #472 do fix the country code issues we've been going after.

Bad news is that they break the structured geocoding endpoint, since it runs phrase queries across fields that include the abbreviation fields, like this:

{
    "multi_match": {
        "query": "United States",
        "type": "phrase",
        "fields": [
            "parent.dependency",
            "parent.dependency_a"
        ]
    }
}

So we have to find either some query or schema changes to handle that.
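To make the conflict concrete, here is a toy Python sketch (not Elasticsearch code) of why a phrase query needs more than a docs-only index. A phrase match has to check that terms occur at consecutive positions, and `index_options: docs` does not store positions. The function names are invented for this illustration.

```python
from collections import defaultdict

def index_with_positions(tokens):
    """Toy postings for one document: term -> list of token positions."""
    postings = defaultdict(list)
    for position, term in enumerate(tokens):
        postings[term].append(position)
    return dict(postings)

def index_docs_only(tokens):
    """Toy equivalent of index_options 'docs': only which terms occur."""
    return set(tokens)

def phrase_match(postings, phrase):
    """True if the phrase terms occur at consecutive positions."""
    if not phrase:
        return False
    first, rest = phrase[0], phrase[1:]
    return any(
        all(start + i + 1 in postings.get(term, []) for i, term in enumerate(rest))
        for start in postings.get(first, [])
    )

tokens = ["united", "states"]
with_positions = index_with_positions(tokens)
docs_only = index_docs_only(tokens)

print(phrase_match(with_positions, ["united", "states"]))  # True
print(phrase_match(with_positions, ["states", "united"]))  # False

# The docs-only structure can answer "does the term occur?" but carries
# no order information, so there is nothing to run phrase_match against.
print("united" in docs_only and "states" in docs_only)  # True
```

In this simplified model, the docs-only index cannot tell "united states" apart from "states united", which is exactly the information a phrase query depends on.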

Here's a question: if we are using synonyms to handle the 2/3 letter country codes, do we need to disable the field length at all? My recollection is that if we tell elasticsearch to index the data "MEX", for example, and there is a synonym "MEX,MX", the field length will still be 1, whereas if we indexed the data "MEX MX", that wouldn't be the case.
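The recollection above can be sketched with a toy analyzer in Python. This is a simplified model, not Lucene itself: it assumes synonyms are emitted at the same position as the original token (position increment 0) and that such stacked tokens are not counted towards field length. The synonym table and function names are hypothetical.

```python
# Hypothetical synonym table; the thread only mentions "MEX,MX".
SYNONYMS = {"mex": ["mx"]}

def analyze_with_synonyms(text):
    """Emit (term, position_increment) pairs; synonyms share a position."""
    tokens = []
    for term in text.lower().split():
        tokens.append((term, 1))
        for synonym in SYNONYMS.get(term, []):
            tokens.append((synonym, 0))  # stacked on the same position
    return tokens

def field_length(tokens):
    """Count positions, ignoring stacked (increment 0) tokens."""
    return sum(1 for _, increment in tokens if increment > 0)

# Indexing "MEX" with a synonym rule: two tokens, one position.
print(field_length(analyze_with_synonyms("MEX")))     # 1

# Indexing "MEX MX" as literal data: the second code takes its own position.
print(field_length(analyze_with_synonyms("MEX MX")))  # 2
```

Under those assumptions, the synonym approach keeps the field length at 1 while still making both codes searchable, which is the behaviour the question hinges on.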

missinglink commented 3 years ago

I'm going to abandon this PR. It seems that some queries I hadn't considered were using phrase-type queries against these fields; as long as that functionality is required, this solution is unviable.