pelias / schema

elasticsearch schema files and tooling
MIT License

disable field norms #476

Open missinglink opened 3 years ago

missinglink commented 3 years ago

this DRAFT PR isn't meant to be merged, I'm just curious as to what a planet build would look like with norms: false on all the fields.
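For readers unfamiliar with the setting, here is a minimal sketch of what "norms: false on all the fields" could look like when applied to an Elasticsearch `properties` mapping. The helper `disable_norms` and the example mapping (including the analyzer name) are hypothetical and are not taken from the Pelias schema; only the `norms` mapping parameter itself is a real Elasticsearch option.

```python
from copy import deepcopy


def disable_norms(properties):
    """Return a copy of an Elasticsearch `properties` mapping with
    `norms: false` set on every `text` field (hypothetical helper)."""
    result = deepcopy(properties)

    def walk(props):
        for field in props.values():
            if field.get("type") == "text":
                field["norms"] = False
            # multi-fields live under "fields", object fields under "properties"
            walk(field.get("fields", {}))
            walk(field.get("properties", {}))

    walk(result)
    return result


# Hypothetical mapping fragment, NOT the real Pelias schema.
example = {
    "name": {
        "properties": {
            "default": {"type": "text", "analyzer": "example_ngram_analyzer"},
        }
    },
    "layer": {"type": "keyword"},
}

updated = disable_norms(example)
```

The input mapping is left untouched, so the original and modified versions can be compared side by side.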

it's been a while since we last looked at this in https://github.com/pelias/schema/pull/323

I suspect that since setting norms: false will disable 'field length', it will:

the thing I'm curious about is how much effect the second point has in practice. there is actually an integration test which regresses as part of this commit, but I suspect that population / popularity scoring may, to some degree, resolve some of the exact matching issues.

my hope is that it shows that this could potentially be workable, although I'm not willing to bet on it 😆

see: http://makble.com/what-is-lucene-norms

missinglink commented 3 years ago

I was expecting the build size to be reduced, since the index no longer stores the 1 byte per document for the norms. The difference is not significant compared to the rest of the index:

Screenshot 2021-02-24 at 09 33 13

missinglink commented 3 years ago

Some examples of improvements. In both cases the more popular, yet wordier, names are now being scored higher than the exact matching or succinct names.

Screenshot 2021-02-24 at 10 14 21 Screenshot 2021-02-24 at 10 13 31

note: 'Angkor Wat Putt' is a mini-golf course ;) It actually has a popularity score of 6600, compared to 2200 for the ticket office. This is not great, but it's not the fault of the similarity algo; we can fix that either in the data or in the population calculation algo.

More testing to come...

missinglink commented 3 years ago

So, surprisingly, the testing was fairly favourable. As expected, it had the positive effect of fixing the field length scoring discrepancy introduced by adding aliases, and it produced better sorting in many autocomplete cases, with few regressions there.

For /v1/search and /v1/search/structured specifically, I don't think it's necessarily all roses. The query /v1/search/structured?neighbourhood=Chelsea used to return Chelsea, London, England, United Kingdom first and now returns Chelsea Heights, Atlantic City, NJ, USA first.

This is kinda what I thought we wanted (because the USA result has a higher population), but upon reflection I don't think this is the behaviour we want from the /v1/search* endpoints. I think for those we want to score exact matches higher, because the user asked for Chelsea, not Chelsea%.

My current thesis:

"field length is an important tool for scoring exact matches better" but also "autocomplete by nature doesn't always favour exact matches and so maybe field length is less/not important there"
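To make the role of field length concrete, here is a minimal sketch of the term-frequency part of textbook BM25, assuming the Elasticsearch default parameters k1=1.2 and b=0.75. It further assumes that disabling norms behaves like dropping the length term (equivalent to b=0). The IDF factor is omitted because it is the same for both documents, and the token counts and average length are hypothetical, chosen to mirror the Chelsea example.

```python
def bm25_tf(freq, field_len, avg_len, k1=1.2, b=0.75, norms=True):
    """Term-frequency component of textbook BM25 (IDF omitted).

    With norms=False the field length is ignored, which is modelled
    here as b = 0 (an assumption about how disabled norms behave).
    """
    if not norms:
        b = 0.0
    length_norm = 1.0 - b + b * (field_len / avg_len)
    return freq * (k1 + 1.0) / (freq + k1 * length_norm)


avg_len = 2.0  # hypothetical average field length, in tokens

# "Chelsea" (1 token) vs "Chelsea Heights" (2 tokens), query term "chelsea"
exact_with_norms = bm25_tf(1, 1, avg_len)
wordy_with_norms = bm25_tf(1, 2, avg_len)

exact_without_norms = bm25_tf(1, 1, avg_len, norms=False)
wordy_without_norms = bm25_tf(1, 2, avg_len, norms=False)
```

With norms enabled the shorter, exact name scores higher; without them the two text scores tie, so any secondary signal such as population or popularity decides the order.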

I've pushed a second commit which only sets norms: false on ngram fields, let's see what that looks like.

missinglink commented 3 years ago

This is one more screenshot of the dev build with norms=false on all fields, the query is /v1/autocomplete?text=statue of liberty:

Screenshot 2021-02-25 at 22 55 35

missinglink commented 3 years ago

I put the newer build on dev (this is the build which only disabled norms on the ngram fields, not the other ones) and there's no noticeable difference from master.

This is pretty much what I was suspecting because the ngram indices are usually only used for the last token entered, so the 'damage has been done' already by that point.
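A simplified sketch of the idea that only the last token is matched against the ngram fields: the input is split into completed tokens and a trailing, possibly partial, token. The function `split_autocomplete_tokens` and its whitespace rule are assumptions made for illustration, not the actual Pelias tokenizer.

```python
def split_autocomplete_tokens(text):
    """Split autocomplete input into (complete, incomplete) tokens.

    Assumption: every token except the last is fully typed; the last
    token is also treated as complete if the input ends with whitespace.
    """
    tokens = text.split()
    if not tokens or text[-1].isspace():
        return tokens, []
    return tokens[:-1], tokens[-1:]


# Completed tokens would be matched as whole words; only the trailing
# partial token would be matched against the ngram (prefix) fields.
complete, incomplete = split_autocomplete_tokens("statue of liberty")
```

Under this assumption, most of the scoring for a multi-word query comes from the completed tokens, which would be consistent with the ngram-only change making little visible difference.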