disable field norms

missinglink commented 3 years ago

this DRAFT PR isn't meant to be merged, I'm just curious as to what a planet build would look like with norms: false on all the fields.

it's been a while since we last looked at this in https://github.com/pelias/schema/pull/323

I suspect that since setting norms: false will disable 'field length', it will:

- fix the issue we have with aliases counting towards the field length and therefore scoring lower when more aliases exist
- at the same time, have a negative impact on exact matching queries where the shorter field length allowed them to score higher
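To make the two effects above concrete, here is a minimal sketch of the term-frequency part of BM25 with and without length normalisation. It is not Pelias or Lucene code: `bm25_tf`, `AVG_LEN` and the token counts are hypothetical, and treating a norms-less field as if it had average length is an assumption about how the similarity behaves when norms are disabled.

```python
K1 = 1.2   # BM25 term-frequency saturation (Lucene's default)
B = 0.75   # BM25 length-normalisation strength (Lucene's default)


def bm25_tf(freq, field_len, avg_len, norms=True):
    """Term-frequency component of BM25 for one term in one field.

    With norms=False the length ratio is fixed at 1.0, which is an
    assumption about how a field without norms is scored.
    """
    length_ratio = field_len / avg_len if norms else 1.0
    return freq * (K1 + 1) / (freq + K1 * (1 - B + B * length_ratio))


AVG_LEN = 4.0  # hypothetical average token count of the name field

# Hypothetical documents: the same place with and without aliases.
# Every alias adds tokens to the field, so the field gets "longer".
no_aliases = {"freq": 1, "field_len": 2}
with_aliases = {"freq": 1, "field_len": 8}

for norms in (True, False):
    a = bm25_tf(no_aliases["freq"], no_aliases["field_len"], AVG_LEN, norms)
    b = bm25_tf(with_aliases["freq"], with_aliases["field_len"], AVG_LEN, norms)
    print(f"norms={norms}: no aliases={a:.3f}, with aliases={b:.3f}")
```

With norms enabled the alias-heavy document scores lower for the same match; with norms disabled the two score the same, which also means a short exact match loses its advantage.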

the thing I'm curious about is how much effect the second point has in practice. there is actually an integration test which regresses as part of this commit, but I suspect that population / popularity scoring may, to some degree, resolve some of the exact matching issues.

my hope is that it shows that this could potentially be workable, although I'm not willing to bet on it 😆

see: http://makble.com/what-is-lucene-norms

missinglink commented 3 years ago

I was expecting the build size to be reduced since it's no longer storing the 1 byte per document for the norms. The difference is not significant compared to the rest of the index:

missinglink commented 3 years ago

Some examples of improvements: in both cases the more popular, yet wordier, names are now being scored higher than the exact matching or succinct names.

note: 'Angkor Wat Putt' is a mini-golf ;) it's actually got a popularity score of 6600, compared to the ticket office's 2200. this is not great, but it's not the fault of the similarity algo; we can fix that either in the data or the population calculation algo

More testing to come...

missinglink commented 3 years ago

So surprisingly the testing was fairly favourable. As expected, it had the positive effect of fixing the field length scoring discrepancy introduced by adding aliases, and it produced better sorting in many autocomplete cases with few regressions there.

For /v1/search and /v1/search/structured specifically I don't think it's necessarily all roses: the query /v1/search/structured?neighbourhood=Chelsea used to return Chelsea, London, England, United Kingdom first and is now returning Chelsea Heights, Atlantic City, NJ, USA first.

While this is kinda what I thought we wanted (because the USA result has a higher population), upon reflection I don't think this is the behaviour we want from the /v1/search* endpoints. I think for those we want to favour exact matches higher because the user asked for Chelsea not Chelsea%.
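The Chelsea case can be illustrated with a toy ranking. This is only a sketch under stated assumptions: the way `rank` multiplies the text score by a log of population is hypothetical and not how Pelias combines scores, the population figures are made-up placeholders, and a field without norms is assumed to be scored as if it had average length.

```python
import math

K1, B = 1.2, 0.75  # Lucene's default BM25 parameters


def text_score(field_len, avg_len, norms):
    # BM25 term-frequency component for a single matching token (freq = 1).
    ratio = field_len / avg_len if norms else 1.0
    return (K1 + 1) / (1 + K1 * (1 - B + B * ratio))


def rank(docs, norms, avg_len=2.0):
    # Hypothetical combination: text score scaled by log10 of population.
    def score(doc):
        return text_score(doc["tokens"], avg_len, norms) * math.log10(doc["population"])

    return [d["label"] for d in sorted(docs, key=score, reverse=True)]


# Token counts follow from the names; populations are made-up placeholders.
docs = [
    {"label": "Chelsea", "tokens": 1, "population": 40_000},
    {"label": "Chelsea Heights", "tokens": 2, "population": 50_000},
]

print(rank(docs, norms=True))   # shorter exact match ranks first
print(rank(docs, norms=False))  # text scores tie, population decides
```

In this toy setup the one-token name wins only while field length is part of the score; once it is removed, the text scores tie and the larger population decides the order.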

My current thesis:

"field length is an important tool for scoring exact matches better" but also "autocomplete by nature doesn't always favour exact matches and so maybe field length is less/not important there"

I've pushed a second commit which only sets norms: false on ngram fields, let's see what that looks like.
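For readers unfamiliar with the setting, `norms` is a per-field mapping parameter on Elasticsearch text fields. The helper below is a rough sketch of the two variants tried in this PR (all fields vs. ngram fields only). It is not the actual schema change: `disable_norms`, the field names and the analyzer names are all hypothetical.

```python
import copy


def disable_norms(properties, only_analyzers=None):
    """Return a copy of an Elasticsearch `properties` mapping with
    `norms: false` set on text fields.

    If only_analyzers is given, only text fields whose `analyzer`
    is in that set are changed.
    """
    result = copy.deepcopy(properties)

    def visit(props):
        for field in props.values():
            if field.get("type") == "text":
                analyzer = field.get("analyzer")
                if only_analyzers is None or analyzer in only_analyzers:
                    field["norms"] = False
            # Recurse into object fields and multi-fields.
            for key in ("properties", "fields"):
                if key in field:
                    visit(field[key])

    visit(result)
    return result


# Hypothetical mapping fragment; field and analyzer names are made up.
example = {
    "name": {
        "type": "object",
        "properties": {
            "default": {
                "type": "text",
                "analyzer": "example_full_token",
                "fields": {
                    "ngram": {"type": "text", "analyzer": "example_ngram"}
                },
            }
        },
    }
}

all_fields = disable_norms(example)
ngram_only = disable_norms(example, only_analyzers={"example_ngram"})
```

Selecting fields by analyzer name is just one way to pick out the ngram fields; a real schema may identify them differently.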

missinglink commented 3 years ago

This is one more screenshot of the dev build with norms=false on all fields, the query is /v1/autocomplete?text=statue of liberty:

missinglink commented 3 years ago

I put the newer build on dev (this is the build which only disabled norms on the ngram fields, not the other ones) and there's no noticeable difference from master.

This is pretty much what I was suspecting because the ngram indices are usually only used for the last token entered, so the 'damage has been done' already by that point.

pelias / schema

disable field norms #476