Open missinglink opened 3 years ago
I was expecting the build size to be reduced since it's not storing the 1 byte per document with the norms. It's not significant compared to the rest of the index:
Some examples of improvements: in both cases the more popular, yet wordier, names are now being scored higher than the exact-matching or succinct names.
Note: 'Angkor Wat Putt' is a mini-golf ;) It actually has a popularity score of 6600, compared to 2200 for the ticket office. This is not great, but it's not the fault of the similarity algo; we can fix that either in the data or in the popularity calculation algo.
More testing to come...
So, surprisingly, the testing was fairly favourable. As expected, it had the positive effect of fixing the field length scoring discrepancy introduced by adding aliases, and it produced better sorting in many autocomplete cases, with few regressions there.
For /v1/search and /v1/search/structured specifically, I don't think it's necessarily all roses. The query /v1/search/structured?neighbourhood=Chelsea used to return Chelsea, London, England, United Kingdom first and is now returning Chelsea Heights, Atlantic City, NJ, USA first.
While this is kinda what I thought we wanted (because the USA result has a higher population), upon reflection I don't think this is the behaviour we want from the /v1/search* endpoints. I think for those we want to favour exact matches higher, because the user asked for Chelsea, not Chelsea%.
My current thesis:
"field length is an important tool for scoring exact matches better" but also "autocomplete by nature doesn't always favour exact matches and so maybe field length is less/not important there"
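To make the thesis concrete, here is a minimal sketch of how field length enters a BM25-style score. It uses the textbook BM25 formula with default-looking `k1` and `b` values, and it models "norms disabled" by treating every field as average length. The field lengths, the average length, and the assumption that the index uses BM25 with these parameters are all illustrative, not taken from the actual build.

```python
def bm25_term_score(tf, doc_len, avg_len, idf=1.0, k1=1.2, b=0.75, use_norms=True):
    """Score one term in one field using the textbook BM25 formula.

    With use_norms=False the field length is ignored; this sketch models
    that by treating the field as exactly average length (an assumption).
    """
    length_ratio = doc_len / avg_len if use_norms else 1.0
    denominator = tf + k1 * (1.0 - b + b * length_ratio)
    return idf * tf * (k1 + 1.0) / denominator


# Hypothetical name fields and their lengths in tokens.
field_lengths = {"Chelsea": 1, "Chelsea Heights": 2}
avg_len = 2.0  # made-up average field length

# Both names contain the query term "chelsea" exactly once (tf=1).
with_norms = {
    name: bm25_term_score(1, length, avg_len)
    for name, length in field_lengths.items()
}
without_norms = {
    name: bm25_term_score(1, length, avg_len, use_norms=False)
    for name, length in field_lengths.items()
}
```

Under these assumptions the shorter, exact name wins the text score when norms are enabled, and the two names tie when norms are disabled, which leaves the ordering to other signals such as population or popularity.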
I've pushed a second commit which only sets norms: false on ngram fields, let's see what that looks like.
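For reference, the difference between the two variants can be sketched as a small helper that flips the `norms` mapping parameter on selected text fields. The field names and analyzer names below are hypothetical placeholders, not the real schema; only the `norms` parameter itself is a standard Elasticsearch mapping option for text fields.

```python
import copy


def disable_norms(properties, should_disable):
    """Return a copy of an Elasticsearch `properties` mapping with
    `norms` set to False on every text field accepted by should_disable."""
    updated = copy.deepcopy(properties)
    for field_name, field in updated.items():
        if field.get("type") == "text" and should_disable(field_name):
            field["norms"] = False
    return updated


# Hypothetical mapping fragment; names are for illustration only.
properties = {
    "name_full": {"type": "text", "analyzer": "full_analyzer"},
    "name_ngram": {"type": "text", "analyzer": "ngram_analyzer"},
    "population": {"type": "long"},
}

# Second commit: norms disabled on the ngram fields only.
ngram_only = disable_norms(properties, lambda name: name.endswith("_ngram"))

# First commit: norms disabled on every text field.
all_fields = disable_norms(properties, lambda name: True)
```

The helper copies the mapping rather than mutating it, so both variants can be built from the same base and compared side by side.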
This is one more screenshot of the dev build with norms=false on all fields, the query is /v1/autocomplete?text=statue of liberty:
I put the newer build on dev (this is the build which only disabled norms on the ngram fields, not the other ones) and there's no noticeable difference from master.
This is pretty much what I was suspecting because the ngram indices are usually only used for the last token entered, so the 'damage has been done' already by that point.
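A simplified sketch of why that is: in an autocomplete query only the token still being typed needs prefix (ngram) matching, while the earlier tokens are already complete. The function below is a hypothetical illustration, not the real query builder, and the trailing-space rule is an assumption made for the example.

```python
def split_autocomplete_input(text):
    """Split autocomplete input into (complete_tokens, incomplete_token).

    Assumption for this sketch: a trailing space means the user has
    finished the last word, so no token is treated as incomplete.
    """
    tokens = text.split()
    if not tokens:
        return [], None
    if text.endswith(" "):
        return tokens, None
    return tokens[:-1], tokens[-1]


# Only "liberty" would be matched against the ngram fields here;
# "statue" and "of" are scored on the non-ngram fields.
complete, incomplete = split_autocomplete_input("statue of liberty")
```

With this split, disabling norms on the ngram fields only changes how the final token is scored, which is consistent with seeing no noticeable difference from master.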
This DRAFT PR isn't meant to be merged, I'm just curious as to what a planet build would look like with norms: false on all the fields.
It's been a while since we last looked at this in https://github.com/pelias/schema/pull/323
I suspect that since setting norms: false will disable 'field length', it will:
The thing I'm curious about is how much effect the second point has in practice. There is actually an integration test which regresses as part of this commit, but I suspect that population/popularity scoring may, to some degree, resolve some of the exact matching issues.
My hope is that it shows that this could potentially be workable, although I'm not willing to bet on it 😆
see: http://makble.com/what-is-lucene-norms