missinglink opened this issue 4 years ago
These articles are great, especially in the second one:
Some examples of existing TF/IDF scoring I don't like (which I'll update over time):
| search term | indexed term | description |
|---|---|---|
| "mcdonalds" | "mcdonalds - mcdonalds cafe" | TF causes duplicate terms to rank higher |
| "ave" | "avenue" (ngrams) | IDF of ngrams saturated by prefixes of other terms |
| "new york" | "new york" (city), "new york" (state) | multi-field matches rank higher in boolean SHOULD mode & when using multi_match (in cross_fields mode) |
Hi @missinglink, we've found IDF to be useful in cases where documents contain words identifying their category. Let's say we have a query `hotel park` where the user is searching for a hotel called Park Inn. Some hotels might prefix their name with the word Hotel, while some might not. Assuming these are some of the documents retrieved, IDF down-weights the term `hotel` due to its commonality. The same applies to other categorical names such as Café.
Also, in our Bayesian optimization experiments, we've generally found higher values of `k1` (around 5) and very low values of `b` (around 0.05) to perform really well. We haven't run enough experiments to claim statistical significance, though.
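To illustrate the point about category words, here is a small sketch (not from the thread; the document frequencies are invented for illustration) of Lucene's BM25 IDF term, showing how a common category word like `hotel` contributes less to the score than a rarer word like `park`:

```python
import math

def bm25_idf(doc_freq, doc_count):
    # Lucene's BM25 IDF: log(1 + (N - n + 0.5) / (n + 0.5))
    return math.log(1 + (doc_count - doc_freq + 0.5) / (doc_freq + 0.5))

# Hypothetical document frequencies in an index of 1,000,000 place records.
for term, doc_freq in [("hotel", 50_000), ("park", 5_000)]:
    print(term, round(bm25_idf(doc_freq, 1_000_000), 2))
# hotel 3.0, park 5.3 -> the common category word carries less weight
```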
I merged https://github.com/pelias/schema/pull/430 a few months back so it should be available on all indices built this year. In that PR I demonstrated how to close an index, change the setting and re-open it, so it's now easier than ever to experiment with this. It does not require re-indexing the data 🚀
I did a little experimenting today and confirmed it's pretty easy to modify similarity algorithm parameters on a development cluster.

I also confirmed that setting `k1` to 0 removes any impact of the number of matches of a term on the score. This means it effectively eliminates issues such as https://github.com/pelias/pelias/issues/849.

However, it also removes any impact of a document's length on the score, and as a result it's probably too drastic a change for our needs. We should definitely investigate further.
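For anyone who wants to reproduce this, here is a minimal sketch of the close → update settings → re-open flow using the elasticsearch-py client. The index name and similarity name below are placeholders rather than what pelias/schema actually defines; see the PR above for the exact procedure.

```python
from elasticsearch import Elasticsearch  # elasticsearch-py 7.x-style client

es = Elasticsearch("http://localhost:9200")

# Similarity settings are static, so the index must be closed before changing them.
es.indices.close(index="pelias")
es.indices.put_settings(
    index="pelias",
    body={
        "index": {
            "similarity": {
                # placeholder similarity name; use the one defined in pelias/schema
                "peliasDefaultSimilarity": {"type": "BM25", "k1": 1.2, "b": 0.75}
            }
        }
    },
)
es.indices.open(index="pelias")
```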
A little more experimentation today confirms `k1` at 0 is not going to be a cure-all.
Here's an autocomplete query for United States:
Obviously, both documents match the tokens `united` and `states`:
"name": {
"default": [
"Mexico",
"México",
"Estados Unidos Mexicanos",
"Méjico",
"Nueva España",
"República Mejicana",
"el país azteca",
"Mexican",
"Mexican Republic",
"United Mexican States"
],
"name": {
"default": [
"United States",
"America",
"American",
"United States of America",
"Unitedstates",
"Unitedstatesofamerica"
],
| | frequency of 'united' | field length |
|---|---|---|
| Mexico | 1 | 20 |
| United States | 3 | 12 |
Elasticsearch reports the field length differently from how I counted it, but aside from that oddity, it's clear what's happening. Previously, the multiple matches and shorter field length would provide a healthy scoring advantage to USA. Now that both of those factors are gone, the two records match equally well. Population boosts are at the max of 20 for each record, so their total scores are identical as well.
Here's how Elasticsearch describes its calculation of term frequency:
"tf, computed as freq / (freq + k1 * (1 - b + b * dl / avgdl)) from:"
Setting `k1` to zero essentially always makes `tf` equal to 1, regardless of the length of matching fields or the number of times the desired token appears in the field.

Setting `b` to zero might be better: it would ignore the length of the field, but still prefer more frequent matches, since `freq / (freq + k1)` grows with `freq` while `k1` is held constant.
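To make that concrete, here is a quick sketch plugging the numbers from the table above into the formula quoted from the explain output (the `avgdl` of 14 is an arbitrary assumption for illustration):

```python
# tf = freq / (freq + k1 * (1 - b + b * dl / avgdl)), as quoted from Elasticsearch
def tf(freq, k1, b, dl, avgdl=14):
    return freq / (freq + k1 * (1 - b + b * dl / avgdl))

for label, freq, dl in [("Mexico", 1, 20), ("United States", 3, 12)]:
    print(label,
          "defaults:", round(tf(freq, k1=1.2, b=0.75, dl=dl), 3),
          "k1=0:",     round(tf(freq, k1=0.0, b=0.75, dl=dl), 3),
          "b=0:",      round(tf(freq, k1=1.2, b=0.0,  dl=dl), 3))
# With k1=0 both records score 1.0; with b=0 the length penalty disappears
# but the more frequent match still scores higher.
```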
However, based on all this it seems it's not worth exploring extreme values of `k1` and `b`; instead we should try a couple of more reasonable values and then move on to other optimizations. Personally, I don't feel any customization of these parameters is going to be a huge breakthrough.
Here's the full explain output for this query for further investigation: usa-explain.txt
Today I tested out reducing `k1` to 1 and `b` to 0.05.

The idea with this less extreme configuration is that it would mostly eliminate penalties due to a longer field length (https://github.com/pelias/pelias/issues/862), but not change scoring too drastically otherwise.
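As a rough sanity check (not from the thread; the dl/avgdl ratios are made up), here is how much the length normalization moves the term-frequency factor at the default `k1=1.2, b=0.75` versus the tested `k1=1, b=0.05`:

```python
def tf(freq, k1, b, dl_ratio):  # dl_ratio = dl / avgdl
    return freq / (freq + k1 * (1 - b + b * dl_ratio))

for dl_ratio in (0.5, 1.0, 2.0, 4.0):
    print(f"dl/avgdl={dl_ratio}:",
          "defaults:",    round(tf(1, k1=1.2, b=0.75, dl_ratio=dl_ratio), 3),
          "k1=1,b=0.05:", round(tf(1, k1=1.0, b=0.05, dl_ratio=dl_ratio), 3))
# At b=0.05 the factor barely changes with field length, so long fields full of
# alt names are no longer heavily penalized.
```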
Again, there were pretty mixed results. Looking at the Search POI tests, there were nice improvements for some airport queries:
(This is output from `diff`, so the way to read it is to treat lines starting with `-` as "before" and lines starting with `+` as "after" the change in scoring parameters. For example, because there is more green on the `+` line for the "Phoenix international airport" query, the correct result comes up much sooner.)
However, some very basic autocomplete queries broke, most notably San Francisco:
Looking into the `explain` output for this query, things are quite complicated, and there are lots of related issues we should improve: San Francisco should really get a popularity boost, the alt names from Who's on First don't seem ideal, we should do a better job of de-duplicating similar alt names, and the query generated by the API is not that great. It looks like the Pelias parser is working against us too:
"parsed_text": {
"subject": "san",
"locality": "san",
"admin": "franci"
},
However, through all of that, the underlying issue of scoring a single field with multiple alt names remains.
Here are the `name.default` values for the South San Francisco and San Francisco records:
"South San Francisco",
"S S Fran",
"S S Francisco",
"S S. Fran",
"S S. Francisco",
"S San Fran",
"S San Francisco",
"South S Francisco",
"South S. Francisco",
"South San Francis",
"Southsanfrancisco",
"Ssf"
"San Francisco",
"S Francisco",
"S. Francisco",
"SFO",
"Sanfran",
"Sanfrancisco",
"Frisco"
There are a lot of issues with both, but in this case it looks like the main problem is that South San Francisco matches the substring `francis` 7 times, while San Francisco matches it only 3 times.
Fundamentally, this exposes the tradeoff between solving https://github.com/pelias/openstreetmap/issues/507 and https://github.com/pelias/pelias/issues/862. If we penalize records with long field lengths, we lose some important results from records with lots of valid alt names. If we don't, we allow unimportant records that happen to have near-identical alt names to be boosted far too high.
One possibility worth mentioning is to create a custom `similarity` (in Java) which adjusts the `lengthNorm` calculation.

Disabling/enabling norms is already possible via `norms: false`, but we could potentially improve on that by inspecting the `FieldInvertState` passed to `lengthNorm()` and modifying how the norm is calculated.

For example, the `public float lengthNorm(FieldInvertState state)` function could consider only the longest alias in the field by looking for the position-increment-gap holes in the positions. So instead of returning the 'number of words indexed in the field for this doc', it would return the 'longest number of words with consecutive positions indexed in the field for this doc'.
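Lucene's `FieldInvertState` doesn't hand `lengthNorm()` the raw position list, so actually wiring this up would take more work than shown here, but the computation itself is simple. A conceptual sketch (Python for illustration, positions invented) of the 'longest number of words with consecutive positions' idea:

```python
def longest_alias_length(positions):
    # Length of the longest run of consecutive token positions. The
    # position_increment_gap between field values (100 by default in
    # Elasticsearch) creates the "holes" that separate aliases.
    if not positions:
        return 0
    longest = run = 1
    for prev, cur in zip(positions, positions[1:]):
        run = run + 1 if cur == prev + 1 else 1
        longest = max(longest, run)
    return longest

# e.g. "San Francisco" | gap | "S Francisco" | gap | "SFO"
print(longest_alias_length([0, 1, 102, 103, 205]))  # 2
```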
For years now we've been fighting the TF/IDF algorithm, and more recently we've changed to the BM25 similarity algo, which is much better for short texts like ours but still not perfect. There is a really great article here which talks about the caveats of scoring short title fields.

The cool thing about BM25 (and other similarity algos) is that they have some tunable parameters, albeit considered 'expert settings'. One setting that interests me in particular is the `k1` value, which "Controls non-linear term frequency normalization (saturation)".

The default settings for BM25 are `k1=1.2` and `b=0.75`, which are really nice settings for general use of Elasticsearch; they work well for short fields like titles as well as for large fields like indexing a whole chapter of a book. For geocoding specifically we almost exclusively deal with short strings (<50 chars). I also personally feel that term frequencies are much less relevant for geocoding because they can cause issues like this.

I'd like to open this up to @pelias/contributors to discuss introducing our own custom similarity configuration (or multiple if required). In particular, I would like to investigate the effects of setting `k1=0` (or very, very low). Thoughts?
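For reference, a hedged sketch of what a custom similarity could look like in the index settings; the names and values below are illustrative, not what pelias/schema ships:

```python
# Passed to indices.create (or put_settings on a closed index); fields then opt
# in via "similarity": "peliasNameSimilarity" in their mapping.
custom_similarity_settings = {
    "index": {
        "similarity": {
            "peliasNameSimilarity": {
                "type": "BM25",
                "k1": 0.0,   # the experiment proposed above: term frequency no longer affects tf
                "b": 0.75,
            }
        }
    }
}
```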