pelias / schema

elasticsearch schema files and tooling

discuss: similarity algorithms #408

Open · missinglink opened this issue 4 years ago

missinglink commented 4 years ago

For years now we've been fighting the TF/IDF algorithm, and more recently we've changed to the BM25 similarity algorithm, which is much better for short texts like ours, but it's still not perfect.

There is a really great article here which talks about the caveats of scoring short title fields.

The cool thing about BM25 (and other similarity algos) is that they have some tunable parameters, albeit considered 'expert settings'.

One setting that interests me in particular is the k1 value, which "Controls non-linear term frequency normalization (saturation)".

The default settings for BM25 are k1=1.2 and b=0.75, which are really nice settings for general use of Elasticsearch: they work well for short fields like titles as well as for large fields like a whole chapter of a book.

For geocoding specifically we almost exclusively deal with short strings (<50 chars). I also personally feel that Term Frequencies are much less relevant for geocoding because they can cause issues like this.

I'd like to open this up to @pelias/contributors to discuss introducing our own custom similarity configuration (or multiple if required). In particular, I would like to investigate the effects of setting k1=0 (or very very low).
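For anyone unfamiliar with the mechanics: a custom similarity is declared in the index settings and then referenced from a field mapping. A minimal sketch with k1=0 (the similarity and field names here are illustrative, not the actual pelias schema):

    {
      "settings": {
        "index": {
          "similarity": {
            "custom_bm25": {
              "type": "BM25",
              "k1": 0,
              "b": 0.75
            }
          }
        }
      },
      "mappings": {
        "properties": {
          "name": {
            "type": "text",
            "similarity": "custom_bm25"
          }
        }
      }
    }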

Thoughts?

missinglink commented 4 years ago

These articles are great, especially the second one:

missinglink commented 4 years ago

Some examples of existing TF/IDF scoring I don't like (which I'll update over time):

search term    indexed term                               description
"mcdonalds"    "mcdonalds - mcdonalds cafe"               TF causes duplicate terms to rank higher
"ave"          "avenue" (ngrams)                          IDF of ngrams saturated by prefixes of other terms
"new york"     "new york" (city), "new york" (state)      multi-field matches rank higher in boolean SHOULD mode & using multi_match (in cross_fields mode)
shuaibiyy commented 4 years ago

Hi @missinglink, we've found IDF to be useful in cases where documents contain words identifying their category. Let's say we have a query hotel park where the user is searching for a hotel called Park Inn. Some hotels might prefix their name with the word Hotel, while some might not. Assuming these are some of the documents retrieved:

shuaibiyy commented 4 years ago

Also, in our bayesian optimization experiments, we've generally found higher values of k1 (around 5) and very low values of b (around 0.05) to perform really well. We haven't run enough experiments to claim statistical significance though.

missinglink commented 4 years ago

I merged https://github.com/pelias/schema/pull/430 a few months back so it should be available on all indices built this year. In that PR I demonstrated how to close an index, change the setting and re-open it, so it's now easier than ever to experiment with this. It does not require re-indexing the data 🚀
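The sequence looks roughly like this (index name illustrative); similarity settings aren't dynamic, which is why the index has to be closed first:

    POST /pelias/_close

    PUT /pelias/_settings
    {
      "index": {
        "similarity": {
          "custom_bm25": {
            "type": "BM25",
            "k1": 0,
            "b": 0.75
          }
        }
      }
    }

    POST /pelias/_open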

orangejulius commented 4 years ago

I did a little experimenting today and confirmed it's pretty easy to modify similarity algorithm parameters on a development cluster.

I also confirmed that setting k1 to 0 removes any impact of the number of matches of a term on the score. This means it effectively eliminates issues such as https://github.com/pelias/pelias/issues/849.

However, it also removes any impact of a document's length on the score, and as a result it's probably too drastic a change for our needs. We should definitely investigate further.

orangejulius commented 4 years ago

A little more experimentation today confirms that k1 at 0 is not going to be a cure-all.

Here's an autocomplete query for United States: [screenshot]

Some stats from behind the scenes of the scoring:

Obviously, both documents match the tokens united and states:

Mexico

          "name": {
            "default": [
              "Mexico",
              "México",
              "Estados Unidos Mexicanos",
              "Méjico",
              "Nueva España",
              "República Mejicana",
              "el país azteca",
              "Mexican",
              "Mexican Republic",
              "United Mexican States"
            ],

USA

     "name": {
            "default": [
              "United States",
              "America",
              "American",
              "United States of America",
              "Unitedstates",
              "Unitedstatesofamerica"
            ],

Relevant scoring values

                  frequency of 'united'    field length
Mexico            1                        20
United States     3                        12

Elasticsearch reports the field length differently from how I counted it, but aside from that oddity, it's clear what's happening. Previously, the multiple matches and shorter field length would provide a healthy scoring advantage to USA. Now that both of those are gone, the two records match equally well. Population boosts are at the max of 20 for each record, so their total scores are identical as well.

Here's how Elasticsearch describes its calculation of term frequency:

"tf, computed as freq / (freq + k1 * (1 - b + b * dl / avgdl)) from:"

Setting k1 to zero essentially always makes tf 1, regardless of the length of matching fields or the number of times the desired token is in the field.

Setting b to zero might be better: it would ignore the length of the field, but prefer more frequent matches: freq / (freq + k1) will grow with freq as k1 is held constant.
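To make that concrete, here is the tf term evaluated for the two records above, using the freq and dl values from the table and an assumed avgdl of 10 (the real average field length isn't shown in the explain output):

    defaults (k1=1.2, b=0.75, assumed avgdl=10):
      Mexico:        tf = 1 / (1 + 1.2 * (0.25 + 0.75 * 20/10)) = 1 / 3.10 ≈ 0.32
      United States: tf = 3 / (3 + 1.2 * (0.25 + 0.75 * 12/10)) = 3 / 4.38 ≈ 0.68

    k1=0 (frequency and field length both drop out):
      Mexico:        tf = 1 / 1 = 1.00
      United States: tf = 3 / 3 = 1.00

    b=0, k1=1.2 (field length drops out, frequency still counts):
      Mexico:        tf = 1 / (1 + 1.2) ≈ 0.45
      United States: tf = 3 / (3 + 1.2) ≈ 0.71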

However, based on all this, it seems it's not worth exploring extreme values of k1 and b; instead we should try a couple of more reasonable values and then move on to other optimizations. Personally, I don't think any customization of these parameters is going to be a huge breakthrough.

Here's the full explain output for this query for further investigation: usa-explain.txt

orangejulius commented 4 years ago

Today I tested out reducing k1 to 1 and b to 0.05. The idea behind this less extreme configuration is that it would mostly eliminate penalties due to a longer field length (https://github.com/pelias/pelias/issues/862), but not change scoring too drastically otherwise.

Again, there were pretty mixed results. Looking at the Search POI tests, there were nice improvements for some airport queries: [screenshot] (This is diff output: lines starting with - show scores before the change in scoring parameters and lines starting with + show scores after. For example, the extra green on the + line for the "Phoenix international airport" query means the correct result comes up much sooner.)

However, some very basic autocomplete queries broke, most notably San Francisco: [screenshot]

Looking into the explain output for this query, things are quite complicated, and there are lots of issues we should improve related to this query: San Francisco should really get a popularity boost, the alt names from Who's on First don't seem ideal, we should do a better job of de-duplicating similar alt names, and the query generated by the API is not that great. It looks like the pelias parser is working against us too:

"parsed_text": {
                "subject": "san",
                "locality": "san",
                "admin": "franci"
            },

However, through all of that, the underlying issue of scoring a single field with multiple alt names remains.

Here are the name.default values for the San Francisco and South San Francisco records:

South San Francisco

        "South San Francisco",
        "S S Fran",
        "S S Francisco",
        "S S. Fran",
        "S S. Francisco",
        "S San Fran",
        "S San Francisco",
        "South S Francisco",
        "South S. Francisco",
        "South San Francis",
        "Southsanfrancisco",
        "Ssf"

San Francisco

        "San Francisco",
        "S Francisco",
        "S. Francisco",
        "SFO",
        "Sanfran",
        "Sanfrancisco",
        "Frisco"

There are a lot of issues with both, but in this case it looks like the main problem is that, for the substring francis, South San Francisco matches 7 times while San Francisco matches only 3.

Fundamentally, it looks like this exposes the tradeoff between solving https://github.com/pelias/openstreetmap/issues/507 and https://github.com/pelias/pelias/issues/862. If we penalize records whose fields contain many terms, we lose some important results from records with lots of valid alt names. If we don't, we allow unimportant results that happen to have near-identical alt names to be boosted far too high.

missinglink commented 3 years ago

One possibility worth mentioning is to create a custom similarity (in Java) which adjusts the lengthNorm calculation.

Disabling/enabling norms is already possible via norms: false, but we could potentially improve on that by inspecting the FieldInvertState passed to lengthNorm() and modifying how the norm is calculated.
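For reference, disabling norms is just a per-field mapping option, so it's all-or-nothing with no way to customize the calculation (field name illustrative):

    "name": {
      "type": "text",
      "norms": false
    }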

For example, the public float lengthNorm(FieldInvertState state) function could return only the longest alias in the field, found by looking for the position-increment-gap holes in the positions.

So instead of returning the 'number of words indexed in the field for this doc', it would return the 'length of the longest run of words with consecutive positions indexed in the field for this doc'.

https://stackoverflow.com/a/35901213/339280
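To illustrate the shape of such a customization, here is a rough Java sketch. It assumes a Lucene version where lengthNorm(FieldInvertState) is still overridable (the signature changed to lengthNorm(int) in Lucene 7), and it approximates with the average alias length rather than the longest, since FieldInvertState only exposes aggregate statistics. Class and constant names are illustrative:

    import org.apache.lucene.index.FieldInvertState;
    import org.apache.lucene.search.similarities.ClassicSimilarity;

    // Illustrative sketch only: normalize by an estimated per-alias length
    // instead of the total field length.
    public class AliasAwareSimilarity extends ClassicSimilarity {

      // Must match the position_increment_gap used at index time
      // (Elasticsearch's default for text fields is 100).
      private static final int POSITION_INCREMENT_GAP = 100;

      @Override
      public float lengthNorm(FieldInvertState state) {
        int numTerms = state.getLength() - state.getNumOverlap();
        if (numTerms <= 0) {
          return 1f;
        }
        int lastPosition = state.getPosition();

        // Each additional alias leaves a "hole" of POSITION_INCREMENT_GAP
        // positions, so the alias count can be estimated from the gap
        // between the last position and the term count.
        int gapPositions = Math.max(0, lastPosition - (numTerms - 1));
        int numAliases = 1 + gapPositions / POSITION_INCREMENT_GAP;

        // Normalize by the average alias length rather than the total
        // number of terms across all aliases.
        float avgAliasLength = (float) numTerms / numAliases;
        return (float) (1.0 / Math.sqrt(avgAliasLength));
      }
    }

Shipping something like this would also mean packaging it as an Elasticsearch plugin so the similarity can be registered and referenced from the mapping.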