pelias / schema

elasticsearch schema files and tooling
MIT License

apply unidirectional synonyms at query-time #411

Open missinglink opened 4 years ago

missinglink commented 4 years ago

as of today we finally removed all unidirectional synonyms (ones using the a=>b syntax) from our default synonyms file 🎉

unfortunately, I realized that there is a bug which is preventing those unidirectional synonyms from working properly when users specify them in a custom configuration.

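as per the example below, it's possible to index the term "hello" and then not be able to retrieve the document using the term "hello" 🤔 the index-time analyzer rewrites "hello" to "world", while the query-time analyzer leaves "hello" untouched, so the two never match.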

the solution to this problem is to split all the synonyms into two buckets: one for unidirectional synonyms (a=>b syntax) and one for bidirectional synonyms (a,b syntax). we will then need to apply both buckets at index-time and only the unidirectional synonyms at query-time.
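
The split into two buckets could be prototyped with a couple of grep calls. This is only a sketch, not how Pelias loads synonyms: the file names (synonyms.txt, unidirectional.txt, bidirectional.txt) and the sample rules are made up for illustration, and it assumes one rule per line with `#` starting a comment.

```shell
#!/bin/sh
# Hypothetical input file: one synonym rule per line, '#' starts a comment.
cat > synonyms.txt <<'EOF'
# custom synonyms
hello => world
street, st
foo => bar
avenue, ave
EOF

# Drop comment lines and blank lines first.
rules=$(grep -v -e '^[[:space:]]*#' -e '^[[:space:]]*$' synonyms.txt)

# Bucket 1: unidirectional rules (a => b), for index-time and query-time.
printf '%s\n' "$rules" | grep -e '=>' > unidirectional.txt

# Bucket 2: bidirectional rules (a, b), for index-time only.
printf '%s\n' "$rules" | grep -v -e '=>' > bidirectional.txt
```

After running it, unidirectional.txt holds the two `=>` rules and bidirectional.txt holds the two comma-separated rules, each in their original order.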

# start clean: delete the test index if it already exists
curl -s -XDELETE "http://localhost:9200/foo?pretty=true"

# create the index: synonyms are applied at index-time only (search_analyzer is "standard")
curl -s -XPUT "http://localhost:9200/foo?pretty=true" \
  -H 'Content-Type: application/json' \
  -d '{
      "settings" : {
        "analysis": {
          "filter" : {
            "mySynonym" : {
              "type" : "synonym",
              "synonyms" : [
                "hello => world"
              ]
            }
          },
          "analyzer": {
            "myAnalyzer": {
              "type": "custom",
              "tokenizer": "standard",
              "filter": [
                "mySynonym"
              ]
            }
          }
        }
      },
      "mappings" : {
        "_doc" : {
          "properties" : {
            "field1": {
              "type": "text",
              "analyzer": "myAnalyzer",
              "search_analyzer": "standard"
            }
          }
        }
      }
    }'

# index a document containing the term "hello" (stored in the index as "world")
curl -s -XPOST "http://localhost:9200/foo/_doc/example?pretty=true" \
  -H 'Content-Type: application/json' \
  -d '{
      "field1": "hello"
    }'

# make the document visible to search
curl -s -XPOST "http://localhost:9200/foo/_refresh?pretty=true"

# search for the same term: this does not retrieve the document indexed above
curl -s -XGET "http://localhost:9200/foo/_search?pretty=true" \
  -H 'Content-Type: application/json' \
  -d '{
      "query": {
        "match": {
          "field1": "hello"
        }
      }
    }'

missinglink commented 4 years ago

a workaround, for now, is to repeat the token from the left side of the => on the right side, like so:

hello => hello, world
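
For a file with many such rules, the same rewrite could be applied with sed. This is a rough sketch under stated assumptions: `add_left_side` is a hypothetical helper name, and it only touches rules whose left side is a single term or phrase without commas; everything else passes through unchanged.

```shell
#!/bin/sh
# Rewrite "a => b" as "a => a, b".
# Assumption: the left-hand side has no commas; rules with a comma on the
# left, and rules without "=>", are printed unchanged.
add_left_side() {
  sed -E 's/^[[:space:]]*([^=,]*[^=,[:space:]])[[:space:]]*=>[[:space:]]*(.*)$/\1 => \1, \2/'
}

printf '%s\n' 'hello => world' 'street, st' | add_left_side
```

The rewrite is not idempotent: running it twice on the same rule repeats the left-hand token again, so it is meant as a one-off conversion.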

orangejulius commented 4 years ago

So we've now done this for the name field, and the address_parts.street field with https://github.com/pelias/api/pull/1444. Are there other fields we should do the same for, or is this all done?

missinglink commented 4 years ago

This is only really relevant for custom user-defined synonyms and doesn't affect stock-standard Pelias.

So if a user added a synonym foo => bar in custom_name, for instance, then all instances of 'foo' would be replaced by 'bar' at index-time. At query-time there is no such replacement, so the doc doesn't match a query that is verbatim the same as what was in the source data.
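
As a rough illustration of what applying the replacement at query-time as well might look like, here is a request body in the style of the reproduction above. It is a sketch only: the filter and analyzer names (unidirectionalSynonyms, bidirectionalSynonyms, indexAnalyzer, queryAnalyzer) and the "street, st" rule are hypothetical and are not the names Pelias uses. The mapping mirrors the reproduction, including the `_doc` level, which newer Elasticsearch versions may reject.

```shell
#!/bin/sh
# Sketch only: filter and analyzer names are hypothetical, not Pelias's own.
settings=$(cat <<'JSON'
{
  "settings": {
    "analysis": {
      "filter": {
        "unidirectionalSynonyms": {
          "type": "synonym",
          "synonyms": ["foo => bar"]
        },
        "bidirectionalSynonyms": {
          "type": "synonym",
          "synonyms": ["street, st"]
        }
      },
      "analyzer": {
        "indexAnalyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": ["unidirectionalSynonyms", "bidirectionalSynonyms"]
        },
        "queryAnalyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": ["unidirectionalSynonyms"]
        }
      }
    }
  },
  "mappings": {
    "_doc": {
      "properties": {
        "field1": {
          "type": "text",
          "analyzer": "indexAnalyzer",
          "search_analyzer": "queryAnalyzer"
        }
      }
    }
  }
}
JSON
)

# Print the body; to try it against a local Elasticsearch, pipe it to:
#   curl -s -XPUT "http://localhost:9200/foo?pretty=true" \
#     -H 'Content-Type: application/json' -d @-
printf '%s\n' "$settings"
```

With this shape, 'foo' becomes 'bar' both when the document is indexed and when the query is analyzed, while the bidirectional rule is only expanded at index-time.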

Let's leave this open for now so we remember. I'll try and fix it at some point, but it's a relatively low priority because it may not even affect anyone!

missinglink commented 4 years ago

One totally valid fix is just to say we don't support the => syntax at all, or that we warn anyone who uses it.
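
The warning option could be as small as a grep over the user's synonyms file. A minimal sketch, assuming one rule per line and `#` comments; `warn_unidirectional` is a hypothetical helper, and custom_name.txt is used here purely as an example file name.

```shell
#!/bin/sh
# Hypothetical check: warn about every rule that uses the "=>" syntax.
warn_unidirectional() {
  # grep -n prefixes each match with its line number; comment lines are skipped.
  grep -n -e '=>' "$1" | grep -v -e '^[0-9]*:[[:space:]]*#' |
    while IFS= read -r match; do
      printf 'warning: unidirectional synonym at %s:%s\n' "$1" "$match" >&2
    done
}

# Example input with one bidirectional and one unidirectional rule.
printf '%s\n' 'street, st' 'hello => world' > custom_name.txt
warn_unidirectional custom_name.txt
```

The warnings go to stderr, so they would not interfere with anything that consumes the tool's normal output.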