kgneng2 commented 3 years ago

https://kin3303.tistory.com/m/91?category=886715

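Text first passes through character filters, which can add, remove, or replace characters before tokenization. The standard analyzer has no character filter by default, so whitespace and punctuation such as commas are dropped by the tokenizer, not by a char filter.

The standard analyzer is used by default.

After the char filters, the text is passed to the tokenizer.
The default standard tokenizer splits on word boundaries (Unicode text segmentation), not only on whitespace, and drops most punctuation.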

standard tokenizer

Splits on word boundaries and leaves out most special characters.

" I'm in the mood for drinking semi-dry red wine! "
=> [I'm,in,the,mood,for,drinking,semi,dry,red,wine]

letter tokenizer

Splits the text wherever it meets a character that is not a letter.

" I'm in the mood for drinking semi-dry red wine! " => [I,m,in,the,mood,for,drinking,semi,dry,red,wine]

lowercase => the letter tokenizer plus lowercasing

{
    "tokenizer" : "lowercase",
    "text" : " I'm in the mood for drinking semi-dry red wine! "
}
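
A rough Python approximation of the letter and lowercase tokenizers, just to make the splitting rule concrete. The regex is an assumption standing in for Lucene's own definition of a letter, so edge cases may differ.

```python
import re

# Runs of Unicode letters: \w minus digits and underscore.
LETTERS = re.compile(r"[^\W\d_]+")


def letter_tokenize(text):
    """Split wherever a non-letter character appears."""
    return LETTERS.findall(text)


def lowercase_tokenize(text):
    """Letter tokenizer followed by lowercasing each token."""
    return [token.lower() for token in letter_tokenize(text)]


text = " I'm in the mood for drinking semi-dry red wine! "
print(letter_tokenize(text))
print(lowercase_tokenize(text))
```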

edge n gram tokenizer

This one looks really useful for autocomplete.
Note that it works per character: it emits prefixes of each word by character count, not whole words.
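
A minimal sketch of what "per character" means here. `edge_ngrams` is a hypothetical helper, not an Elasticsearch API; the `min_gram` and `max_gram` names mirror the tokenizer settings used later in these notes.

```python
def edge_ngrams(word, min_gram=1, max_gram=20):
    """Return the prefixes of `word` whose length is between min_gram and max_gram."""
    upper = min(max_gram, len(word))
    return [word[:n] for n in range(min_gram, upper + 1)]


print(edge_ngrams("quick"))
print(edge_ngrams("quick", min_gram=2, max_gram=3))
```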

shingle

The word-level version of n-grams. Very handy.

req

GET /_analyze
{
  "tokenizer": "whitespace",
  "filter": [ "shingle" ],
  "text": "quick brown fox jumps"
}

"output_unigrams": false : when added, single-word tokens are not emitted; only shingles of min_shingle_size to max_shingle_size words are included.

res

[ quick, quick brown, brown, brown fox, fox, fox jumps, jumps ]
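
The same output can be reproduced with a small Python sketch. `shingles` is a hypothetical helper; its parameter names follow the filter options mentioned above, and the default sizes of 2 are an assumption chosen to match the response shown.

```python
def shingles(tokens, min_size=2, max_size=2, output_unigrams=True):
    """Emit word n-grams (shingles) in the order they start in the token stream."""
    out = []
    for i in range(len(tokens)):
        if output_unigrams:
            out.append(tokens[i])
        for size in range(min_size, max_size + 1):
            if i + size <= len(tokens):
                out.append(" ".join(tokens[i:i + size]))
    return out


tokens = "quick brown fox jumps".split()
print(shingles(tokens))
print(shingles(tokens, output_unigrams=False))
```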

token filter

After tokenizing above, the tokens are now filtered.

standard token filter

Does nothing..

stop token filter

Filters tokens so that only meaningful words remain.
Stop words such as and, a, the are dropped.
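
As a toy illustration, a stop filter is just a membership test on each token. The stop list below contains only the three words from this note; the real English stop list in Elasticsearch is longer.

```python
# Only the three words mentioned above; an assumption for illustration.
STOP_WORDS = {"and", "a", "the"}


def stop_filter(tokens, stop_words=STOP_WORDS):
    """Drop every token that is in the stop-word set (exact match)."""
    return [token for token in tokens if token not in stop_words]


print(stop_filter(["the", "quick", "and", "a", "fox"]))
```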

Query

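Just run the query as usual. The tokenizer and analyzer are configured when the index is created; at search time a match query analyzes the query text with the field's analyzer unless a search_analyzer is set.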

kgneng2 commented 3 years ago

PUT /my_index/default/_mapping
{
  "properties": {
    "description": {
      "type":"text",
      "analyzer":"my_custom_analyzer"
    },
    "teaser":{
      "type":"text",
      "analyzer": "standard"
    }
  }
}

POST /my_index/default/1
{
  "description":":)",
  "teaser":":)"
}

GET /my_index/default/_search
{
  "query": {
    "match": {
      "description":"_happy_"
    }
  }
}

kgneng2 commented 3 years ago

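Elasticsearch score calculation

relevance : the score Elasticsearch computes for each hit
It is calculated with TF-IDF or BM25 (the explain output below uses the BM25 formulas).

TF-IDF

Field-length norm : a term in a short field gets a larger weight than the same term in a long field

TF

The more often a term occurs in a document, the higher its weight: a word repeated several times in the text raises the score.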

                      "description": "tfNorm, computed as (freq * (k1 + 1)) / (freq + k1 * (1 - b + b * fieldLength / avgFieldLength)) from:",
                      "details": [
                        {
                          "value": 1,
                          "description": "termFreq=1.0",
                          "details": [

                          ]
                        },
                        {
                          "value": 1.2,
                          "description": "parameter k1",
                          "details": [

                          ]
                        },
                        {
                          "value": 0.75,
                          "description": "parameter b",
                          "details": [

                          ]
                        },
                        {
                          "value": 6.9182334,
                          "description": "avgFieldLength",
                          "details": [

                          ]
                        },
                        {
                          "value": 4,
                          "description": "fieldLength",
                          "details": [

                          ]
                        }
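
Plugging the values from this explain output into the tfNorm formula, as a quick check. `tf_norm` is a hypothetical helper that just transcribes the formula from the description string.

```python
def tf_norm(freq, k1, b, field_length, avg_field_length):
    """tfNorm exactly as written in the explain description above."""
    length_ratio = field_length / avg_field_length
    return (freq * (k1 + 1)) / (freq + k1 * (1 - b + b * length_ratio))


value = tf_norm(freq=1.0, k1=1.2, b=0.75, field_length=4, avg_field_length=6.9182334)
print(round(value, 4))
```

Because fieldLength (4) is below avgFieldLength, the result is above 1, which is the field-length norm at work.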

IDF

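The fewer documents a term appears in across the whole index, the higher its weight. A word that appears in many documents scores low.
You might think that appearing often should be a good thing, but words common to most documents are, in everyday language, things like the Korean particles and endings "은", "는", "다", which are unlikely to carry much meaning.

"description": "idf, computed as log(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5)) from:",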
                      "details": [
                        {
                          "value": 448,
                          "description": "docFreq",
                          "details": [

                          ]
                        },
                        {
                          "value": 1012159,
                          "description": "docCount",
                          "details": [

                          ]
                        }
                      ]
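
The idf formula can be checked the same way with the docFreq and docCount values above. `idf` is a hypothetical helper that transcribes the description string; `math.log` is the natural logarithm.

```python
import math


def idf(doc_count, doc_freq):
    """idf exactly as written in the explain description above."""
    return math.log(1 + (doc_count - doc_freq + 0.5) / (doc_freq + 0.5))


value = idf(doc_count=1012159, doc_freq=448)
print(round(value, 4))
```

A term found in only 448 of roughly a million documents gets a high idf; a term found in half of them would get a much lower one.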

The final score for a term is TF * IDF (idf * tfNorm in the explain output above), multiplied by any boost.

Reference

https://ict-nroo.tistory.com/82

kgneng2 commented 3 years ago

Autocomplete research

Prefix query

Keyword vs text
Keyword : exact search; the prefix is compared against the whole value, character by character from the start
Text : searched against the tokenized terms

Fuzzy query

Based on the Levenshtein (edit distance) algorithm.
https://github.com/renuevo/data-modeling-algorithm/tree/master/levenshtein-distance looks like a good reference.
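
For reference, a textbook dynamic-programming version of Levenshtein distance. This is a sketch of the plain algorithm, not the linked repository's code and not Lucene's implementation; in particular it counts a transposition as two edits.

```python
def levenshtein(a, b):
    """Minimum number of insertions, deletions and substitutions turning a into b."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or keep
        prev = curr
    return prev[-1]


print(levenshtein("kitten", "sitting"))
```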

Match Phrase Prefix

Pass

Combine Query

Fuzzy + Prefix
With a should query, a document is returned if it matches either the fuzzy or the prefix condition.
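
A sketch of how such a request body could be built, written as a Python dict so it can be serialized with `json`. The field name `word` and the input `hotle` are placeholders, and `"fuzziness": "AUTO"` is one possible setting, not something these notes tested.

```python
import json


def fuzzy_or_prefix(field, text):
    """Build a bool/should body that matches on either a prefix or a fuzzy hit."""
    return {
        "query": {
            "bool": {
                "should": [
                    {"prefix": {field: {"value": text}}},
                    {"fuzzy": {field: {"value": text, "fuzziness": "AUTO"}}},
                ]
            }
        }
    }


body = fuzzy_or_prefix("word", "hotle")
print(json.dumps(body, ensure_ascii=False, indent=2))
```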

ES autocomplete index research

GET _mapping
{
  "autocomplete_test_1" : {
    "mappings" : {
      "properties" : {
        "word" : {
          "type" : "text",
          "fields" : {
            "keyword" : {
              "type" : "keyword"
            }
          },
          "analyzer" : "linguist2_analyzer"
        }
      }
    }
  }
}

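The mapping has to combine keyword and text.

With a prefix query

Searching for 스팀게임 추 finds nothing when the field is text only, but it does match once a keyword sub-field is added.
Searching for 추천 alone does not match the keyword, but it does match as text, so the combined mapping covers both cases.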
https://renuevo.github.io/elastic/autocomplete/elastic-autocomplete-1/
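
A toy model of the two cases above. It assumes the indexed value is 스팀게임 추천 and approximates the text analyzer with a whitespace split; the helper names are hypothetical.

```python
def keyword_prefix_match(value, prefix):
    """keyword field: the prefix is compared against the whole stored value."""
    return value.startswith(prefix)


def text_prefix_match(value, prefix):
    """text field: the prefix is compared against each token separately."""
    return any(token.startswith(prefix) for token in value.split())


def combined_prefix_match(value, prefix):
    """text plus keyword sub-field: a hit on either side is enough."""
    return keyword_prefix_match(value, prefix) or text_prefix_match(value, prefix)


doc = "스팀게임 추천"
for query in ("스팀게임 추", "추천"):
    print(query, keyword_prefix_match(doc, query), text_prefix_match(doc, query))
```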

Suggest

There seem to be two approaches.

1. Tokenize, apply filters and settings, then match with a normal query (this gives autocomplete behaviour).

PUT autocomplete_test_2
{
  "settings": {
    "analysis": {
      "analyzer": {
        "autocomplete": { "tokenizer": "autocomplete", "filter": [ "lowercase" ] },
        "autocomplete_search": { "tokenizer": "lowercase" }
      },
      "tokenizer": {
        "autocomplete": {
          "type": "edge_ngram",
          "min_gram": 1,
          "max_gram": 20,
          "token_chars": [ "letter", "digit" ]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "word": { "type": "text", "analyzer": "autocomplete", "search_analyzer": "autocomplete_search" }
    }
  }
}

2. Use the suggesters that ES provides.

kgneng2 commented 3 years ago

Query

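term : matches the value as-is, without morphological analysis or tokenizing
match : matches after morphological analysis and tokenizing

bool query

must : returns documents for which the query is true.
must_not : returns documents for which the query is false.
should : raises the score of documents in the result that match this query.
filter : returns documents for which the query is true, but computes no score. Faster than must and cacheable.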

kgneng2 commented 3 years ago

About multi_match

https://www.elastic.co/guide/en/elasticsearch/reference/1.4/query-dsl-multi-match-query.html#operator-min
multi_match vs should query : stackoverflow
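multi_match lets you choose a type. The default is best_fields (the score of the best-matching field), while a should query behaves like most_fields (the scores are combined).

cross_fields treats several fields as if they were one field and searches across them.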

For instance, when querying the first_name and last_name fields for “Will Smith”, the best match is likely to have “Will” in one field and “Smith” in the other.

operator and minimum_should_match

best_fields and most_fields can produce a result like the one below.

{
  "multi_match" : {
    "query":      "Will Smith",
    "type":       "best_fields",
    "fields":     [ "first_name", "last_name" ],
    "operator":   "and" 
  }
}

  (+first_name:will +first_name:smith)
| (+last_name:will  +last_name:smith)

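All the terms have to be in a single field for a document to match.
The best match would have will in first_name and smith in last_name, which best_fields and most_fields cannot express here.
operator and minimum_should_match are applied per field instead of per term.
One fix is to index a separate full-name field, but cross_fields solves this without it.
minimum_should_match : a number or percentage giving the minimum number of clauses that must match.

First, most_fields applies operator and minimum_should_match per field, whereas cross_fields applies them per term.

Second, relevance. Suppose we search for the name "Will Smith". The two terms will and smith are each searched in both the first_name and last_name fields. Because smith is rare as a first name, first_name:smith gets a high IDF, so "Smith Jones" can score higher than "Will Smith".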
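
The per-field versus per-term difference can be sketched with plain Python sets of tokens. This only models which documents match with operator "and"; scoring is ignored, and the helper names are hypothetical.

```python
def field_centric_and(doc, terms):
    """best_fields / most_fields with operator "and": one field must hold every term."""
    return any(all(term in tokens for term in terms) for tokens in doc.values())


def term_centric_and(doc, terms):
    """cross_fields with operator "and": every term must appear in some field."""
    return all(any(term in tokens for tokens in doc.values()) for term in terms)


doc = {"first_name": {"will"}, "last_name": {"smith"}}
terms = ["will", "smith"]
print(field_centric_and(doc, terms))  # no single field holds both terms
print(term_centric_and(doc, terms))   # each term is found in one of the fields
```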

kgneng2 commented 3 years ago

Elasticsearch docs for reference

https://esbook.kimjmin.net/06-text-analysis/6.7-stemming/6.7.2-nori

kgneng2 commented 3 years ago

{
  "explain": true,
  "query": {
    "bool": {
      "must": {
        "match": {
          "CountryCode": "KR"
        }
      },
      "should": [
        {
          "terms": {
            "hotelId": [
              "4757718",
              "4943464",
              "4996410",
              "5257994",
              "5279852",
              "5281864",
              "5281890"
            ]
          }
        },
        {
          "match": {
            "placeKo": "화성시"
          }
        },
        {
          "multi_match": {
            "query": "스타즈호텔 동탄",
            "fields": [
              "nameKo",
              "featuredNameKo",
              "normNameKo",
              "baseNameKo"
            ],
            "type": "cross_fields",
            "minimum_should_match": "100%",
            "analyzer": "query_ko"
          }
        },
        {
          "multi_match": {
            "query": "staz hotel dongtan",
            "fields": [
              "nameEn",
              "featuredNameEn",
              "normNameEn",
              "baseNameEn"
            ],
            "type": "cross_fields",
            "minimum_should_match": "50%",
            "analyzer": "query_en"
          }
        }
      ]
    }
  }
}

Not sure what to do about this one..

kgneng2 commented 3 years ago

Autocomplete mapping reference

https://www.skyer9.pe.kr/wordpress/?p=1101

Example

Example autocomplete index settings..

{
  "settings": {
    "analysis": {
      "filter": {
        "hotel_synonym": {
          "type": "synonym",
          "synonyms_path": "analysis/hotel_synonym.txt"
        }
      },
      "analyzer": {
        "autocomplete": {
          "tokenizer": "autocomplete",
          "filter": [
            "lowercase",
            "hotel_synonym"
          ]
        },
        "autocomplete_search": {
          "tokenizer": "lowercase"
        }
      },
      "tokenizer": {
        "autocomplete": {
          "type": "edge_ngram",
          "min_gram": 1,
          "max_gram": 20,
          "token_chars": [
            "letter",
            "digit"
          ]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "word": {
        "type": "text",
        "analyzer": "autocomplete",
        "search_analyzer": "autocomplete_search"
      }
    }
  }
}

kgneng2 commented 3 years ago

The tokenizer first cuts the text into tokens; a token filter then works like a map function applied to each token the tokenizer produced.

req

GET _analyze
{
  "tokenizer": "standard",

  "text": "메리어트 뉴욕"
}

res

{
  "tokens" : [
    {
      "token" : "메리어트",
      "start_offset" : 0,
      "end_offset" : 4,
      "type" : "<HANGUL>",
      "position" : 0
    },
    {
      "token" : "뉴욕",
      "start_offset" : 5,
      "end_offset" : 7,
      "type" : "<HANGUL>",
      "position" : 1
    }
  ]
}

req with filter

GET _analyze
{
  "tokenizer": "standard",
  "filter": [
    {
      "type": "edge_ngram",
      "min_gram": 1,
      "max_gram": 20
    }
  ],
  "text": "메리어트 뉴욕"
}

response

{
  "tokens" : [
    {
      "token" : "메",
      "start_offset" : 0,
      "end_offset" : 4,
      "type" : "<HANGUL>",
      "position" : 0
    },
    {
      "token" : "메리",
      "start_offset" : 0,
      "end_offset" : 4,
      "type" : "<HANGUL>",
      "position" : 0
    },
    {
      "token" : "메리어",
      "start_offset" : 0,
      "end_offset" : 4,
      "type" : "<HANGUL>",
      "position" : 0
    },
    {
      "token" : "메리어트",
      "start_offset" : 0,
      "end_offset" : 4,
      "type" : "<HANGUL>",
      "position" : 0
    },
    {
      "token" : "뉴",
      "start_offset" : 5,
      "end_offset" : 7,
      "type" : "<HANGUL>",
      "position" : 1
    },
    {
      "token" : "뉴욕",
      "start_offset" : 5,
      "end_offset" : 7,
      "type" : "<HANGUL>",
      "position" : 1
    }
  ]
}
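
The token list in this response can be reproduced with a short sketch of the tokenizer-then-filter pipeline. A whitespace split stands in for the standard tokenizer (an assumption that happens to hold for this input), and the filter is really a flat map, since one token becomes several.

```python
def analyze(text, min_gram=1, max_gram=20):
    """Tokenize on whitespace, then expand every token into its edge n-grams."""
    tokens = text.split()  # stand-in for the standard tokenizer
    return [
        token[:n]
        for token in tokens
        for n in range(min_gram, min(max_gram, len(token)) + 1)
    ]


print(analyze("메리어트 뉴욕"))
```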

kgneng2 commented 3 years ago

https://www.elastic.co/guide/en/elasticsearch/reference/current/analyzer.html

A search_analyzer setting for non-phrase queries that will remove stop words
A search_quote_analyzer setting for phrase queries that will not remove stop words

kgneng2 commented 3 years ago

Test with a plain n-gram setup

{
  "autocomplete_test_2" : {
    "aliases" : { },
    "mappings" : {
      "properties" : {
        "word" : {
          "type" : "text",
          "analyzer" : "autocomplete",
          "search_analyzer" : "autocomplete_search"
        }
      }
    },
    "settings" : {
      "index" : {
        "number_of_shards" : "1",
        "provided_name" : "autocomplete_test_2",
        "creation_date" : "1619084078796",
        "analysis" : {
          "analyzer" : {
            "autocomplete" : {
              "filter" : [
                "lowercase"
              ],
              "tokenizer" : "autocomplete"
            },
            "autocomplete_search" : {
              "tokenizer" : "lowercase"
            }
          },
          "tokenizer" : {
            "autocomplete" : {
              "token_chars" : [
                "letter",
                "digit"
              ],
              "min_gram" : "1",
              "type" : "edge_ngram",
              "max_gram" : "20"
            }
          }
        },
        "number_of_replicas" : "1",
        "uuid" : "-8FO42iGTzyvqlEuurgttw",
        "version" : {
          "created" : "7100299"
        }
      }
    }
  }
}

query

GET autocomplete_test_2/_search
{
  "explain": true, 
  "query" :{
    "match": {
      "word": "여성 트레인"
    }
  }
}

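여성속옷 comes out with a higher score. The reason is "dl, length of field": its value is smaller for 여성속옷 than for 여성트레이닝복. This happens when the query contains a typo(?) or when the n-grams are built without splitting Hangul into jamo.

Possible fixes are to introduce fuzziness, or to try indexing the decomposed Hangul jamo.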
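
A small sketch of why field length ends up deciding the order. It assumes the two documents are the single words 여성속옷 and 여성트레이닝복, and that the search analyzer splits the query into 여성 and 트레인.

```python
def index_terms(word, min_gram=1, max_gram=20):
    """Edge n-grams that the autocomplete analyzer would index for one word."""
    return {word[:n] for n in range(min_gram, min(max_gram, len(word)) + 1)}


query_terms = ["여성", "트레인"]
docs = ["여성속옷", "여성트레이닝복"]

# Which query terms hit each document, and how many terms each field holds.
hits = {doc: [t for t in query_terms if t in index_terms(doc)] for doc in docs}
lengths = {doc: len(index_terms(doc)) for doc in docs}
print(hits)
print(lengths)
```

Only 여성 matches in both documents, because 트레인 is not a prefix of 여성트레이닝복, so the shorter field wins on the length norm.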

kgneng2 commented 3 years ago

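search analyzer

If a field is tokenized with n-grams, the same analyzer is used at search time by default, so the query is also split into n-grams and the results can differ from what you expect.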


Analysis settings to define the custom autocomplete analyzer.
The text field uses the autocomplete analyzer at index time, but the standard analyzer at search time.
This field is indexed as the terms: [ q, qu, qui, quic, quick, b, br, bro, brow, brown, f, fo, fox ]
The query searches for both of these terms: [ quick, br ]
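
A toy comparison of the two setups. Whitespace splitting plus lowercasing stands in for the real analyzers, matching is a simple "any term in common", and the second document "Queen Bee" is a made-up example.

```python
def edge_ngram_analyze(text):
    """Index-time analyzer: lowercase, split, expand into edge n-grams."""
    return [w[:n] for w in text.lower().split() for n in range(1, len(w) + 1)]


def standard_analyze(text):
    """Search-time stand-in for the standard analyzer: lowercase and split."""
    return text.lower().split()


def matches(doc_terms, query_terms):
    """True if at least one query term is among the indexed terms."""
    return any(term in doc_terms for term in query_terms)


indexed = edge_ngram_analyze("Quick Brown Fox")
unrelated = edge_ngram_analyze("Queen Bee")  # hypothetical second document

same_analyzer = edge_ngram_analyze("Quick Br")     # q, qu, ..., b, br
separate_analyzer = standard_analyze("Quick Br")   # quick, br

print(matches(unrelated, same_analyzer))      # the single letter q is enough to match
print(matches(unrelated, separate_analyzer))  # no indexed term equals quick or br
```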

kgneng2 commented 3 years ago

https://www.elastic.co/guide/en/elasticsearch/reference/7.x/query-dsl-match-bool-prefix-query.html

match-bool-prefix

GET /_search
{
  "query": {
    "match_bool_prefix" : {
      "message" : "quick brown f"
    }
  }
}

GET /_search
{
  "query": {
    "bool" : {
      "should": [
        { "term": { "message": "quick" }},
        { "term": { "message": "brown" }},
        { "prefix": { "message": "f"}}
      ]
    }
  }
}

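The first query is analyzed into quick, brown, f and behaves like the second bool query. By default it uses the analyzer mapped to the field.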
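
The rewrite from match_bool_prefix to a bool query can be sketched as follows. `match_bool_prefix` here is a hypothetical Python helper, and a lowercase whitespace split stands in for the field's analyzer.

```python
def match_bool_prefix(field, text):
    """Every analyzed term becomes a term clause, except the last, which becomes a prefix clause."""
    terms = text.lower().split()  # stand-in for the field's analyzer
    if not terms:
        raise ValueError("query text produced no terms")
    *head, last = terms
    should = [{"term": {field: term}} for term in head]
    should.append({"prefix": {field: last}})
    return {"query": {"bool": {"should": should}}}


print(match_bool_prefix("message", "quick brown f"))
```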

kgneng2 / blokg

elasticsearch token analyzer #44
