skroutz / elasticsearch-analysis-greeklish

A greeklish token filter for elasticsearch
http://www.skroutz.gr/blog/team

Am I doing something wrong?? #2

Closed by mixalistzikas 9 years ago

mixalistzikas commented 9 years ago

lower_greek and stem_greek work fine, but "greeklish" doesn't. Am I doing something wrong?

I'm searching for "Dimitris" but never get the name "Δημήτρης" in my results.

{
"settings":{
    "analysis":{
        "analyzer":{
            "stem_analyzer":{
                "type":"custom",
                "tokenizer":"standard",
                "filter": ["lower_greek", "stem_greek","greeklish_analysis"]
            }
        },
        "filter": {
            "lower_greek": {
                "type":"lowercase",
                "language":"greek"
            },
            "stem_greek": {
                "type":"skroutz_stem_greek"
            },
            "greeklish_analysis":{
                "type": "greeklish",
                "max_expansions": 15
            }
        }
    }
},
"mappings": {
        "shooters": {
            "properties": {
                "name": {
                    "type": "string",
                    "analyzer": "stem_analyzer"
                }
            }
        }
    }
}
chief commented 9 years ago

Hi, first of all, "dimitris" is a valid greeklish form, as we can see here:

greeklish_words = converter.convert('δημητρης')
["dhmhtrhs", "dimhtrhs", "dhmitrhs", "dimitrhs", "dhmhtris", "dimhtris", "dhmitris", "dimitris"] 

What is your query analyzer? Version of ElasticSearch?
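
For anyone wondering where those eight forms come from, here is a minimal Python sketch of the expansion idea. It is not the plugin's code: the `GREEK_TO_LATIN` table and the `greeklish_variants` helper are hypothetical, the table covers only the letters of this one word, and treating `max_expansions` as a simple cap on the number of generated forms is an assumption.

```python
from itertools import product

# Hypothetical, deliberately tiny table: only the letters of "δημητρης".
# The plugin's real conversion rules are richer and are not reproduced here.
GREEK_TO_LATIN = {
    "δ": ["d"],
    "η": ["h", "i"],  # the one ambiguous letter in this word
    "μ": ["m"],
    "τ": ["t"],
    "ρ": ["r"],
    "ς": ["s"],
}

def greeklish_variants(word, max_expansions=15):
    """Return up to max_expansions Latin spellings of a lowercase, unaccented Greek word."""
    choices = [GREEK_TO_LATIN[ch] for ch in word]
    variants = ["".join(combo) for combo in product(*choices)]
    # Assumption: the cap simply truncates the list of generated forms.
    return variants[:max_expansions]

variants = greeklish_variants("δημητρης")
print(len(variants), "dimitris" in variants)  # 8 True
```

Three occurrences of "η" with two spellings each give 2 × 2 × 2 = 8 forms, which matches the list above (the order may differ).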

mixalistzikas commented 9 years ago

I did the following and it works fine! It only works for version <= 1.5, though. Could you fix compatibility with 1.5.1?

{
"settings":{
    "analysis":{
        "analyzer":{
            "greeklish_analyzer":{
                "type":"custom",
                "tokenizer":"standard",
                "filter": ["lower_greek","greeklish_analysis"]
            },
            "stem_analyzer":{
                "type":"custom",
                "tokenizer":"standard",
                "filter": ["lower_greek","stem_greek"]
            }
        },
        "filter": {
            "lower_greek": {
                "type":"lowercase",
                "language":"greek"
            },
            "stem_greek": {
                "type":"skroutz_stem_greek"
            },
            "greeklish_analysis":{
                "type": "greeklish",
                "max_expansions": 15
            }   
        }
    }
},
"mappings": {
        "shooters": {
            "properties": {
                "name": {
                    "type": "multi_field",
                    "fields": {
                        "name": { "type": "string", "analyzer": "stem_analyzer" },
                        "english": { "type": "string", "analyzer": "english" },
                        "greeklish": { "type": "string", "analyzer": "greeklish_analyzer" }
                    }
                }
            }
        }
    }

}
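
With a mapping like this one, the search side has to target the sub-fields too. Here is a rough sketch of a query body, built as a Python dict so it can be dumped to JSON. It assumes the sub-fields are addressable as `name.english` and `name.greeklish` and that a `multi_match` query fits the use case; the search term is just the example from the first post.

```python
import json

# Assumption: sub-fields of the multi_field are addressed with dotted paths.
query_body = {
    "query": {
        "multi_match": {
            "query": "Dimitris",
            "fields": ["name", "name.english", "name.greeklish"],
        }
    }
}

# Print the body that would be sent to the index's _search endpoint.
print(json.dumps(query_body, ensure_ascii=False, indent=2))
```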

chief commented 9 years ago

Do you use this version?

mixalistzikas commented 9 years ago

Yes... but it is not compatible...

chief commented 9 years ago

ok we will check it, thanks

astathopoulos commented 9 years ago

@mixalistzikas please apply stem_greek after the greeklish filter (greeklish_analysis in your settings). With your configuration, it is the stemmed version of the word Δημητρης (δημητρ, if I am correct) that gets expanded into greeklish words, so when you search for dimitris there is no matching document.

Furthermore, we have version 0.11 which is compatible with elasticsearch 1.5.x. Check our v1.5.0 branch!
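
To make the suggested filter order concrete, here is a sketch of the settings from the first post with only the order inside the analyzer changed. It is written as a Python dict and dumped to JSON; the filter and analyzer names are the ones from the thread, and nothing else is meant to differ from the original settings.

```python
import json

# Same filter definitions as in the first post. Only the order inside the
# analyzer changes: greeklish expansion now runs before stemming.
settings = {
    "analysis": {
        "analyzer": {
            "stem_analyzer": {
                "type": "custom",
                "tokenizer": "standard",
                "filter": ["lower_greek", "greeklish_analysis", "stem_greek"],
            }
        },
        "filter": {
            "lower_greek": {"type": "lowercase", "language": "greek"},
            "stem_greek": {"type": "skroutz_stem_greek"},
            "greeklish_analysis": {"type": "greeklish", "max_expansions": 15},
        },
    }
}

# Serialize to the JSON body used when creating the index.
index_body = json.dumps({"settings": settings}, ensure_ascii=False, indent=2)
print(index_body)
```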

kvavliak commented 9 years ago

Hello, Greek -> Latin works fine with the "greeklish" type (e.g. 'δημητρης' -> "dhmhtrhs", "dimhtrhs", "dhmitrhs", "dimitrhs", etc.). Is there any way to convert Latin to Greek characters (e.g. 'dimitris' -> "δημητρης", "διμιτρις", "ντιμιτρις", etc.)? Thank you in advance.

astathopoulos commented 9 years ago

Hi, unfortunately this plugin converts only from Greek to Latin. There is no easy way to distinguish a greeklish word from any other Latin word in order to convert it to the corresponding Greek word. And if you have a multilanguage index (e.g. English and Greek), the greeklish analyzer would analyze all the Latin words of your index, which does not make sense.
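
A small Python sketch of the problem, under stated assumptions: the `LATIN_TO_GREEK` table and the `greek_candidates` helper are hypothetical and cover only a handful of letters. It shows that a naive reverse conversion fans out quickly and treats an English word exactly like a greeklish one.

```python
from itertools import product

# Hypothetical reverse table, limited to the letters needed for the examples.
LATIN_TO_GREEK = {
    "d": ["δ", "ντ"],
    "i": ["ι", "η"],
    "m": ["μ"],
    "t": ["τ"],
    "r": ["ρ"],
    "s": ["σ"],
}

def greek_candidates(word):
    """Return every Greek spelling the table allows for a lowercase Latin word."""
    choices = [LATIN_TO_GREEK[ch] for ch in word]
    results = []
    for combo in product(*choices):
        candidate = "".join(combo)
        # Greek orthography uses the final form of sigma at the end of a word.
        if candidate.endswith("σ"):
            candidate = candidate[:-1] + "ς"
        results.append(candidate)
    return results

candidates = greek_candidates("dimitris")
print(len(candidates))           # 16
print(greek_candidates("mist"))  # ['μιστ', 'μηστ']
```

Even with this tiny table, "dimitris" yields 16 candidates, and the English word "mist" is converted just as readily, since nothing in the function can tell that it is not greeklish.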