jprante / elasticsearch-plugin-bundle

A bundle of useful Elasticsearch plugins
GNU Affero General Public License v3.0

Docs: searching for example #27

Open ThaDafinser opened 7 years ago

ThaDafinser commented 7 years ago

Hello,

I have now tried to complete the examples for Kibana, see https://gist.github.com/ThaDafinser/d27b4fa9d144b0083ee7dad37484fdd8

For the examples, I went through the complete plugin list: https://github.com/jprante/elasticsearch-plugin-bundle#a-plugin-bundle-for-elastisearch

For those plugins I couldn't find docs (@jprante could you help me here, please?)

Other missing examples for now (could not create a "live" example yet)

Is anything else missing? When they are finished, do you want them in the README or in a separate file?

ThaDafinser commented 7 years ago

For auto_phrase, this is what I have found so far (I could not get it working):

GET _analyze
{
  "tokenizer": "standard",
  "filter": [
    {
      "type": "auto_phrase",
      "phrases": [
        "C:/Data/test.txt"
      ]
    }
  ],
  "text": "what is my income tax refund this year now that my property tax is so high"
}

https://github.com/jprante/elasticsearch-plugin-bundle/blob/68dc19c34c40364e04400f92500b973a6cbae170/src/main/java/org/xbib/elasticsearch/index/analysis/autophrase/AutoPhrasingTokenFilterFactory.java

nkrot commented 7 years ago

Hi,

In addition to the original issue, LemmatizeTokenFilter lacks a description too. I would appreciate any info on how to configure it, which languages it supports, and what is behind this plugin.

To me, this plugin looks similar to the baseform plugin. From skimming through the code, my guess is that the lemmatizer replaces the original word, while baseform adds the generated form alongside the original.
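To make the guessed difference concrete, here is a toy Python sketch. It is not the plugin code: the `BASE_FORMS` dictionary and both function names are hypothetical, and the real filters load language-specific resources instead of a hard-coded map. Each token is shown as a `(position, term)` pair.

```python
# Hypothetical mini-dictionary, for illustration only.
BASE_FORMS = {"gehe": "gehen", "neuen": "neu", "Schuhen": "Schuh"}


def replace_with_base_form(tokens):
    """Lemmatize-style, as guessed above: emit the base form instead of the token."""
    return [(pos, BASE_FORMS.get(tok, tok)) for pos, tok in enumerate(tokens)]


def add_base_form(tokens):
    """Baseform-style, as guessed above: keep the token and add the
    base form as an extra term at the same position."""
    out = []
    for pos, tok in enumerate(tokens):
        out.append((pos, tok))
        base = BASE_FORMS.get(tok)
        if base is not None and base != tok:
            out.append((pos, base))
    return out
```

With the first variant a search for the inflected form only matches if the query is analyzed the same way; with the second, both the original and the base form stay searchable.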

Thanx

ThaDafinser commented 7 years ago

@nkrot in general, you already gave the answer.

I updated the gist with an example. Like you said, it keeps only the base form and removes the original word:

GET _analyze
{
  "tokenizer": "standard",
  "filter": [
    {
      "type": "lemmatize",
      "language": "de"
    }
  ],
  "text": "Ich gehe gerne mit meinen neuen Schuhen"
}
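For anyone without Kibana, the same request can be sent from a small script. This is only a sketch: it assumes a node at `http://localhost:9200` with the plugin bundle installed, and the helper names `build_analyze_body` and `analyze` are made up for this example.

```python
import json
import urllib.request


def build_analyze_body(text, language="de"):
    """Build the same body as the console request above."""
    return {
        "tokenizer": "standard",
        "filter": [{"type": "lemmatize", "language": language}],
        "text": text,
    }


def analyze(body, base_url="http://localhost:9200"):
    """POST the body to _analyze and return the token strings.

    base_url is an assumption; point it at your own node.
    """
    request = urllib.request.Request(
        base_url + "/_analyze",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        payload = json.load(response)
    # The _analyze response carries a "tokens" list; each entry has a "token" field.
    return [token["token"] for token in payload["tokens"]]
```

Calling `analyze(build_analyze_body("Ich gehe gerne mit meinen neuen Schuhen"))` against a running node should return the emitted terms, which makes it easy to compare the output for different filter settings.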
nkrot commented 7 years ago

@ThaDafinser, thank you. Do you have any info on

  1. respectKeywords, available in lemmatize plugin
  2. lemmaOnly, available in lemmatize plugin
  3. where the lemmatizer resources (FSA) come from and how they compare to baseform

thanx,

ThaDafinser commented 7 years ago

Sadly not yet.

The tests contain a lot of examples of how it should work: https://github.com/jprante/elasticsearch-plugin-bundle/blob/93ed7cb33b9c8095c279405467d4301422324655/src/test/java/org/xbib/elasticsearch/index/analysis/lemmatize/LemmatizeTokenFilterTests.java#L113

jprante commented 7 years ago

LemmatizeTokenFilter is still a work in progress and in an experimental stage. It is considered an alternative to a synonym token filter (https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-synonym-tokenfilter.html), but based on a language-specific dictionary of known compound words.

ThaDafinser commented 7 years ago

After going through a lot of examples, code and so on...

I think the best approach would be to create something like this: https://github.com/ThaDafinser/elasticsearch-plugin-bundle/blob/feature/doc/docs/index.md

There are too many things to explain for a "one pager" (or for putting everything in the README), and with this approach the documentation can be created step by step.

As mentioned at the end, it is similar to the structure of the ES reference guide: https://www.elastic.co/guide/en/elasticsearch/reference/5.3/index.html

@jprante what do you think? If you like it, I will add some more pages and create a PR for this one.