Get list of all available analyzers. Request for a new API?

lukas-vlcek commented 1 year ago

Is your feature request related to a problem? Please describe.

I am missing an option to get a list of all available analyzers. The analyze API documentation already mentions "built-in" analyzers, but a regular user has no way to learn what all the options are. Even for people who are familiar with the code, the list is subject to change. One option would be to publish the list in the documentation. However, I would prefer that the cluster itself could return a list of its analyzers (and also tokenizers, char filters, and normalizers).

There are built-in analyzers at the cluster level (in fact, the list is kept at the node level, see below).
There are also ad-hoc/custom analyzers at the individual index level (but visibility of these should be subject to RBA rules?).

Describe the solution you'd like

As far as I understand, the list of all built-in analyzers is materialized once AnalysisModule.setupAnalyzers(plugins) is called. I think it would be useful to extend one of the _nodes/ APIs and give it an option to return the list of all built-in analyzers (and their components: tokenizers, etc.). (It needs to be an API at the "nodes" level because AnalysisRegistry is kept per node, and I think the list of built-in analyzers can differ on each node depending on the installed plugins.)

As for the list of analyzers defined at the index level I am not sure at this point. Maybe later...

Describe alternatives you've considered

The alternative is to go to the documentation (which does not have this list) or to the code (which is not an option for many people).

Additional context

n/a

dtaivpp commented 1 year ago

@lukas-vlcek I couldn't agree more. What may even be interesting is if we could start to expose some of these through a generated "cluster documentation" page in dashboards. That could show a bit of the cluster's meta information.

lukas-vlcek commented 1 year ago

BTW, I am looking at this and I am trying to implement a quick prototype. Feel free to assign me.

lukas-vlcek commented 1 year ago

Hi,

I prepared an experimental plugin with this functionality.

It can be found here: https://github.com/lukas-vlcek/OpenSearch-list-built-in-analyzers
The first experimental release for OpenSearch 2.4.1 can be found here: https://github.com/lukas-vlcek/OpenSearch-list-built-in-analyzers/releases/tag/2.4.1
You can run ./gradlew clean build on the main branch to build the plugin for the recent OpenSearch 3.0.0-SNAPSHOT version.

At this point I would love to get some feedback.

Below are more details about what this plugin can offer.

How does it work?

Right now, every OpenSearch node has an internal AnalysisRegistry object that can be easily injected into plugins. This object is the main interface when a client wants to get access to a specific analysis component (analyzer, tokenizer, etc.). The problem, however, is that this registry object can only return an analysis component if you know its name up front. Although the registry holds the list of all known components internally, that list is kept private. The important point is that all the analysis components are lazily initialized. I think there is a good reason not to initialize an analysis component if it never gets used (and to initialize it only before first use): it saves resources, especially for components based on large dictionaries.

Internally, the analysis registry contains Maps whose keys point to analysis component providers. So what I ended up doing is that I used reflection to get access to those internal Maps and pulled the keySets from them. I think this should be pretty safe and should not introduce any vulnerabilities. Those Maps are initialized at node bootstrap (which means that the keySets do not change later).
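
To illustrate the idea, here is a simplified model in Python. It is not the actual AnalysisRegistry code: the class LazyRegistry and its methods are hypothetical names used only for this sketch. It shows why reading the key set of the provider map lists every known component without triggering any lazy initialization.

```python
from typing import Callable, Dict, List


class LazyRegistry:
    """Toy model of a registry that maps component names to providers."""

    def __init__(self, providers: Dict[str, Callable[[], object]]) -> None:
        # Providers are cheap to store; the components they build may be expensive.
        self._providers = dict(providers)
        self._instances: Dict[str, object] = {}

    def get(self, name: str) -> object:
        # Build the component on first use only, then cache it.
        if name not in self._instances:
            self._instances[name] = self._providers[name]()
        return self._instances[name]

    def names(self) -> List[str]:
        # Listing names reads only the keys, so nothing gets initialized.
        return sorted(self._providers)

    def initialized(self) -> List[str]:
        # Names of the components that have actually been built so far.
        return sorted(self._instances)


registry = LazyRegistry({
    "standard": lambda: "standard-analyzer-instance",
    "kuromoji": lambda: "kuromoji-analyzer-instance",  # imagine a large dictionary load
})

print(registry.names())        # every known name
print(registry.initialized())  # nothing has been built yet
registry.get("standard")
print(registry.initialized())  # only the component that was requested
```

In this model, names() plays the role of the keySet pulled from the internal Maps: it is fixed once the registry is constructed and does not depend on which components were used.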

The content of those Maps/keySets depends on the AnalysisPlugins that are found during node bootstrap (there are a few components available OOTB but most of them come from modules/plugins). This means that if you install any additional plugin the list will expand. I found that every AnalysisPlugin exposes information about which analysis components it introduces to the system, and I am using this to provide more detailed information about the available (built-in) analysis components.

Notice

Because it is implemented as a plugin, I had to use the reflection API to gain access to information that is not exposed to plugins (hence plugin-security.policy is in place and plugin installation requires confirmation). If it were implemented as a core component, I would consider making some further changes directly in OpenSearch so that reflection would not be needed.

Example

Imagine OpenSearch with the following plugins installed:

GET http://localhost:9200/_cat/plugins?v

name                     component           version
Lukass-MacBook-Pro.local analysis-icu        2.4.1
Lukass-MacBook-Pro.local analysis-kuromoji   2.4.1
Lukass-MacBook-Pro.local analysis-phonetic   2.4.1
Lukass-MacBook-Pro.local node-analyzers      1.0.0.0-rc.1

This yields the following comprehensive list of analysis components:

GET http://localhost:9200/_nodes/analyzers?pretty

{
  "_nodes" : {
    "total" : 1,
    "successful" : 1,
    "failed" : 0
  },
  "cluster_name" : "opensearch",
  "nodes" : {
    "SSlN30D0RUmDieqwlmp4RA" : {
      "analyzers" : [
        [
          "standard",
          "german",
          "irish",
          "pattern",
          "sorani",
          "simple",
          "hungarian",
          "norwegian",
          "dutch",
          "chinese",
          "default",
          "estonian",
          "arabic",
          "bengali",
          "english",
          "fingerprint",
          "portuguese",
          "keyword",
          "romanian",
          "french",
          "czech",
          "greek",
          "indonesian",
          "swedish",
          "spanish",
          "danish",
          "russian",
          "cjk",
          "kuromoji",
          "armenian",
          "basque",
          "italian",
          "lithuanian",
          "thai",
          "persian",
          "catalan",
          "finnish",
          "stop",
          "brazilian",
          "turkish",
          "hindi",
          "bulgarian",
          "snowball",
          "whitespace",
          "galician",
          "icu_analyzer",
          "latvian"
        ]
      ],
      "tokenizers" : [
        [
          "standard",
          "lowercase",
          "kuromoji_tokenizer",
          "pattern",
          "thai",
          "uax_url_email",
          "PathHierarchy",
          "simple_pattern_split",
          "classic",
          "path_hierarchy",
          "edgeNGram",
          "nGram",
          "letter",
          "simple_pattern",
          "ngram",
          "keyword",
          "whitespace",
          "icu_tokenizer",
          "edge_ngram",
          "char_group"
        ]
      ],
      "tokenFilters" : [
        [
          "standard",
          "uppercase",
          "decimal_digit",
          "persian_normalization",
          "bengali_normalization",
          "flatten_graph",
          "kuromoji_readingform",
          "pattern_replace",
          "kuromoji_part_of_speech",
          "scandinavian_folding",
          "stemmer_override",
          "kuromoji_baseform",
          "multiplexer",
          "trim",
          "truncate",
          "fingerprint",
          "limit",
          "czech_stem",
          "word_delimiter_graph",
          "cjk_bigram",
          "german_normalization",
          "hindi_normalization",
          "pattern_capture",
          "kstem",
          "icu_collation",
          "arabic_stem",
          "condition",
          "stop",
          "min_hash",
          "hunspell",
          "brazilian_stem",
          "keep",
          "unique",
          "snowball",
          "edge_ngram",
          "icu_transform",
          "keyword_marker",
          "word_delimiter",
          "synonym_graph",
          "ja_stop",
          "kuromoji_number",
          "keep_types",
          "french_stem",
          "arabic_normalization",
          "elision",
          "icu_normalizer",
          "porter_stem",
          "sorani_normalization",
          "icu_folding",
          "hyphenation_decompounder",
          "stemmer",
          "synonym",
          "phonetic",
          "nGram",
          "german_stem",
          "delimited_payload",
          "cjk_width",
          "lowercase",
          "serbian_normalization",
          "scandinavian_normalization",
          "length",
          "remove_duplicates",
          "reverse",
          "apostrophe",
          "russian_stem",
          "dutch_stem",
          "kuromoji_stemmer",
          "classic",
          "edgeNGram",
          "predicate_token_filter",
          "asciifolding",
          "concatenate_graph",
          "indic_normalization",
          "shingle",
          "common_grams",
          "ngram",
          "dictionary_decompounder"
        ]
      ],
      "charFilters" : [
        [
          "mapping",
          "html_strip",
          "kuromoji_iteration_mark",
          "icu_normalizer",
          "pattern_replace"
        ]
      ],
      "normalizers" : [
        [
          "lowercase"
        ]
      ],
      "plugins" : {
        "plugin" : {
          "name" : "org.opensearch.analysis.common.CommonAnalysisPlugin",
          "analyzers" : [
            [
              "arabic",
              "armenian",
              "basque",
              "bengali",
              "brazilian",
              "bulgarian",
              "catalan",
              "chinese",
              "cjk",
              "czech",
              "danish",
              "dutch",
              "english",
              "estonian",
              "fingerprint",
              "finnish",
              "french",
              "galician",
              "german",
              "greek",
              "hindi",
              "hungarian",
              "indonesian",
              "irish",
              "italian",
              "latvian",
              "lithuanian",
              "norwegian",
              "pattern",
              "persian",
              "portuguese",
              "romanian",
              "russian",
              "snowball",
              "sorani",
              "spanish",
              "swedish",
              "thai",
              "turkish"
            ]
          ],
          "tokenizers" : [
            [
              "PathHierarchy",
              "char_group",
              "classic",
              "edgeNGram",
              "edge_ngram",
              "keyword",
              "letter",
              "lowercase",
              "nGram",
              "ngram",
              "path_hierarchy",
              "pattern",
              "simple_pattern",
              "simple_pattern_split",
              "thai",
              "uax_url_email",
              "whitespace"
            ]
          ],
          "tokenFilters" : [
            [
              "apostrophe",
              "arabic_normalization",
              "arabic_stem",
              "asciifolding",
              "bengali_normalization",
              "brazilian_stem",
              "cjk_bigram",
              "cjk_width",
              "classic",
              "common_grams",
              "concatenate_graph",
              "condition",
              "czech_stem",
              "decimal_digit",
              "delimited_payload",
              "dictionary_decompounder",
              "dutch_stem",
              "edgeNGram",
              "edge_ngram",
              "elision",
              "fingerprint",
              "flatten_graph",
              "french_stem",
              "german_normalization",
              "german_stem",
              "hindi_normalization",
              "hyphenation_decompounder",
              "indic_normalization",
              "keep",
              "keep_types",
              "keyword_marker",
              "kstem",
              "length",
              "limit",
              "lowercase",
              "min_hash",
              "multiplexer",
              "nGram",
              "ngram",
              "pattern_capture",
              "pattern_replace",
              "persian_normalization",
              "porter_stem",
              "predicate_token_filter",
              "remove_duplicates",
              "reverse",
              "russian_stem",
              "scandinavian_folding",
              "scandinavian_normalization",
              "serbian_normalization",
              "snowball",
              "sorani_normalization",
              "stemmer",
              "stemmer_override",
              "synonym",
              "synonym_graph",
              "trim",
              "truncate",
              "unique",
              "uppercase",
              "word_delimiter",
              "word_delimiter_graph"
            ]
          ],
          "charFilters" : [
            [
              "html_strip",
              "mapping",
              "pattern_replace"
            ]
          ],
          "hunspellDictionaries" : [
            [ ]
          ]
        },
        "plugin" : {
          "name" : "org.opensearch.plugin.analysis.AnalysisPhoneticPlugin",
          "analyzers" : [
            [ ]
          ],
          "tokenizers" : [
            [ ]
          ],
          "tokenFilters" : [
            [
              "phonetic"
            ]
          ],
          "charFilters" : [
            [ ]
          ],
          "hunspellDictionaries" : [
            [ ]
          ]
        },
        "plugin" : {
          "name" : "org.opensearch.plugin.analysis.icu.AnalysisICUPlugin",
          "analyzers" : [
            [
              "icu_analyzer"
            ]
          ],
          "tokenizers" : [
            [
              "icu_tokenizer"
            ]
          ],
          "tokenFilters" : [
            [
              "icu_normalizer",
              "icu_folding",
              "icu_transform",
              "icu_collation"
            ]
          ],
          "charFilters" : [
            [
              "icu_normalizer"
            ]
          ],
          "hunspellDictionaries" : [
            [ ]
          ]
        },
        "plugin" : {
          "name" : "org.opensearch.plugin.analysis.kuromoji.AnalysisKuromojiPlugin",
          "analyzers" : [
            [
              "kuromoji"
            ]
          ],
          "tokenizers" : [
            [
              "kuromoji_tokenizer"
            ]
          ],
          "tokenFilters" : [
            [
              "kuromoji_baseform",
              "kuromoji_stemmer",
              "ja_stop",
              "kuromoji_number",
              "kuromoji_readingform",
              "kuromoji_part_of_speech"
            ]
          ],
          "charFilters" : [
            [
              "kuromoji_iteration_mark"
            ]
          ],
          "hunspellDictionaries" : [
            [ ]
          ]
        }
      }
    }
  }
}
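
When consuming this output programmatically, two things in the response above need care: each list is wrapped in an extra array, and the "plugins" object repeats the "plugin" key, so a typical JSON parser keeps only the last entry. Below is a minimal Python sketch for handling both, using a shortened copy of the response. The helper names collect_plugins and flatten are made up for this example.

```python
import json

# Shortened copy of one node's section of the response above.
SAMPLE = """
{
  "analyzers": [["standard", "kuromoji", "icu_analyzer"]],
  "plugins": {
    "plugin": {
      "name": "org.opensearch.plugin.analysis.icu.AnalysisICUPlugin",
      "analyzers": [["icu_analyzer"]]
    },
    "plugin": {
      "name": "org.opensearch.plugin.analysis.kuromoji.AnalysisKuromojiPlugin",
      "analyzers": [["kuromoji"]]
    }
  }
}
"""


def collect_plugins(pairs):
    """object_pairs_hook that keeps every repeated "plugin" entry in a list."""
    obj = {}
    for key, value in pairs:
        if key == "plugin":
            obj.setdefault("plugin", []).append(value)
        else:
            obj[key] = value
    return obj


def flatten(nested):
    """Turn [["a", "b"]] into ["a", "b"]."""
    return [name for group in nested for name in group]


node = json.loads(SAMPLE, object_pairs_hook=collect_plugins)

# Map each analyzer name to the plugin class that reports it.
origin = {
    name: plugin["name"]
    for plugin in node["plugins"]["plugin"]
    for name in flatten(plugin["analyzers"])
}

# Names that no listed plugin reports (in this shortened sample).
unclaimed = [n for n in flatten(node["analyzers"]) if n not in origin]

print(origin)
print(unclaimed)
```

Without the hook, json.loads(SAMPLE) would silently drop the first "plugin" object, which is easy to miss when scripting against this output.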

What is next?

First, I would like to get some feedback. I think this feature can be useful (self-documentation is a perfect example). If this functionality is found useful, I am more than happy to prepare a PR as a core component (and not as a standalone plugin).
Second, I would like to explore how much structural information I can get about individual analysis components. For example, I would like to be able to provide more information about the internals of analyzers (which tokenizers and filters an analyzer is composed of, whether it wraps another analyzer, etc.).

HTH, Lukáš

dtaivpp commented 1 year ago

@lukas-vlcek This looks slick! I will try it out and see about getting it socialized a bit so we can have some feedback.

lukas-vlcek commented 1 year ago

FYI, if you are testing the 1.0.0-rc.1 plugin, please be aware of some known issues and also of some fixes that are not included in that release. Of course you can always build the plugin from the source...

dtaivpp commented 1 year ago

I am just building it myself. Two questions:

@lukas-vlcek can you present this on 1/17? https://forum.opensearch.org/t/opensearch-community-meeting-2023-0117/11891
Checking my understanding here. When I create a new index template with an analyzer, should that show up in this list, or is it just for core analyzers?

Example:

PUT _template/twitter
{
  "index_patterns": [
    "twitter*"
  ],
  "template": {
    "settings": {
      "analysis": {
        "analyzer": {
          "text_analyzer": {
            "tokenizer": "standard",
            "filter": [ "stop" ]
          }
        }
      }
    },
    "mappings": {}
  }
}

Here I was thinking text_analyzer would show up in the list, but it didn't from what I could tell. I queried as both privileged and unprivileged users.

lukas-vlcek commented 1 year ago

@dtaivpp

Yes, I am glad to present this work. Feel free to include me, I am already signed up for the meeting.
The plugin pulls the list of all the "built-in" analysis components, that is, the text analysis building blocks available to all users (these are defined at the node level). It is by definition a static list: it does not change during the life of the cluster (apart from a rolling upgrade, which can bring in a new version of OpenSearch or add another AnalysisPlugin). The suggested new API is primarily meant to provide complementary information for the documentation (or, as you pointed out earlier, "generated documentation"). The custom/ad-hoc analysis components defined at the index level, on the other hand, are a different thing. They can change frequently and they live at the index level (so maybe /_index/analyzers would be a more appropriate endpoint for such information). The biggest difference is that they are not available to all users: if a user does not have the privileges to see the index, they should not see its analysis components either (some sensitive information could leak this way). Other users cannot "re-use" these analyzers either; they always have to recreate them on their own indices, etc.
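
For comparison, index-level analyzers are already visible to anyone who can read the settings of an index. Below is a minimal Python sketch, assuming the usual shape of a GET <index>/_settings response. The index names twitter-1 and logs-1 and the sample payload are invented for illustration, and custom_analyzers is a hypothetical helper.

```python
# Invented sample that mimics the shape of a GET <index>/_settings response.
settings_response = {
    "twitter-1": {
        "settings": {
            "index": {
                "analysis": {
                    "analyzer": {
                        "text_analyzer": {
                            "tokenizer": "standard",
                            "filter": ["stop"],
                        }
                    }
                }
            }
        }
    },
    "logs-1": {"settings": {"index": {"number_of_shards": "1"}}},
}


def custom_analyzers(response):
    """Return {index_name: sorted analyzer names} for indices that define any."""
    found = {}
    for index_name, body in response.items():
        analysis = body.get("settings", {}).get("index", {}).get("analysis", {})
        names = sorted(analysis.get("analyzer", {}))
        if names:
            found[index_name] = names
    return found


print(custom_analyzers(settings_response))
```

Because this information comes from index settings, it is only returned for indices the caller is allowed to read, which matches the visibility concern described above.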

andrross commented 1 year ago

My two cents here is that implementing this as a core component is the right way to go architecturally. Unfortunately I missed the January 17 community meeting but is there any additional feedback to incorporate here as to the structure of the API itself?

lukas-vlcek commented 1 year ago

@andrross I think the best place to provide feedback about this functionality is here, in this ticket. I am going to release a new RC version because some issues with the output format have been fixed in the meantime, and OpenSearch 2.5 has been released as well.

Yes, I agree the best way would be to integrate it directly into the core. But as a proof of concept it was easier for me to implement it as a plugin because I did not have to care much about frequent changes happening in the main branch.

As for the output format, I remember one piece of feedback was that this information could be part of the _cat API as opposed to introducing a new REST API. I liked this idea initially, but now I do not think it would be a good fit, mostly because I cannot think of a good response format for the _cat API.

andrross commented 1 year ago

I think the best place to provide feedback about this functionality is here, in this ticket.

Agreed! I was just asking to capture any feedback from the meeting into this ticket :)

...could be part of the _cat API

Yeah it does seem to be hard to model the structured/nested data in this API in the CAT format. On that front though, the large JSON response payload isn't the most human-readable format, so some sort of admin UI would seem to be a good fit here.
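
To make the difficulty concrete, here is one possible flattening of the nested response into _cat-style rows. This is only a sketch in Python: the column layout (node, type, name) is an assumption and not a proposed format, and the sample is a shortened copy of the response shown earlier in this thread.

```python
def to_rows(nodes):
    """Yield (node_id, component_type, name) tuples from the "nodes" section."""
    for node_id, info in nodes.items():
        for component_type, groups in info.items():
            if component_type == "plugins":
                continue  # the per-plugin breakdown does not fit a flat row
            for group in groups:
                for name in group:
                    yield node_id, component_type, name


nodes = {
    "SSlN30D0RUmDieqwlmp4RA": {
        "analyzers": [["standard", "kuromoji"]],
        "normalizers": [["lowercase"]],
        "plugins": {},
    }
}

rows = list(to_rows(nodes))

# Pad the type column so the rows line up like _cat output.
width = max(len(component_type) for _, component_type, _ in rows)
for node_id, component_type, name in rows:
    print(f"{node_id} {component_type:<{width}} {name}")
```

Even this small sample shows the trade-off: one row per component is easy to read and grep, but it loses the link between a component and the plugin that provides it.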

lukas-vlcek commented 1 year ago

@macohen If there are any questions, feel free to ping me, I am happy to help.

macohen commented 1 year ago

@lukas-vlcek are you planning to keep going on this? I would encourage that! Mostly I brought it into the Search Applications Vertical project because there's some alignment there in other ways.

macohen commented 1 year ago

@andrross do you think the admin UI is required to launch this? @lukas-vlcek, are you able to turn this into a core component? I agree with Andrew because some analyzers are in core already.

lukas-vlcek commented 1 year ago

@macohen Making it a core component is perfectly possible and will make implementation a little bit more transparent/clean.

msfroh commented 1 year ago

I just wanted to call out a related behavior that I just learned about for pipelines and processor plugins.

The NodeInfo class has a field of type IngestInfo that keeps track of ingest processors available on every node. When a new ingest pipeline is created, the node that receives the request fetches all of the NodeInfos, and confirms that every processor used in the pipeline is available on every node (to avoid a situation where some nodes fail to run the pipeline). I just added similar logic for search pipelines, since I copied the idea from ingest pipelines.

It feels like we would have a similar situation with analyzers, where you could specify an analyzer chain for a given field in your mapping, but it would only work reliably if every component of the chain is available on every node. @lukas-vlcek, do you know off-hand how mappings accomplish that? (I'm guessing there's got to be some kind of validation to make sure that analyzer plugins are installed everywhere, right?)

I'm wondering if there might be some opportunity to make the implementations more consistent between analyzers and processor pipelines (either putting everything into NodeInfo, so it all gets returned via the /_nodes API, or we could move the pipeline processor info into a sub-API like you've done here for analyzers, making NodeInfo a little smaller).
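
A rough sketch of the kind of availability check described above, written in Python against per-node name sets such as those returned by the prototype's _nodes/analyzers endpoint. The node IDs, the missing_on_nodes helper and the sample data are hypothetical; this is not how OpenSearch validates ingest pipelines or mappings internally.

```python
from typing import Dict, Iterable, Set


def missing_on_nodes(required: Iterable[str],
                     available: Dict[str, Set[str]]) -> Dict[str, Set[str]]:
    """Return {node_id: missing component names} for nodes lacking anything."""
    required_set = set(required)
    return {
        node_id: required_set - names
        for node_id, names in available.items()
        if required_set - names
    }


# Hypothetical token filter names reported by two nodes.
token_filters_by_node = {
    "node-a": {"lowercase", "stop", "phonetic"},
    "node-b": {"lowercase", "stop"},  # analysis-phonetic not installed here
}

print(missing_on_nodes(["lowercase", "phonetic"], token_filters_by_node))
```

An empty result means every required component is present on every node; anything else names the nodes where a chain using those components could fail.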

lukas-vlcek commented 1 year ago

@msfroh Thanks for looking at this. I am going to look at that.

macohen commented 1 year ago

Checking in @lukas-vlcek... need anything on this one?

lukas-vlcek commented 1 year ago

Pushed first draft PR. Except for some more tests it should be ready for review. The documentation PR is missing at this point (I will open it shortly).

macohen commented 10 months ago

@krishna-ggk can you please take a look at this issue and the associated PR?

@lukas-vlcek, do you think this is still on track for 2.12? Not a problem to move to 2.13, but just wanted to make sure we get everything aligned for 2.12 if this is still good to go.

Thanks!

macohen commented 9 months ago

@lukas-vlcek Jan 9th is the current entry window for 2.12. If we work back from there, I'd say having the PR merged by Jan 7th would be a good target to know if this will make it. Should we move to 2.13?

lukas-vlcek commented 9 months ago

@macohen Sorry for not replying sooner, I will do my best to make it by Jan 7th. It is on my priority list now.

kiranprakash154 commented 8 months ago

Hi, are we on track for this to be released in 2.12 ?

lukas-vlcek commented 8 months ago

Hi @kiranprakash154, it depends on when the code freeze for 2.12 is and whether we get more reviews on this PR. I am currently finishing the documentation PR; I will push it on Monday.

macohen commented 8 months ago

Code freeze for 2.12 is Feb 6th. @lukas-vlcek do you need any assistance to get this in?

lukas-vlcek commented 8 months ago

@macohen

The documentation part has already been reviewed and approved (https://github.com/opensearch-project/documentation-website/pull/6252).
The code part is (IMO) suffering from flaky tests and still needs more reviews (https://github.com/opensearch-project/OpenSearch/pull/10296).

hdhalter commented 7 months ago

@macohen - Can we please bump this up to release train 2.13?

hdhalter commented 6 months ago

@macohen - Can we please bump this up to release train 2.13?

Since this is still on the 2.13 roadmap, I'll move it to the 2.13 release train in the project.

hdhalter commented 6 months ago

Hi @lukas-vlcek, we closed the doc issue for "List Analyzers Through _cat" (https://github.com/opensearch-project/documentation-website/issues/5426), but I don't think there was a doc issue for this one, specifically. Are we releasing this in 2.13 and will it need documentation? Thanks!

dblock commented 6 months ago

https://github.com/opensearch-project/OpenSearch/pull/10296 is next to be merged, then we can update documentation for 2.13 accordingly.

getsaurabh02 commented 5 months ago

@lukas-vlcek should I update the tag to 2.15?
