regel / loudml

Loud ML is the first open-source AI solution for ICT and IoT automation

Elasticsearch refine dataset with lucene queries #70

Open camAtGitHub opened 6 years ago

camAtGitHub commented 6 years ago

I have a few questions regarding LoudML and Elasticsearch.

1). With Loud ML, is it possible to provide some sort of Elasticsearch Lucene query to reduce the data set? For example, I have all my SSH logins/failures in one index, but I really only want to create a model based on a few servers. A Lucene query like “host:server OR host:server2 OR host:server3” would provide the correct data.

2). Is there integration with Grafana (not Chronograf) available? – I think this would be a killer feature, because then I could leverage the Elasticsearch datastore with the Grafana GUI and get Loud predictions in GUI format.

Thanks

regel commented 5 years ago

@camAtGitHub, Hi, many thanks for your feedback!

No Grafana integration yet, I will reply in private.

On point 1, we've implemented "match_all", which narrows down the data set with x AND y AND z conditions. The equivalent OR conditions are not supported yet. It's difficult to support the extended Lucene query format. Do you think basic AND/OR would cover most needs?

camAtGitHub commented 5 years ago

@regel Re: match_all - For me, the LoudML documentation lacks examples of how to use 'match_all'. Unfortunately, the Elasticsearch documentation is of no use unless you're an Elasticsearch programmer, IMO.

An Elasticsearch query of:

GET _search

{
  "query":{
    "bool":{
      "must":[
        { "term":{ "program":"sshd" } },
        { "term":{ "authresult":"failure" } }
      ]
    }
  }
}

Provides the dataset I want, but I have no idea how to apply this in the 'match_all' context with LoudML.

Cheers

jorgelbg commented 5 years ago

@camAtGitHub The syntax of the match_all section is the same for both data sources. Underneath, it is translated into a WHERE tag selector on InfluxDB and into an equivalent form on ES, where the tag corresponds to the field name.

"match_all": [
    {"tag": "program", "value": "sshd"},
    {"tag": "authresult", "value": "failure"}
]

If you want to keep this complexity hidden from the LoudML config, you can set up a filtered index alias (https://www.elastic.co/guide/en/elasticsearch/reference/current/indices-aliases.html#filtered).
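As a rough sketch of the alias approach, the snippet below builds a request body for the Elasticsearch `_aliases` API with a filter covering the three hosts from the original question. The index name `ssh-logs`, the alias name `ssh-logs-selected` and the field `host.keyword` are assumptions for illustration only; adjust them to your own mapping.

```python
import json


def filtered_alias_body(index, alias, field, values):
    """Build a body for POST /_aliases that adds a filtered alias.

    The filter is a `terms` query, which matches documents whose
    field equals any of the given values (an OR over the values).
    """
    return {
        "actions": [
            {
                "add": {
                    "index": index,
                    "alias": alias,
                    "filter": {"terms": {field: list(values)}},
                }
            }
        ]
    }


# Hypothetical names: replace with your own index, alias and field.
body = filtered_alias_body(
    "ssh-logs",
    "ssh-logs-selected",
    "host.keyword",
    ["server", "server2", "server3"],
)
print(json.dumps(body, indent=2))
```

The printed JSON can be sent with `POST /_aliases`. This assumes the data source accepts an alias name wherever it accepts an index name, which is how Elasticsearch search requests behave in general but is not confirmed for LoudML in this thread.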

Perhaps having an option to print the ES query that is going to run could be useful, but the LoudML config is agnostic about which data source you use. In this particular case, it is true that tag looks more focused on InfluxDB.
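To make the translation more concrete, here is a minimal sketch of how match_all entries could map onto an Elasticsearch bool query. This is not LoudML's actual implementation: the function name `match_all_to_es_query` is hypothetical, and the real request presumably also carries a time range and aggregations.

```python
import json


def match_all_to_es_query(match_all):
    """Translate match_all entries into an Elasticsearch bool query.

    Each {"tag": ..., "value": ...} entry becomes one `term` clause,
    and all clauses must match (an AND over the entries).
    """
    clauses = [{"term": {entry["tag"]: entry["value"]}} for entry in match_all]
    return {"query": {"bool": {"must": clauses}}}


match_all = [
    {"tag": "program", "value": "sshd"},
    {"tag": "authresult", "value": "failure"},
]
print(json.dumps(match_all_to_es_query(match_all), indent=2))
```

For these two entries the result has the same shape as the `bool`/`must` query posted earlier in this thread, so it can be pasted into Kibana Dev Tools to check that the filter alone returns data.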

camAtGitHub commented 5 years ago

@jorgelbg unfortunately that syntax just doesn't work; LoudML (as reported elsewhere) just reports:

INFO:root:Aggregations for model ssh-anomalies: Missing data
ERROR:root:no data found for time range 2018-11-13T22:15:00.000Z-2018-11-20T22:30:00.000Z

Although there 100% is data...

jorgelbg commented 5 years ago

@camAtGitHub which version of loudml are you using? Previously I had something similar but was related to the measurement setting in the model being ignored (which translates to the document type on ES). You should also check the document type (_type field on ES/Kibana).

I've been experimenting using the following model with the latest version 1.4.3 and it works.

{
    "bucket_interval": "5m",
    "default_datasource": "elastic",
    "timestamp_field": "@timestamp",
    "measurement": "logs",
    "features": [
      {
        "default": 0,
        "metric": "max",
        "field": "error_count",
        "measurement": "logs",
        "name": "error_count",
        "anomaly_type": "low_high",
        "match_all": [
          {"tag": "user_id", "value": "1234"}
        ]
      }
    ],
    "seasonality": {
        "daytime": true,
        "weekday": true
    },
    "interval": 10,
    "max_evals": 10,
    "name": "error_count",
    "offset": 120,
    "forecast": 30,
    "span": 30,
    "max_threshold": 25,
    "min_threshold": 10,
    "type": "timeseries"
}

camAtGitHub commented 5 years ago

@jorgelbg - Just tried with 1.4.3 and nothing. Interestingly, the LoudML docs state:

For Elasticsearch data source, the measurement is not used. You can set the doc_type in config.yml data source settings. Default is doc if not set

https://loudml.io/guide/en/loudml/reference/current/timeseries-dsl.html

jorgelbg commented 5 years ago

@camAtGitHub Yes, that is the new behavior. Before I expected the measurement field to translate to the doc_type, which it didn't. You can see issue #42 for more info.

It's strange that it doesn't work. You could probably use tcpdump/wireshark to sniff the outgoing request and check what payload is being sent. If you have a custom document type on ES, then you need to set the doc_type in the config.yml. As stated in the documentation, the default is doc. The doc_type changes the URL of the request, so a wrong value would return no results.
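To show why the document type matters, here is a small sketch of how a typed search path is formed in Elasticsearch 6.x, where the type sits between the index name and `_search`. The index name `ssh-logs` and the type `syslog` are hypothetical, and whether LoudML builds its request exactly this way is an assumption.

```python
def search_path(index, doc_type=None):
    """Return the path of a search request for an index.

    In Elasticsearch 6.x the document type, when given, is a path
    segment between the index name and `_search`.
    """
    if doc_type:
        return "/{}/{}/_search".format(index, doc_type)
    return "/{}/_search".format(index)


# Hypothetical index name; "doc" is the default quoted from the docs above.
print(search_path("ssh-logs", "doc"))
print(search_path("ssh-logs", "syslog"))
```

A request to the second path only sees documents of type `syslog`, so if the documents are actually stored under `doc` it comes back empty even though the index holds data.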

camAtGitHub commented 5 years ago

@jorgelbg - the doc type is 100% 'doc'. Elastic runs full SSL, so wireshark is ... hard - I would have to MITM the connection...

camAtGitHub commented 5 years ago

I managed to get it working with:

{
  "bucket_interval": "15m",
  "default_datasource": "elastic1",
  "timestamp_field": "@timestamp",
  "features": [
    {
      "default": 0,
      "metric": "count",
      "name": "ssh_request_count",
      "match_all": [ {"tag": "program.keyword", "value": "sshd"}, {"tag": "authresult.keyword", "value": "failure"} ],
      "field": "src_port",
      "anomaly_type": "low_high"
    }
  ],
  "interval": 60,
  "max_evals": 10,
  "name": "ssh-anomalies",
  "offset": 0,
  "forecast": 5,
  "span": 20,
  "max_threshold": 0,
  "min_threshold": 0,
  "type": "timeseries"
}

Although it worked, I'm stuck on the next step - making it do anything useful....

ghost commented 5 years ago

Seconded - support for Grafana would be awesome. 👍

regel commented 5 years ago

@bdeam @camAtGitHub @jorgelbg : support for Grafana discussion in the community forum, https://community.grafana.com/t/metrics-forecast-and-outlier-detection-automl-automation/13906

6.x seems to use React, it's good news!

truongkendy commented 5 years ago

I encounter this error when creating a new model:

truongkendy commented 5 years ago

"Unsupported model (type = 'timeseries')"