Term aggregation in custom query for features

opendistro-for-elasticsearch / anomaly-detection

A machine learning plugin in Open Distro for real time anomaly detection on streaming data.

https://opendistro.github.io/for-elasticsearch-docs/docs/ad/

Apache License 2.0

Term aggregation in custom query for features #88

Closed amirmuminovic closed 4 years ago

amirmuminovic commented 4 years ago

I tried to run anomaly detection on a data set to locate anomalous site visits based on the country of origin. I aggregated by the country field, but the detector showed no data, even though I had seeded some anomalous data.

Can features be aggregated by terms? If not, is that feature planned to be supported?

kaituo commented 4 years ago

Which API or Kibana pages are you using? How many data points do you have?

amirmuminovic commented 4 years ago

I am using the "Model Definition" Kibana page and specifying the aggregation query in the "Add Feature" section.

I am testing anomaly detection with 500 000 data points over the span of three months.

kaituo commented 4 years ago

What is the date range of your aggregation query, and how many data points are in that range? We need at least 400 data points to train our models. With fewer than that, we do not show the feature data and anomaly grade graphs.
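
A rough way to check the 400-point minimum against a date range is sketched below. It assumes that a "data point" means one aggregated feature value per detector interval, which the thread does not state, and the interval lengths are illustrative only.

```python
# Assumption: one data point per detector interval. The thread only gives
# the minimum (400) and the date range (5 days); the intervals are made up.
MIN_TRAINING_POINTS = 400


def points_in_range(range_days: int, interval_minutes: int) -> int:
    """Number of whole detector intervals that fit in the date range."""
    return (range_days * 24 * 60) // interval_minutes


for interval in (1, 10, 30):
    n = points_in_range(5, interval)
    print(f"{interval}-minute interval: {n} points, enough: {n >= MIN_TRAINING_POINTS}")
```

Under that assumption, a 5-day range clears the minimum at a 10-minute interval (720 points) but not at a 30-minute interval (240 points).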

amirmuminovic commented 4 years ago

The date range is 5 days. There are definitely enough points.

I always try the model first by adding a feature and using a single-value aggregation, for this example let's say avg API response time. I get a nice graph with possible anomalies so I am sure that the number of data points isn't the problem.

My problem is that I am not sure whether your model accepts a terms aggregation. Rather than giving it a single numerical value, I give it an array of objects, i.e.: [ { "country": "US", "doc_count": 1000 }, { "country": "CA", "doc_count": 700 }, { "country": "FR", "doc_count": 2 } ]
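
To make the difference concrete, here is a minimal sketch of the two aggregation shapes written as Python dictionaries. The aggregation names, field names, and the average value are hypothetical. Note that Elasticsearch itself returns each terms bucket with a `key` field rather than `country`.

```python
# Hypothetical feature definitions in Elasticsearch query DSL form.
single_value_feature = {"avg_response_time": {"avg": {"field": "response_time"}}}
terms_feature = {"visits_by_country": {"terms": {"field": "country"}}}

# Shapes of the corresponding aggregation results.
# A metric aggregation such as avg yields one number per query:
single_value_result = {"value": 183.4}
# A terms aggregation yields one bucket per distinct term:
terms_result = {
    "buckets": [
        {"key": "US", "doc_count": 1000},
        {"key": "CA", "doc_count": 700},
        {"key": "FR", "doc_count": 2},
    ]
}


def is_single_value(agg_result: dict) -> bool:
    """True if the result carries exactly one numeric value under "value"."""
    return isinstance(agg_result.get("value"), (int, float))


print(is_single_value(single_value_result))  # True
print(is_single_value(terms_result))         # False
```

The first shape gives one number per query, while the second gives a list whose length depends on how many countries appear in the data.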

In this case I would like FR (France) to be marked as an anomaly because the document count is greatly different from other countries.
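
As an illustration of the outcome described here, the sketch below flags buckets whose count is far below the median count. This is a plain threshold check written for this example, not what the plugin does, and the `ratio` parameter is an arbitrary assumption.

```python
from statistics import median


def rare_buckets(buckets, ratio=0.05):
    """Return countries whose doc_count is below ratio * median doc_count."""
    typical = median(b["doc_count"] for b in buckets)
    return [b["country"] for b in buckets if b["doc_count"] < ratio * typical]


buckets = [
    {"country": "US", "doc_count": 1000},
    {"country": "CA", "doc_count": 700},
    {"country": "FR", "doc_count": 2},
]
print(rare_buckets(buckets))  # ['FR']
```

A check like this compares countries with each other at a single point in time. It does not look at how each country's count changes over time.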

Is this feature supported or am I doing something wrong?

ylwu-amzn commented 4 years ago

@amirmuminovic We don't support this use case currently. Do you want to find anomalies in non-time-series data?

amirmuminovic commented 4 years ago

I want to find anomalies in time series data, but I want the feature input to be the result of a bucket aggregation (like terms), not a single-value aggregation (avg, sum)...

amirmuminovic commented 4 years ago

@ylwu-amzn @kaituo Any updates on this?

ylwu-amzn commented 4 years ago

@amirmuminovic We don't support this currently. This is a high-cardinality problem, which is already in our plans. I will pass your case on to the team. You are welcome to share more use cases so we can focus on the most valuable features.
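
One way to read "high cardinality" here is that a terms aggregation yields one time series per distinct term, so the number of series to model grows with the number of countries. The sketch below illustrates that reading only. The helper name, timestamps, and counts are made up for the example.

```python
def series_per_term(interval_buckets):
    """Pivot per-interval term counts into one time series per term.

    interval_buckets: list of (interval_start, {term: doc_count}) in time order.
    Terms missing from an interval are counted as 0.
    """
    terms = sorted({t for _, counts in interval_buckets for t in counts})
    series = {t: [] for t in terms}
    for _, counts in interval_buckets:
        for t in terms:
            series[t].append(counts.get(t, 0))
    return series


# Hypothetical hourly terms-aggregation results.
sample = [
    ("2020-04-01T00:00", {"US": 1000, "CA": 700, "FR": 2}),
    ("2020-04-01T01:00", {"US": 980, "CA": 710}),
]
series = series_per_term(sample)
print(len(series))   # 3 series, one per country
print(series["FR"])  # [2, 0]
```

Each resulting list is a single-value series of the kind the detector already accepts, but there is one of them for every distinct term.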

amirmuminovic commented 4 years ago

@ylwu-amzn Thank you for the reply, I'm happy to hear it's in your plans :D