opensearch-project / anomaly-detection-dashboards-plugin

Manage your detectors and identify atypical data in OpenSearch Dashboards
https://opensearch.org/docs/latest/monitoring-plugins/ad/index/
Apache License 2.0
30 stars, 58 forks

Occasional missing data even though heatmap cell indicates anomalies #119

Open ohltyler opened 3 years ago

ohltyler commented 3 years ago

Occasionally, a heatmap cell summary will indicate an anomaly present, but when clicked on, shows 0 available anomalies.

The anomaly summaries and the anomaly data are fetched in 2 different calls, so likely the issue has to do with the time bounds being different between the two queries. If an anomaly is on the edge, it may be getting included in the summary, but not included in the raw results, leading to the discrepancy.

Screenshot of the error: Screen Shot 2021-11-07 at 5 29 15 PM
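The suspected cause can be illustrated with a minimal sketch. It assumes, purely for illustration, that the summary call treats the cell's end time as inclusive while the detail call treats it as exclusive; the `AnomalyPoint` shape and both function names are hypothetical and are not taken from the plugin's code.

```typescript
// Hypothetical shape of a single anomaly result; field names are illustrative only.
interface AnomalyPoint {
  endTime: number; // epoch milliseconds
  grade: number;
}

// Summary-style check: counts anomalies whose end time lies in [start, end], end inclusive.
function countInclusive(points: AnomalyPoint[], start: number, end: number): number {
  return points.filter((p) => p.grade > 0 && p.endTime >= start && p.endTime <= end).length;
}

// Detail-style fetch: returns anomalies whose end time lies in [start, end), end exclusive.
function fetchExclusive(points: AnomalyPoint[], start: number, end: number): AnomalyPoint[] {
  return points.filter((p) => p.grade > 0 && p.endTime >= start && p.endTime < end);
}

// One anomaly that sits exactly on the cell's upper edge.
const cellStart = 1000;
const cellEnd = 2000;
const edgePoints: AnomalyPoint[] = [{ endTime: 2000, grade: 0.8 }];

const summaryCount = countInclusive(edgePoints, cellStart, cellEnd);
const detailRows = fetchExclusive(edgePoints, cellStart, cellEnd);

// The summary reports one anomaly while the detail view has nothing to show.
console.log(summaryCount, detailRows.length);
```

If the two real queries differ in this way, aligning their bounds (or deriving both views from one response) would remove the mismatch.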

prpaluch commented 2 years ago

OpenSearch version 1.1.0.0: I noticed this odd behavior too, but in our case it is not occasional; it does not seem to work at all. The live dashboard shows live anomalies, but these are not present in the anomaly overview of the heatmap, or vice versa. If you click on an anomaly entity (red square) in the heatmap, it should show at least the anomaly occurrence in the table, but it shows 0 as described above. I think it should also show anomaly occurrences in the anomaly grade/confidence graph, but instead it displays "Click on an anomaly entity to view data".

In OpenSearch version 1.0.0.0, the entity name appeared in the anomaly/confidence graph and, as far as I can remember, it showed the anomaly occurrences, but it never showed a confidence/anomaly graph. We only see a confidence graph and anomaly detection when we explicitly do not use a categorical field and instead use a data filter for exactly one entity.

ohltyler commented 2 years ago

@prpaluch hello, thanks for bringing this issue up. Regarding the differences between 1.1.0.0 and 1.0.0.0: there were some changes in 1.1.0.0 that reformatted the layout of the charts shown below the heatmap, but they should not have affected whether populated anomalies are shown. Additionally, both versions show the same content (anomaly results table, anomaly results chart, and feature breakdown).

If you are seeing "Click on an anomaly entity to view data", then the chart isn't recognizing that a heatmap cell has been selected. If a cell is selected but there are no results, you should see the error shown in the screenshot above ("There are no anomalies currently."). If you are consistently seeing the former when clicking on a heatmap cell, that may be because of cluster load (the charts taking a long time to update) or some separate issue. If so, can you open a separate issue describing how to reproduce that bug? I would be happy to help and dig into that problem, thanks.

prpaluch commented 2 years ago

Hi @ohltyler, is there an option to debug this behavior, or to find out the cluster load or whether the cluster is overloaded? The detector runs on several indices matched with a wildcard (*). Each index can grow to 1-2 TB of data. Data for a specific index can be delayed by 1-2 days, meaning data can be written to an index that is 2 days old. Currently the whole configuration runs as a POC on 3 nodes with a total of 264 GB of memory, to see how anomaly detection performs with our data situation. At the moment nothing happens when we click on a heatmap square except that all other cells are greyed out; the chart stays the same even after waiting 1 minute.

prpaluch commented 2 years ago

One note: via GET _plugins/_anomaly_detection/stats and POST _plugins/_anomaly_detection/detectors/results/_search, data seems to be available for each detector.

ohltyler commented 2 years ago

@prpaluch thanks for those details. When a cell is selected, the plugin makes a call to fetch the individual anomaly results for the detector filtered on the cell's time range. My assumption is that request is timing out due to cluster load, or there is a bug in displaying the results.
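To check by hand what such a request returns, a comparable query body can be sent to the results search endpoint mentioned earlier in this thread. The sketch below only builds the body. It assumes the result documents carry `detector_id`, `anomaly_grade`, and `data_end_time` fields; the function name, the page size, and the example time range are hypothetical, and the plugin's actual query may differ.

```typescript
// Builds a search body for one detector's anomaly results within a cell's time range.
// Assumption: results expose `detector_id`, `anomaly_grade`, and `data_end_time`.
function buildCellResultsQuery(detectorId: string, startMs: number, endMs: number): object {
  return {
    size: 100, // hypothetical page size
    query: {
      bool: {
        filter: [
          { term: { detector_id: detectorId } },
          // Keep only results that were actually flagged as anomalous.
          { range: { anomaly_grade: { gt: 0 } } },
          // Restrict to the selected cell's time range (epoch milliseconds).
          { range: { data_end_time: { gte: startMs, lte: endMs } } },
        ],
      },
    },
  };
}

// Detector ID taken from this thread; the one-hour window is an illustrative value.
const cellQuery = buildCellResultsQuery("g61xN30Bju0JQ98M1BaN", 1637640000000, 1637643600000);
console.log(JSON.stringify(cellQuery, null, 2));
```

The printed JSON can be pasted into Dev Tools as the body of POST _plugins/_anomaly_detection/detectors/results/_search. If that returns hits for a cell that shows nothing in the UI, the problem is more likely in the display than in the data.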

To check for errors, can you try selecting a heatmap cell and waiting a few minutes while monitoring the console output in Chrome's dev tools (or another browser's equivalent)? That should tell you whether the request is timing out or whether any errors are occurring.

You can check real-time cluster-level stats using the node stats API (GET /_nodes/stats) to see if there is high pressure on the nodes or if there are failures. It sounds like the anomaly detectors themselves are working, and the issue is only in displaying the results on the webpage.
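As a rough aid for reading that response, the sketch below picks out nodes whose JVM heap or CPU usage is high. It reads only `jvm.mem.heap_used_percent` and `os.cpu.percent` from each node entry; the function name, the thresholds, and the sample values are hypothetical and are not recommendations from the plugin.

```typescript
// Minimal slice of a GET /_nodes/stats response; only the fields read below are typed.
interface NodeStats {
  name: string;
  jvm: { mem: { heap_used_percent: number } };
  os: { cpu: { percent: number } };
}

interface NodesStatsResponse {
  nodes: { [nodeId: string]: NodeStats };
}

// Returns the names of nodes whose heap or CPU usage meets or exceeds the thresholds.
// The default thresholds are illustrative only.
function findPressuredNodes(resp: NodesStatsResponse, heapPct = 85, cpuPct = 90): string[] {
  return Object.keys(resp.nodes)
    .map((id) => resp.nodes[id])
    .filter((n) => n.jvm.mem.heap_used_percent >= heapPct || n.os.cpu.percent >= cpuPct)
    .map((n) => n.name);
}

// Made-up sample values, not measurements from the cluster discussed here.
const sampleStats: NodesStatsResponse = {
  nodes: {
    a1: { name: "node-1", jvm: { mem: { heap_used_percent: 92 } }, os: { cpu: { percent: 40 } } },
    b2: { name: "node-2", jvm: { mem: { heap_used_percent: 55 } }, os: { cpu: { percent: 35 } } },
    c3: { name: "node-3", jvm: { mem: { heap_used_percent: 60 } }, os: { cpu: { percent: 97 } } },
  },
};

const pressured = findPressuredNodes(sampleStats);
console.log(pressured);
```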

prpaluch commented 2 years ago

Hi @ohltyler, thanks for the suggestions, I really appreciate your help.

These errors appear in the Chrome dev console while loading the detector and clicking on a heatmap cell. Could this be the problem we are experiencing?

The first exception is thrown when I visit the page of our configured detector:

anomalyDetectionDashboards.plugin.js:1 Uncaught (in promise) Configured indices are not found: [.opendistro-alerting-config]

And next if I click on a cell, this exception is thrown in the console:

anomalyDetectionDashboards.chunk.2.js:1 Uncaught TypeError: entityListAsString.split is not a function
    at convertToEntityList (anomalyDetectionDashboards.chunk.2.js:1)
    at EventEmitter.handleHeatmapClick (anomalyDetectionDashboards.chunk.2.js:1)
    at EventEmitter.emit (anomalyDetectionDashboards.chunk.1.js:15)
    at HTMLDivElement.plotObj.emit (anomalyDetectionDashboards.chunk.1.js:63)
    at emitClick (anomalyDetectionDashboards.chunk.1.js:63)
    at Object.click (anomalyDetectionDashboards.chunk.1.js:63)
    at Object.clickFn (anomalyDetectionDashboards.chunk.1.js:63)
    at HTMLDocument.onDone (anomalyDetectionDashboards.chunk.1.js:63)

Here are the two screenshots of the console: console1, console2

ohltyler commented 2 years ago

@prpaluch great, this is really helpful. The first exception (missing alerting config) is okay and will not prevent the plugin from working. The second exception, when clicking on a cell, seems to be where the issue is. It might be due to an NPE on entityListAsString, which is possibly behind the other errors regarding negative value widths. I'll investigate the root cause further and see what I can find.

ohltyler commented 2 years ago

Looks to be an NPE happening on this line, which means selectedEntityString (declared on this line) is returning null or undefined.

If the heatmap chart is populated with anomalies and if there are values on the y-axis, those should be persisted in the chart data, where points[0].y (the entity list) shouldn't be able to be empty. Can you confirm what the value on the y-axis is showing? Additionally, can you share what the values are when hovering your mouse over one of the cells you are trying to select?
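One detail of the error message may help narrow this down. In JavaScript, "x.split is not a function" is raised when x is defined but is not a string (for example, a number), whereas a null or undefined x produces a "Cannot read property" error instead. A defensive conversion might look like the sketch below. The function name and the delimiter are hypothetical, since the separator used for the entity list is not shown in this thread.

```typescript
// Hypothetical delimiter; the separator the plugin actually uses is not shown in this thread.
const ENTITY_DELIMITER = ", ";

// Converts a clicked cell's y value to a list of entity names without assuming it is a string.
function toEntityList(y: unknown): string[] {
  if (y === null || y === undefined) {
    return [];
  }
  // A numeric y (for example 288) has no split method, so coerce it to a string first.
  const asString = typeof y === "string" ? y : String(y);
  return asString
    .split(ENTITY_DELIMITER)
    .map((s) => s.trim())
    .filter((s) => s.length > 0);
}

const fromNumber = toEntityList(288);
const fromString = toEntityList("101, 288");
const fromMissing = toEntityList(undefined);
console.log(fromNumber, fromString, fromMissing);
```

With this shape, a numeric y value yields a one-element list rather than throwing, and a missing value yields an empty list.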

Apologies for all of the requested info. I'm unable to reproduce this myself and am trying to figure out what data could lead to this error being thrown.

To help unblock you in the meantime, you may want to look into upgrading to 1.2.0.0. A lot of the logic around the heatmap cells has been refactored, and the function that is causing the NPE here has actually been removed. Regardless, it would still be helpful to fix this bug in order to provide a patch for users on 1.1.0.0 (and potentially 1.0.0.0). Thanks!

prpaluch commented 2 years ago

Hi, sorry for the late response. Yes, the values are available on the y-axis: it shows the entity that we selected for the category field. If you hover over a square, it shows the entity name, time, anomaly grade, and occurrence. What we can see, though, is that the squares do not seem to be scaled appropriately: the height of the cells appears to be chosen randomly and is not equal.

I attach two screenshots, the first for a 7-day time range and the second for the last 24 hours. The 7-day one shows entity 288. On the y-axis you can see entities that overlap, maybe where no anomalies were detected; this does not look right and seems to be a display bug. In the 24-hour one, the entities are totally mixed up within one cell. Entity 288 is not available at all, and it shows anomalies for entity 101 that it did not show for the last 7 days.

Could it simply be that the x,y positions are not correct for a cell, and therefore it cannot find the selectedEntityString because the cells cannot be displayed properly in terms of scaling? Does this make sense? Why are the squares different in height? This does not look correct to me. I will try out version 1.2 if available. Thanks for your help.

Screenshots: hover, hover2

prpaluch commented 2 years ago

Checked version 1.2.0 and our problems are gone; it seems that the refactoring in version 1.2.0 helped a lot to make the heatmap work. Now it is possible to click on a square and get the anomalies and also the confidence graph, as expected. One problem that still exists in version 1.2.0 is the overlapping y-axis and the heatmap cells that are not scaled correctly. But for now we are happy to be able to view anomalies by clicking on the cells.

ohltyler commented 2 years ago

Thanks for the detailed information! I'm glad that 1.2 is working better for you, but it sounds like there are still issues with the y-axis scaling. Your assumption about the overlapping y-axes is probably related to the empty selectedEntityList in 1.1: I'm assuming that the aggregated data within the cell is corrupt because of this, which leads to the errors (which aren't handled properly in 1.1).

If you don't mind, could you provide some more details on the index and detector configuration? I'd like to try to reproduce locally to root cause the bug. Some useful info would be:

prpaluch commented 2 years ago

hi yes of course. We are analysing ha-proxy logs. Here is the index template for the log data. It contains the settings and mappings. You can set it up via dev tools.

POST /_template/haproxy_example
{
  "order" : 0,
  "index_patterns" : [ "haproxyus*" ],
  "settings" : {
    "index" : {
      "number_of_shards" : "6",
      "number_of_replicas" : "1"
    }
  },
  "mappings" : {
    "properties" : {
      "server_name" : { "type" : "keyword" },
      "srvconn" : { "type" : "integer" },
      "actconn" : { "type" : "integer" },
      "Ta" : { "type" : "integer" },
      "Tc" : { "type" : "integer" },
      "client_port" : { "type" : "integer" },
      "popId" : { "type" : "keyword" },
      "http_method" : { "type" : "keyword" },
      "backend_name" : { "type" : "keyword" },
      "beconn" : { "type" : "integer" },
      "hostIpAddress" : { "type" : "ip" },
      "client_ip" : { "type" : "ip" },
      "Tr" : { "type" : "integer" },
      "bytes_uploaded" : { "type" : "long" },
      "frontend_ip" : { "type" : "ip" },
      "Tw" : { "type" : "integer" },
      "http_uri" : {
        "type" : "text",
        "fields" : {
          "keyword" : { "ignore_above" : 256, "type" : "keyword" }
        }
      },
      "http_status_code" : { "type" : "integer" },
      "termination_state" : { "type" : "text" },
      "feconn" : { "type" : "integer" },
      "srv_queue" : { "type" : "integer" },
      "http_version" : { "type" : "keyword" },
      "bytes_read" : { "type" : "long" },
      "data_field" : { "type" : "keyword" },
      "frontend_port" : { "type" : "integer" },
      "backend_queue" : { "type" : "integer" },
      "retries" : { "type" : "integer" },
      "frontend_name" : { "type" : "keyword" },
      "TR" : { "type" : "integer" }
    }
  },
  "aliases" : { }
}

And here is a snippet of what the data looks like; all xxxxxxx are string values:

"hits" : [
  {
    "_index" : "haproxyus-logging-x",
    "_type" : "_doc",
    "_id" : "_lcMp30BxL-IX7Ry8qBn",
    "_score" : 1.0,
    "_source" : {
      "Ta" : 1216872296,
      "frontend_name" : "xxxxxxxxxxx",
      "Tt" : 1216872303,
      "popId" : "123",
      "Tw" : 0,
      "termination_state" : "----",
      "@version" : "1",
      "backend_name" : "xxxxxxxxxxx",
      "server_name" : "xxxxxxxxx",
      "client_port" : 65141,
      "bytes_read" : 7411465,
      "TR" : 0,
      "hostIpAddress" : "xxxxxxxxxx",
      "frontend_port" : "8080",
      "backend_queue" : 0,
      "Tc" : 0,
      "feconn" : 123615,
      "Th" : 0,
      "Tr" : 3,
      "Ti" : 7,
      "http_version" : "HTTP/1.1",
      "host" : "xxxxxxxxxxxxx",
      "retries" : 0,
      "srv_queue" : 0,
      "srvconn" : 2452,
      "actconn" : 146466,
      "http_uri" : "xxxxxxxxxxx",
      "@timestamp" : "2021-11-23T04:24:24.554Z",
      "client_ip" : "xxx.xxx.xxx.xxx",
      "beconn" : 140468,
      "bytes_uploaded" : 4708527,
      "Td" : 1216872293,
      "Tq" : 7,
      "http_status_code" : 200,
      "type" : "xxxxxxxxxxx",
      "http_method" : "CONNECT",
      "frontend_ip" : "xxx.xxx.xxx.xxx"
    }
  }
]

Here is the detector configuration:

GET _plugins/_anomaly_detection/detectors/g61xN30Bju0JQ98M1BaN
{
  "_id" : "g61xN30Bju0JQ98M1BaN",
  "_version" : 1,
  "_primary_term" : 1,
  "_seq_no" : 0,
  "anomaly_detector" : {
    "name" : "ha_proxy_rate_detector",
    "description" : "check the index rate for pops",
    "time_field" : "@timestamp",
    "indices" : [ "haproxyus-logging-*" ],
    "filter_query" : { "match_all" : { "boost" : 1.0 } },
    "detection_interval" : { "period" : { "interval" : 6, "unit" : "Minutes" } },
    "window_delay" : { "period" : { "interval" : 0, "unit" : "Minutes" } },
    "shingle_size" : 8,
    "schema_version" : 0,
    "feature_attributes" : [
      {
        "feature_id" : "gq1xN30Bju0JQ98MthYl",
        "feature_name" : "count_documents",
        "feature_enabled" : true,
        "aggregation_query" : {
          "count_documents" : { "value_count" : { "field" : "http_status_code" } }
        }
      }
    ],
    "ui_metadata" : {
      "features" : {
        "count_documents" : {
          "aggregationBy" : "value_count",
          "aggregationOf" : "http_status_code",
          "featureType" : "simple_aggs"
        }
      },
      "filters" : [ ]
    },
    "last_update_time" : 1637312746635,
    "category_field" : [ "popId" ],
    "user" : {
      "name" : "admin",
      "backend_roles" : [ "admin" ],
      "roles" : [ "own_index", "all_access" ],
      "custom_attribute_names" : [ ],
      "user_requested_tenant" : "user"
    },
    "detector_type" : "MULTI_ENTITY"
  }
}

If you need more information, I am glad to help.

ohltyler commented 2 years ago

@prpaluch I'm able to reproduce the issue on 1.2 after setting up a similar local environment (similar indices, detector config). Will respond back once I can dive deeper into the root cause.

My tentative assumption is that there is some conflict with the values of the y-axis, where the Plotly heatmap reads them as numerical rather than strictly string/keyword, since the y-axis values always seem to be arranged in descending order no matter how the chart is filtered (by severity/occurrence, top 10/20/30, etc.). They also seem to be spaced according to their values: for example, the gaps are larger between numbers that are numerically farther apart (see 84 and 115), and smaller between numbers that are numerically close (see 162 and 163, which are overlapping):

Screen Shot 2021-12-22 at 4 10 03 PM

ohltyler commented 2 years ago

Found out Plotly will automatically try to determine the data type based on the axis data given. In this case, it looks like it is labeling it as "linear" by default. I've tested a change of specifying the "type" as "category", and it now looks to show properly:

Screen Shot 2022-01-03 at 8 04 08 AM

I'll work on a patch for this and will update this issue with the latest progress.
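A minimal sketch of that kind of change, written as plain Plotly figure data, is shown below. The entity IDs are the ones mentioned earlier in this thread; the x values and the z grid are placeholders, and the variable names are hypothetical rather than taken from the plugin or from the eventual patch.

```typescript
// Entity IDs from the thread, kept as strings because popId is mapped as a keyword.
const entityIds: string[] = ["84", "115", "162", "163"];

// Heatmap trace in Plotly's JSON form; the z values are placeholders, not real anomaly grades.
const heatmapTrace = {
  type: "heatmap",
  x: ["t0", "t1"],
  y: entityIds,
  z: entityIds.map(() => [0, 0]), // one row per entity, one column per time bucket
};

// Forcing a category axis stops Plotly from inferring a linear (numeric) axis
// from numeric-looking strings, so every entity gets an evenly sized row.
const heatmapLayout = {
  yaxis: { type: "category" },
};

console.log(JSON.stringify({ data: [heatmapTrace], layout: heatmapLayout }));
```

These two objects would be passed to Plotly as the trace list and the layout, for example via Plotly.newPlot. With a linear axis, 162 and 163 sit one unit apart on a scale that spans 84 to 163, which matches the overlap described above.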

ohltyler commented 2 years ago

Heatmap chart axes fixed as part of #167. Will still leave this issue open for tracking the root cause of occasional empty/missing data in the cells.