grafana / opensearch-datasource

Apache License 2.0

Visualizations in grafana end up hitting max buckets but the same visualization in kibana works fine #426

Open rkarthikr opened 4 months ago

rkarthikr commented 4 months ago

What happened: We've been working through the max bucket error in Grafana with an OpenSearch datasource. Initially I thought the issue was with OpenSearch; however, we have bumped the max bucket limit to 65536 and we still mostly see this error (some aggregations now work, but most hit the limit and fail). To compare, I recreated the same simple visualization in Kibana (or whatever the OpenSearch equivalent is called) and it generates the visualization quickly with no errors. I suspect the OpenSearch plugin is building its queries differently than Kibana, causing it to hit this limit even with a high setting for the limit.
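For context, raising the limit the way described above goes through the OpenSearch cluster settings API. A minimal sketch of the request body, assuming the standard `PUT /_cluster/settings` endpoint and the 65536 value mentioned:

```python
import json

# Sketch of the cluster settings body used to raise search.max_buckets.
# Assumption: this is sent as PUT /_cluster/settings to the OpenSearch
# endpoint; 65536 is the value mentioned above (also OpenSearch's default).
settings_body = {"persistent": {"search.max_buckets": 65536}}
payload = json.dumps(settings_body)
print(payload)
```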

[Screenshot: max buckets error shown in the Grafana panel]

What you expected to happen: visualizations to work without hitting max buckets

How to reproduce it (as minimally and precisely as possible): Create a simple aggregation in Grafana using an OpenSearch datasource.

Anything else we need to know?:

Environment:

kevinwcyu commented 3 months ago

Hi @rkarthikr, I tried running the query shown in your screenshot but wasn't able to reproduce the error. Can you show what is in the query object by opening the Query Inspector and clicking the Query tab? The query will be listed in data.queries.

rkarthikr commented 3 months ago
{
  "traceId": "50384ed94095e8fe6eedfee4c020957a",
  "request": {
    "url": "api/ds/query?ds_type=grafana-opensearch-datasource&requestId=explore_o6v",
    "method": "POST",
    "data": {
      "queries": [
        {
          "refId": "A",
          "datasource": {
            "type": "grafana-opensearch-datasource",
            "uid": "ads67lnsevj0gd"
          },
          "query": "*",
          "queryType": "lucene",
          "alias": "",
          "metrics": [
            {
              "type": "count",
              "id": "1"
            }
          ],
          "bucketAggs": [
            {
              "type": "date_histogram",
              "id": "2",
              "settings": {
                "interval": "auto"
              },
              "field": "startTime"
            }
          ],
          "format": "table",
          "timeField": "startTime",
          "luceneQueryType": "Traces",
          "datasourceId": 12,
          "intervalMs": 60000,
          "maxDataPoints": 1515
        }
      ],
      "from": "1722177213354",
      "to": "1722180813354"
    },
    "hideFromInspector": false
  },
  "response": {
    "message": "An error occurred within the plugin",
    "messageId": "plugin.downstreamError",
    "statusCode": 500,
    "traceID": "50384ed94095e8fe6eedfee4c020957a"
  }
}
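As a rough sanity check (an assumption on my part: a date_histogram produces roughly one bucket per interval step across the time range), the inspector output above implies only about 60 top-level buckets, far below a 65536 limit. That would suggest the limit is being hit by something other than this histogram alone, for example nested aggregations added by the Traces query:

```python
# Values copied from the inspector output above (millisecond epochs).
frm, to = 1722177213354, 1722180813354
interval_ms = 60000

# Approximate date_histogram bucket count: one bucket per interval step.
buckets = (to - frm) // interval_ms
print(buckets)  # 60 — far below a search.max_buckets of 65536
```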
rkarthikr commented 3 months ago

@kevinwcyu - Any updates on this ?

iwysiu commented 3 months ago

Hi @rkarthikr ! I've been investigating this. I haven't been able to reproduce it, but I have found some differences between the query that the opensearch dashboard runs and the one we create, and we'll continue to investigate why those differences exist and whether they affect performance.

rkarthikr commented 3 months ago

I will reach out to you in Grafana Community Slack.

idastambuk commented 3 months ago

Hi @rkarthikr! You mention that you're getting max_buckets for this query, but I only see the plugin.downstreamError error. How did you discover this is a max buckets error and not an error in the plugin code? Thanks!

rkarthikr commented 3 months ago

Saw the error in the OpenSearch logs. I tried increasing the max buckets config on the OpenSearch end and I no longer get that error, but I still get the plugin.downstreamError error with no additional details.

Please let me know; happy to walk you through the demo environment so you can use it to collect data for further troubleshooting.

idastambuk commented 3 months ago

Hi @rkarthikr, it would be super helpful to get a step by step on how to set up a similar environment, since it seems like our backend might be running into errors with the data itself. Thanks a lot!

rkarthikr commented 2 months ago
  1. Demo Application - https://github.com/open-telemetry/opentelemetry-demo/tree/main/kubernetes. Deployed the application listed there into an EKS cluster
  2. Updated the OTEL config to send traces to OpenSearch
  3. Set up the OpenSearch datasource in Grafana
  4. In Explore with that datasource, query trace data for a range > 5 min and see the error
rkarthikr commented 2 months ago

Please let me know if there is any way to enable Grafana logs that would help you troubleshoot this further. I am using the Grafana Cloud demo environment for this.

superstes commented 2 months ago

I saw the same error while trying to explore data for my new project (local Docker setup, opensearchproject/opensearch:2 and grafana/grafana:11.1.4). Only a few hundred messages were enough to produce the max buckets error.

kevinwcyu commented 2 months ago

Hi @rkarthikr, could you share the visualization from OpenSearch Dashboards (Kibana) that works? With the demo application, I still haven't been able to trigger an error related to the max bucket limit, but I do get the same error shown in the screenshot in the description when I perform a trace query.

I think the plugin.downstreamError error might potentially be fixed by https://github.com/grafana/opensearch-datasource/pull/445, while we still have to try to figure out what is causing the max bucket error.

yotamN commented 2 months ago

Could it be the interval setting? I'm getting the same error sometimes (also with AWS OpenSearch) when setting the interval to auto but when I set it manually to a bigger number it works fine.
I can also see visually that the interval behavior is a bit different between Grafana and Kibana.

kevinwcyu commented 1 month ago

> Could it be the interval setting? I'm getting the same error sometimes (also with AWS OpenSearch) when setting the interval to auto but when I set it manually to a bigger number it works fine. I can also see visually that the interval behavior is a bit different between Grafana and Kibana.

Hi @yotamN, there isn't an option to set the interval for Traces queries, so I just wanted to clarify whether you are running a Traces query as shown in the issue description or a Metric query?

yotamN commented 1 month ago

> Could it be the interval setting? I'm getting the same error sometimes (also with AWS OpenSearch) when setting the interval to auto but when I set it manually to a bigger number it works fine. I can also see visually that the interval behavior is a bit different between Grafana and Kibana.

> Hi @yotamN, there isn't an option to set the interval for Traces queries, so I just wanted to clarify whether you are running a Traces query as shown in the issue description or a Metric query?

On second look, I think my error description was a bit off; please tell me if this is still relevant, since I still get the same error in the OpenSearch logs.

I set the interval to a constant number (since there isn't a way to set a minimum interval instead), and when I query a large time range I get this error because there are too many buckets.

kevinwcyu commented 1 month ago

Hi @yotamN, we've seen the max bucket error for Metric queries in the past and we usually recommend adjusting the search.max_buckets setting in OpenSearch, but adjusting the interval is another way of tweaking the query to avoid hitting the error as well.
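The interval workaround mentioned above can be made concrete. Under the same rough assumption as before (bucket count is approximately the time range divided by the interval), the smallest fixed interval that stays under search.max_buckets is a ceiling division; `min_interval_ms` is a hypothetical helper, not part of the plugin:

```python
def min_interval_ms(range_ms: int, max_buckets: int) -> int:
    """Smallest fixed interval (ms) keeping the approximate
    date_histogram bucket count at or below max_buckets."""
    return -(-range_ms // max_buckets)  # ceiling division

# Example: a 30-day range against the default search.max_buckets of 65536.
range_ms = 30 * 24 * 60 * 60 * 1000
print(min_interval_ms(range_ms, 65536))  # roughly a 40-second interval
```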

Since you mentioned you were setting the interval I just wanted to clarify if you were running a Metric query or a Traces query (like the one shown in the original issue description) because we haven't been able to reproduce the max bucket error for Traces queries yet. If it was a Traces query it would be good to get an example query to help us reproduce it.