grafana / grafana

The open and composable observability and data visualization platform. Visualize metrics, logs, and traces from multiple sources like Prometheus, Loki, Elasticsearch, InfluxDB, Postgres and many more.
https://grafana.com
GNU Affero General Public License v3.0

Data Sets: Performance Issue with Large Data Sets #92044

Open yzengin opened 3 months ago

yzengin commented 3 months ago

What happened?

When using Grafana with large data sets, I've noticed that dashboards become slow, and graphs take a long time to load. This is especially noticeable when querying Prometheus as a data source. The response times for queries significantly increase, negatively impacting the user experience. Additionally, memory usage spikes when handling large volumes of data. Could there be optimizations to improve performance in such scenarios?

What did you expect to happen?

I expected the dashboards to load quickly and the graphs to be responsive even with large data sets. The performance should be consistent, regardless of the data size.

Did this work before?

Yes, previous versions of Grafana worked more smoothly with similar data sets, but I’ve noticed the slowdown with the latest version.

How do we reproduce it?

Steps to Reproduce:

  1. Create a dashboard with a large data set.
  2. Query Prometheus as the data source.
  3. Observe the query response times and memory usage.

Is the bug inside a dashboard panel?

Yes, the issue occurs within the dashboard panels where large data sets are visualized.

Environment (with versions)?

Grafana: 11.1.4
OS: Ubuntu 20.04
Browser: Google Chrome 127.0.6533.120

Grafana platform?

A package manager (APT, YUM, BREW, etc.)

Datasource(s)?

Prometheus 2.40.0

tonypowa commented 2 months ago

hello @yzengin

Could you share a minimal Git repository where the issue can be reproduced? Also please provide any error messages or logs you're seeing.

Thank you

NWRichmond commented 1 month ago

@yzengin a video would help us understand your experience. Is this something you can provide?

robhamnett commented 5 days ago

This is happening on our instance as well. We're running 11.3, and ever since the last update it struggles to load and we're getting complaints from end users.

NWRichmond commented 4 days ago

@robhamnett if you're able to provide us with information about your current setup (see the Environment section above) and previous setup (before you experienced perf issues), that would be very helpful.

Any further details you can provide will help us, too. For example, does the dashboard in question only rely on the Prometheus data source?

robhamnett commented 4 days ago

We were running 11.2.2 and updated to 11.3.
OS: Container-Optimized OS v113
Agent: VictoriaMetrics v1.106.0

NWRichmond commented 4 days ago

@robhamnett thank you for the quick reply. One significant change in Grafana 11.3 is that Scenes-powered Dashboards are generally available. As a troubleshooting step, could you try adding &scenes=false to the dashboard's URL? If that results in a performance change, that would be helpful info for us :)
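For anyone applying this troubleshooting step to several dashboards, here is a minimal Python sketch of appending the parameter to an existing dashboard URL. The host and dashboard path are made-up placeholders, and the helper name `with_scenes_disabled` is hypothetical; only the `scenes=false` parameter comes from the comment above.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def with_scenes_disabled(dashboard_url: str) -> str:
    """Return dashboard_url with scenes=false set in its query string."""
    parts = urlsplit(dashboard_url)
    # Keep the existing parameters, dropping any earlier "scenes" value.
    params = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key != "scenes"
    ]
    params.append(("scenes", "false"))
    return urlunsplit(parts._replace(query=urlencode(params)))


# Hypothetical dashboard URL, used only for illustration.
print(with_scenes_disabled("https://grafana.example.com/d/abc123/my-dashboard?orgId=1"))
```

If the URL has no query string yet, the parameter is added with `?` rather than `&`.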

robhamnett commented 2 days ago

That indeed seemed to have helped.

NWRichmond commented 2 days ago

Great to know, thanks for confirming @robhamnett. I think we should create a new issue to capture your experience, as it's a different situation than the one described by the author of this issue.

NWRichmond commented 1 day ago

@yzengin I wonder if this could be related to a change in Grafana 11 (https://github.com/grafana/grafana/pull/84778), where we default to using the Label Values endpoint over the Series endpoint. The Label Values endpoint can be unacceptably slow sometimes, for reasons that aren't clear yet (see https://github.com/prometheus/prometheus/issues/14551).

If that's indeed the case, I believe we can work around your performance issue. But first, we'd really appreciate it if you could capture the performance issue in a HAR file, so we can examine the request timings. If privacy is an issue, please see https://github.com/grafana/grafana/issues/95370#issuecomment-2459800486 to learn how to sanitize the HAR file.
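As a rough illustration of what examining the request timings can look like, the sketch below ranks the entries of a HAR file by total elapsed time. It relies only on the standard HAR layout (`log.entries[].time`, `timings.wait`, and `request.url`, all in milliseconds). The function name, the inline sample data, and the URLs in it are invented for the example and are not taken from a real capture.

```python
import json


def slowest_requests(har, top=5, url_contains=""):
    """Return (total_ms, wait_ms, url) tuples for the slowest matching requests."""
    rows = []
    for entry in har["log"]["entries"]:
        url = entry["request"]["url"]
        if url_contains not in url:
            continue
        # "wait" is the time spent waiting for the server's first byte.
        wait_ms = entry.get("timings", {}).get("wait", -1)
        rows.append((entry["time"], wait_ms, url))
    rows.sort(key=lambda row: row[0], reverse=True)
    return rows[:top]


# Made-up sample standing in for a real capture. With a real file you would
# load it instead, e.g. json.load(open("dashboard.har", encoding="utf-8")).
sample_har = json.loads("""
{"log": {"entries": [
  {"time": 350.0, "timings": {"wait": 300.0},
   "request": {"url": "https://grafana.example.com/api/ds/query"}},
  {"time": 4200.0, "timings": {"wait": 4100.0},
   "request": {"url": "https://grafana.example.com/api/example/label/job/values"}},
  {"time": 90.0, "timings": {"wait": 10.0},
   "request": {"url": "https://grafana.example.com/public/build/app.js"}}
]}}
""")

for total_ms, wait_ms, url in slowest_requests(sample_har, url_contains="/api/"):
    print(f"{total_ms:8.1f} ms total  {wait_ms:8.1f} ms waiting  {url}")
```

A request whose waiting time accounts for nearly all of its total time points at a slow server-side response rather than at download or rendering cost.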

Next, let's see if using the Series endpoint improves the performance of your dashboard. In your Prometheus data source configuration (specifically, the Performance section), be sure that the Prometheus type & version are set. Because your Prometheus version is 2.40.0, you have two options:

  1. Once Grafana 11.3.2 is released, upgrade to this version, which will provide a new Use series endpoint toggle. Enable this toggle.
  2. If you'd prefer to stay on Grafana version 11.1.4, you can set the Prometheus data source config's version to anything below 2.24.x. As described in #84778, this will enforce that the Series endpoint is used instead of the Label Values endpoint.
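To make the difference between the two endpoints concrete, here is a small Python sketch that builds both requests against the Prometheus HTTP API and shows how label values can be derived from a Series response. It illustrates the public Prometheus API only and is not a description of how Grafana issues these requests internally. The base URL, the `up` selector, the timestamps, and the helper names are assumptions chosen for the example.

```python
from urllib.parse import quote, urlencode


def label_values_url(base, label, matcher, start, end):
    """URL for the Label Values endpoint: /api/v1/label/<name>/values."""
    query = urlencode({"match[]": matcher, "start": start, "end": end})
    return f"{base}/api/v1/label/{quote(label, safe='')}/values?{query}"


def series_url(base, matcher, start, end):
    """URL for the Series endpoint: /api/v1/series."""
    query = urlencode({"match[]": matcher, "start": start, "end": end})
    return f"{base}/api/v1/series?{query}"


def label_values_from_series(series_response, label):
    """Collect the distinct values of one label from a Series response body."""
    return sorted({s[label] for s in series_response["data"] if label in s})


# Hypothetical local Prometheus and a one-minute window of Unix timestamps.
base = "http://localhost:9090"
print(label_values_url(base, "job", "up", 0, 60))
print(series_url(base, "up", 0, 60))

# Made-up response in the shape the Series endpoint returns.
fake_series = {
    "status": "success",
    "data": [
        {"__name__": "up", "job": "node"},
        {"__name__": "up", "job": "api"},
        {"__name__": "up", "job": "node"},
    ],
}
print(label_values_from_series(fake_series, "job"))
```

Timing both URLs against the same Prometheus instance (for example with `curl -w '%{time_total}'`) is one way to check whether the Label Values endpoint is the slow one in a given setup.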

I hope this helps! We look forward to hearing your results :)

melroy89 commented 1 day ago

> This is happening on our instance as well running 11.3 ever since the last update, it struggles to load, complaints from end users etc.

Grafana graphs and InfluxDB queries have also been causing huge issues for me since Grafana v11.3 (I'm now running v11.3.1). The data loads very slowly, and after several graphs have loaded, Grafana fails to run more queries and slowly grinds to a halt. The network tab also showed NS_BINDING_ABORTED on the api/ds/query API endpoint (InfluxDB in my case).

Downgrading to Grafana v11.2 solved all the performance issues, and the queries are very fast again. All the graphs load all their data without any issues after the downgrade.

So it's definitely not just Prometheus; InfluxDB v1 queries are affected as well.

yzengin commented 22 hours ago

@NWRichmond

Thank you for the clear explanation and the suggested steps. I’ve implemented the Use series endpoint toggle and adjusted the Prometheus data source configuration as you described. The performance issue has been resolved, and everything is functioning as expected now.

Your guidance was precise and effective in addressing the problem. I'll ensure these adjustments are considered for similar scenarios in the future.

Best regards, @yzengin