grafana / grafana

The open and composable observability and data visualization platform. Visualize metrics, logs, and traces from multiple sources like Prometheus, Loki, Elasticsearch, InfluxDB, Postgres and many more.

High CPU Usage in Grafana when Using GROUP BY with InfluxDB SQL Queries #85429

Open ap-rose opened 8 months ago

ap-rose commented 8 months ago

What happened?

We have encountered a significant performance issue in Grafana when executing queries that use the GROUP BY clause against a remote InfluxDB data source. Specifically, when such queries are run, the CPU usage on the Grafana server spikes to 100%, severely impacting the responsiveness and functionality of the Grafana instance.

This performance degradation is observed even when the query results amount to only around 20,000 rows in the table view, suggesting that the issue is not caused by an excessive volume of data being returned.

Profiling Data: We have conducted profiling on the Grafana server during the execution of such queries, and the following are some of the notable findings:

| Function | Self Time (ms) | Self Time (%) | Cumulative Time (ms) | Cumulative Time (%) |
| --- | --- | --- | --- | --- |
| runtime.findObject | 2760 | 14.29 | 4410 | 22.83 |
| runtime.scanobject | 2520 | 13.04 | 9150 | 47.36 |
| runtime.memclrNoHeapPointers | 2080 | 10.77 | 2080 | 10.77 |
| runtime.(*mspan).base (inline) | 1100 | 5.69 | 1100 | 5.69 |
| runtime.greyobject | 900 | 4.66 | 2610 | 13.51 |
| runtime.bulkBarrierPreWriteSrcOnly | 790 | 4.09 | 4700 | 24.33 |
| runtime.spanOf (inline) | 770 | 3.99 | 910 | 4.71 |
| runtime.wbBufFlush1 | 730 | 3.78 | 3090 | 15.99 |
| runtime.heapBitsForAddr | 700 | 3.62 | 720 | 3.73 |
| runtime.heapBits.next | 640 | 3.31 | 800 | 4.14 |
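
The hot functions above are all Go garbage-collector internals (object scanning, marking, write barriers), which usually points to heavy heap allocation rather than the query evaluation itself. Below is a minimal, self-contained sketch (plain Go, not Grafana code; the `row` type and the `mac` tag values are placeholders invented for this example) that produces a similarly GC-dominated CPU profile by buffering many small heap-allocated rows per GROUP BY series, and shows how such a profile can be captured with `runtime/pprof`:

```go
// Illustrative sketch only: simulates per-series row buffering to show how a
// GC-dominated profile (runtime.scanobject, runtime.findObject, ...) can arise.
// Not Grafana code; "row" and the "mac" tag values are placeholders.
package main

import (
	"fmt"
	"os"
	"runtime/pprof"
)

type row struct {
	mac   string
	value float64
}

func main() {
	f, err := os.Create("cpu.pprof")
	if err != nil {
		panic(err)
	}
	defer f.Close()

	// Capture a CPU profile; inspect afterwards with: go tool pprof cpu.pprof
	if err := pprof.StartCPUProfile(f); err != nil {
		panic(err)
	}
	defer pprof.StopCPUProfile()

	// Buffer millions of tiny heap-allocated rows, grouped by tag value,
	// roughly the shape of a GROUP BY "mac" result set held in memory.
	series := make(map[string][]row)
	for i := 0; i < 2_000_000; i++ {
		mac := fmt.Sprintf("mac-%03d", i%500)
		series[mac] = append(series[mac], row{mac: mac, value: float64(i)})
	}
	fmt.Println("series buffered:", len(series))
}
```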

We are seeking assistance in diagnosing and resolving this issue, as it severely impacts the usability of Grafana for monitoring and visualization tasks involving data from a remote InfluxDB server. Any insights, suggestions, or fixes would be greatly appreciated.

What did you expect to happen?

CPU usage remains at a manageable level, ensuring that the Grafana server stays responsive and functional.

Did this work before?

No such performance degradation was observed with Grafana v8.3.3; the issue appears to have been introduced somewhere in the updates between v8.3.3 and v10.4.1, since the performance problems only started after upgrading to v10.4.1.

How do we reproduce it?

  1. Connect Grafana to a remote InfluxDB data source.
  2. Create a new dashboard and panel.
  3. Enter a query using the GROUP BY clause, similar to the following examples: `SELECT * FROM "autogen"."ClientData" WHERE $timeFilter GROUP BY "mac"` or `SELECT * FROM "autogen"."ClientData" WHERE $timeFilter GROUP BY "ip"`
  4. Observe the CPU usage of the Grafana server.

Is the bug inside a dashboard panel?

No response

Environment (with versions)?

- Grafana: v10.4.1 (d94d597d84)
- InfluxDB: 1.8.10 (note: the InfluxDB server is remote)
- OS: Ubuntu 22.04.1
- Browser: Chrome 121.0.6167.187 (Official Build) (64-bit)

Grafana platform?

Docker

Datasource(s)?

No response

aangelisc commented 2 months ago

Hi @ap-rose,

Apologies for the delayed response. Are you able to share details of your data source configuration? Also, do you see the same behaviour if you query using Flux? Finally, could you share an example of your schema so that we can easily attempt to replicate this behaviour?

justinsmalley commented 2 months ago

I have observed a similar issue around this version change with respect to memory usage that I suspect has a similar root cause.

Previously, our OSS Grafana Docker instance ran fine with <200 MB of memory regardless of how much data our queries returned. Since approximately v10.4.1, we now need 5 GB or more. I suspect that some sort of data buffering was added during the substantial refactor or rewrite of the InfluxDB datasource plugin, whereas previously it appeared to simply pass the data through to the client. This buffering would account for both the memory and CPU increases.
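
To illustrate the buffering hypothesis, here is a hand-written sketch (not taken from the InfluxDB datasource plugin; the `Row` type and both function names are invented for this comment) contrasting collecting every decoded row in memory before building a response with handling rows as they arrive. In the buffered approach, live memory and garbage-collection work scale with the size of the result set, which would explain both the memory and CPU growth described above.

```go
// Sketch contrasting the two strategies described in this comment. Not from
// the actual plugin code; types and function names are hypothetical.
package main

import "fmt"

// Row stands in for one decoded InfluxDB result row.
type Row struct {
	Time  int64
	Value float64
}

// buffered collects every row before doing anything with it, so memory and
// garbage-collection cost grow with the size of the query result.
func buffered(rows <-chan Row) []Row {
	all := make([]Row, 0)
	for r := range rows {
		all = append(all, r) // repeated re-allocation as the slice grows
	}
	return all
}

// streaming handles each row as it arrives and keeps almost nothing live,
// matching the "passthrough" behaviour observed in older versions.
func streaming(rows <-chan Row, handle func(Row)) {
	for r := range rows {
		handle(r)
	}
}

func main() {
	// Streaming: each row is handled and becomes garbage immediately.
	rows := make(chan Row)
	go func() {
		for i := 0; i < 3; i++ {
			rows <- Row{Time: int64(i), Value: float64(i)}
		}
		close(rows)
	}()
	streaming(rows, func(r Row) { fmt.Println("streamed:", r.Time, r.Value) })

	// Buffered: the whole result set stays live until the response is built.
	rows2 := make(chan Row)
	go func() {
		for i := 0; i < 3; i++ {
			rows2 <- Row{Time: int64(i), Value: float64(i)}
		}
		close(rows2)
	}()
	fmt.Println("rows held in memory:", len(buffered(rows2)))
}
```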

We are even seeing Grafana crash on especially large queries that return row counts in the millions unless the container memory is bumped way up. Returning this much data is not a common use case, but again, we could easily run these kinds of queries in the past with almost zero memory usage on the Grafana server.