grafana / grafana

The open and composable observability and data visualization platform. Visualize metrics, logs, and traces from multiple sources like Prometheus, Loki, Elasticsearch, InfluxDB, Postgres and many more.
https://grafana.com

High CPU Usage in Grafana when Using GROUP BY with InfluxDB SQL Queries #85429

Open ap-rose opened 7 months ago

ap-rose commented 7 months ago

What happened?

We have encountered a significant performance issue in Grafana when executing queries that use the GROUP BY clause against a remote InfluxDB data source. Specifically, when such queries are run, the CPU usage on the Grafana server spikes to 100%, severely impacting the responsiveness and functionality of the Grafana instance.

This performance degradation is observed even when the query results represent only approximately 20,000 rows in the table view, suggesting that the issue is not due to an excessive volume of data being returned.

Profiling Data: We have conducted profiling on the Grafana server during the execution of such queries, and the following are some of the notable findings:

| Function | Self Time (ms) | Self Time (%) | Cumulative Time (ms) | Cumulative Time (%) |
|---|---|---|---|---|
| runtime.findObject | 2760 | 14.29 | 4410 | 22.83 |
| runtime.scanobject | 2520 | 13.04 | 9150 | 47.36 |
| runtime.memclrNoHeapPointers | 2080 | 10.77 | 2080 | 10.77 |
| runtime.(*mspan).base (inline) | 1100 | 5.69 | 1100 | 5.69 |
| runtime.greyobject | 900 | 4.66 | 2610 | 13.51 |
| runtime.bulkBarrierPreWriteSrcOnly | 790 | 4.09 | 4700 | 24.33 |
| runtime.spanOf (inline) | 770 | 3.99 | 910 | 4.71 |
| runtime.wbBufFlush1 | 730 | 3.78 | 3090 | 15.99 |
| runtime.heapBitsForAddr | 700 | 3.62 | 720 | 3.73 |
| runtime.heapBits.next | 640 | 3.31 | 800 | 4.14 |
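The function names above look like a Go CPU profile (pprof-style). As a rough sketch for anyone wanting to capture a comparable profile while one of these queries runs, assuming the Grafana server has Go profiling enabled (the host, port, and endpoint below are assumptions and depend on how profiling is configured), a 30-second CPU profile can be downloaded like this and then opened with `go tool pprof`:

```go
// fetch_profile.go - minimal sketch for grabbing a 30s CPU profile from a Go
// service that exposes the standard net/http/pprof endpoints. The address is
// an assumption; Grafana only serves this when started with profiling enabled.
package main

import (
	"io"
	"log"
	"net/http"
	"os"
)

func main() {
	// Standard net/http/pprof CPU profile endpoint; "seconds" controls duration.
	resp, err := http.Get("http://localhost:6060/debug/pprof/profile?seconds=30")
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()

	// Save the raw profile to disk for later inspection with `go tool pprof`.
	out, err := os.Create("grafana_cpu.pprof")
	if err != nil {
		log.Fatal(err)
	}
	defer out.Close()

	if _, err := io.Copy(out, resp.Body); err != nil {
		log.Fatal(err)
	}
}
```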

We are seeking assistance in diagnosing and resolving this issue, as it severely impacts the usability of Grafana for monitoring and visualization tasks involving data from a remote InfluxDB server. Any insights, suggestions, or fixes would be greatly appreciated.

What did you expect to happen?

CPU usage should remain at a manageable level, ensuring that the Grafana server stays responsive and functional.

Did this work before?

No such performance degradation was observed with Grafana v8.3.3. The issue appears to have been introduced somewhere in the updates between v8.3.3 and v10.4.1, as the performance problems began only after upgrading.

How do we reproduce it?

  1. Connect Grafana to a remote InfluxDB data source.
  2. Create a new dashboard and panel.
  3. Enter a query using the GROUP BY clause, similar to the following examples: `SELECT * FROM "autogen"."ClientData" WHERE $timeFilter GROUP BY "mac"` or `SELECT * FROM "autogen"."ClientData" WHERE $timeFilter GROUP BY "ip"`
  4. Observe the CPU usage of the Grafana server.

Is the bug inside a dashboard panel?

No response

Environment (with versions)?

Grafana: v10.4.1 (d94d597d84)
InfluxDB: 1.8.10 (Note: the InfluxDB server is remote.)
OS: Ubuntu 22.04.1
Browser: Chrome 121.0.6167.187 (Official Build) (64-bit)

Grafana platform?

Docker

Datasource(s)?

No response

aangelisc commented 1 month ago

Hi @ap-rose,

Apologies for the delayed response. Are you able to share details of your data source configuration? Also, do you see the same behaviour if you query using Flux? Finally, could you share an example of your schema so that we can more easily attempt to replicate this behaviour?

justinsmalley commented 1 month ago

I have observed a similar issue with memory usage around this same version change, and I suspect it has the same root cause.

Previously, our OSS Grafana Docker instance ran fine with <200 MB of memory regardless of how much data our queries returned. Since approximately 10.4.1, we now need 5 GB or more. I suspect that some sort of data buffering was added during the substantial refactor or rewrite of the InfluxDB datasource plugin, whereas previously it appeared to simply pass the data through to the client. Such buffering would account for both the memory and the CPU increases.

We are even seeing Grafana crash on especially large queries that return row counts in the millions unless the container memory is bumped way up. Returning this much data is not a common use case, but again, we could easily run these kinds of queries in the past with almost zero memory usage on the Grafana server.
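To make the buffering hypothesis above concrete, here is a minimal Go sketch of the two patterns. It is purely illustrative and not taken from the Grafana InfluxDB plugin; the type and function names are made up:

```go
// buffering_sketch.go - conceptual illustration (NOT actual Grafana code) of
// why collecting a whole result set in memory costs far more heap and GC time
// than forwarding rows as they arrive.
package main

import "fmt"

// row stands in for one InfluxDB result row.
type row struct {
	tag   string
	value float64
}

// buffered collects every row into a slice before handing it on; with row
// counts in the millions this is the pattern that would drive heap growth
// and the GC scanning seen in the CPU profile above.
func buffered(src <-chan row) []row {
	var all []row
	for r := range src {
		all = append(all, r)
	}
	return all
}

// streamed forwards each row to the consumer as soon as it arrives,
// keeping memory usage roughly flat regardless of result size.
func streamed(src <-chan row, sink func(row)) {
	for r := range src {
		sink(r)
	}
}

func main() {
	src := make(chan row)
	go func() {
		for i := 0; i < 3; i++ {
			src <- row{tag: "mac", value: float64(i)}
		}
		close(src)
	}()
	// Demonstrate the streaming path; the buffered path above is the one we
	// suspect the newer plugin effectively takes.
	streamed(src, func(r row) { fmt.Println(r.tag, r.value) })
}
```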