ap-rose opened 8 months ago
Hi @ap-rose,
Apologies for the delayed response. Are you able to share details of your data source configuration? Also, do you see the same behaviour if you query using Flux? Could you also share an example of your schema so that we can easily attempt to replicate this behaviour?
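If it helps, the sort of detail that makes this easy to replicate can usually be pulled with standard InfluxQL metadata queries. Assuming the "ClientData" measurement from the reproduction queries below, something along these lines would be ideal:

SHOW TAG KEYS FROM "ClientData"
SHOW FIELD KEYS FROM "ClientData"
SHOW SERIES EXACT CARDINALITY FROM "ClientData"
SHOW RETENTION POLICIES

The series cardinality is particularly relevant, since an InfluxQL GROUP BY on a tag such as "mac" returns one series per distinct tag value.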
I have observed a similar issue around this version change with respect to memory usage that I suspect has a similar root cause.
Previously, our OSS Grafana Docker instance ran fine with <200 MB of memory regardless of how much data our queries returned. Since approximately v10.4.1, we now need 5 GB or more. I suspect that some sort of data buffering was added during the substantial refactor or rewrite of the InfluxDB datasource plugin, whereas previously it appeared to simply pass the data through to the client. This buffering would account for both the memory and CPU increases.
We are even seeing Grafana crash on especially large queries that return row counts in the millions unless the container memory is bumped way up. Returning this much data is not a common use case, but again, we could easily run these kinds of queries in the past with almost zero memory usage on the Grafana server.
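One straightforward way to observe the memory side of this is to watch docker stats for the Grafana container while one of the large GROUP BY queries runs, and to raise the container limit when testing. For example (container name and image tag are illustrative only):

docker stats grafana
docker run -d --name grafana -p 3000:3000 --memory=6g grafana/grafana-oss:10.4.1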
What happened?
We have encountered a significant performance issue in Grafana when executing queries that use the GROUP BY clause against a remote InfluxDB data source. Specifically, when such queries are run, the CPU usage on the Grafana server spikes to 100%, severely impacting the responsiveness and functionality of the Grafana instance.
This performance degradation is observed even when the query results only amount to approximately 20,000 rows in the table view, suggesting that the issue is not caused by an excessive volume of data being returned.
runtime.findObject
runtime.scanobject
runtime.memclrNoHeapPointers
runtime.(*mspan).base
runtime.greyobject (inline)
runtime.bulkBarrierPreWriteSrcOnly
runtime.spanOf
runtime.wbBufFlush1 (inline)
runtime.heapBitsForAddr
runtime.heapBits.next
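All of the hot functions above are Go runtime and garbage-collector internals (object scanning, write barriers, heap bitmap walking), which suggests the time is going into allocation and GC rather than into the query itself. For anyone who wants to capture an equivalent profile, Grafana can expose the standard Go pprof endpoint when started with profiling enabled; the flag names below reflect our understanding of recent versions and may differ:

grafana server --profile --profile-port=6060
go tool pprof "http://localhost:6060/debug/pprof/profile?seconds=30"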
We are seeking assistance in diagnosing and resolving this issue, as it severely impacts the usability of Grafana for monitoring and visualization tasks involving data from a remote InfluxDB server. Any insights, suggestions, or fixes would be greatly appreciated.
What did you expect to happen?
We expected CPU usage to remain at a manageable level, ensuring that the Grafana server stays responsive and functional.
Did this work before?
Yes. No such performance degradation was observed with Grafana v8.3.3; the issue appears to have been introduced somewhere in the updates leading up to v10.4.1, as the performance problems began only after this upgrade.
How do we reproduce it?
SELECT * FROM "autogen"."ClientData" WHERE $timeFilter GROUP BY "mac"
or
SELECT * FROM "autogen"."ClientData" WHERE $timeFilter GROUP BY "ip"
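For anyone checking whether Flux shows the same behaviour, a roughly equivalent Flux query would be the following; "mydb" is a placeholder for the actual database name, paired with the "autogen" retention policy used above:

from(bucket: "mydb/autogen")
  |> range(start: v.timeRangeStart, stop: v.timeRangeStop)
  |> filter(fn: (r) => r._measurement == "ClientData")
  |> group(columns: ["mac"])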
Is the bug inside a dashboard panel?
No response
Environment (with versions)?
Grafana: v10.4.1 (d94d597d84)
InfluxDB: 1.8.10 (note: the InfluxDB server is remote)
OS: Ubuntu 22.04.1
Browser: Chrome 121.0.6167.187 (Official Build) (64-bit)
Grafana platform?
Docker
Datasource(s)?
No response