grafana / grafana

The open and composable observability and data visualization platform. Visualize metrics, logs, and traces from multiple sources like Prometheus, Loki, Elasticsearch, InfluxDB, Postgres and many more.
https://grafana.com
GNU Affero General Public License v3.0
63.87k stars 11.96k forks source link

high grafana cpu usage (80-150%) - with go cpu profile attached #92429

Open bob454522 opened 2 weeks ago

bob454522 commented 2 weeks ago

What happened?

issue: Unusually high cpu usage for grafana, also since 10.2.3 to 11.1.3 update cpu usage got even higher (nearly doubled with no other changes). image

Im the only user of grafana, although i do have 4x dashboards that are displayed 24/7 on monitors (via different client PCs that connect to grafana via the local network - ie all are on same 172.17.2.X subnet)) - i have tried turning these OFF but i still see the high cpu usage. The server is vmware esxi based on a E5-2680 v4 , 256g ecc ram, with low overall cpu usage (ie ~10-20% for entire system). Grafana VM has 4x vCPUs and 6gb ram.

I do have about 7x different machines report data to Grafana Live , however the total messages/min (in total for all reporters) is about 600 messages per 1 min (so about 10 per sec). (note this rate was still the case while on grafana v9 and v8 and i did not see such high grafana cpu usages, ie usage was around 10-30% back then)

Via troubleshooting this over the past few months, I tried switching to using WebSocket -> Grafana Live (instead of curl POSTS -> grafana live). this made NO impact/change. The reason i tired this - it seemed to me from the go CPU profile results that the source of the high grafana CPU usage was related to the http POSTs -> grafana live, however after a go cpu profile while using websockets-> Grafana Live, it now seems to be related to API Auth (im not a coder, nor familiar with GO language, i'm just making a guess based on the flame graphs / flow chart from go cpu profile)

IE based on data like this: image

Here is a Go cpu profile (when i was using curl POST -> grafana live) :

cpu_profile.zip profile.pb.gz

And a trace level log i took around this same time: grafana.log

image

Here is a Go cpu profile (when i was using webSockets -> grafana live) :

profile_webSockets.pb.gz

flame graph: image

(nb; to see additional "verbose" details please see this post that i have now closed and moved to this new post, so as to make the issue more clear and concise): https://github.com/grafana/grafana/issues/81401

Im free To provide any additional details. thank you

What did you expect to happen?

lower cpu usage as some users are sending 100s of mesages per sec to grafana live (If Gruffana Live is even the source of this high CPU usage, I'm not even sure of the source)

Did this work before?

in grafana v9 i did not see such high cpu usage for the grafana process (and grafana was under very similar usage)

How do we reproduce it?

  1. (not 100% sure of source)

Is the bug inside a dashboard panel?

No response

Environment (with versions)?

Grafana: 11.1.3 OS: Debian 8.11 (also same issue with a brand new Debian 12 install, and Temporarily moving Grafana over to that setup) Browser: chrome and firefox

Grafana platform?

A downloaded binary

Datasource(s)?

Graphite / influxDB/ grafana Live streaming

marefr commented 2 weeks ago

It sounds/looks like you overload Grafana using the live feature. Profiles seems like you're using API keys for sending data to Grafana Live, please try migrate to service accounts since API keys is deprecated and going away https://grafana.com/docs/grafana/latest/administration/service-accounts/#service-accounts

v11.2.0 has been released, feel free to try that as well.

kalleep commented 2 weeks ago

Hey assuming you are using a service account token I think I have found at least one issue.

It seems like since https://github.com/grafana/grafana/pull/79254 (10.3.x) we are hashing the token twice for each request. Once to perform authentication and then once more to update last used. I will make a fix for that. But There could be other issues as well so I will keep investigating

bob454522 commented 2 weeks ago

It sounds/looks like you overload Grafana using the live feature. Profiles seems like you're using API keys for sending data to Grafana Live, please try migrate to service accounts since API keys is deprecated and going away https://grafana.com/docs/grafana/latest/administration/service-accounts/#service-accounts

v11.2.0 has been released, feel free to try that as well.

Profiles seems like you're using API keys for sending data to Grafana Live, please try migrate to service accounts...

Im pretty sure i am already making use of Service Accounts (I know I've created them on the grafana side, and in my code that sends the data to Grafana Live, my tokens look like this: -H="Authorization:Bearer glsa_Vz9qmAA....

(im Assuming the GLSA part means grafana X service account?) ie full sample:

echo "$ws_data" | websocat -H="Authorization:Bearer glsa_Vz9qmAATNcXqAVJI3nXXXX" ws://172.17.69.40:3000/api/live/push/livesshBW_"$location"

or in my python examples:

GRAFANA_LIVE_MEASUREMENTS_URL = 'ws://172.17.69.40:3000/api/live/push/iotaMain'
GRAFANA_API_TOKEN = 'glsa_Vz9qmAATNcXq...'
...
try:
    ws = create_connection(GRAFANA_LIVE_MEASUREMENTS_URL, header=["Authorization: Bearer {}".format(GRAFANA_API_TOKEN)], timeout=5)
    ws.send(line_protocol)
...

v11.2.0 has been released, feel free to try that as well.

Im now on 11.2.0 release (as of today) - cpu usage is the same (ie still elevated) no change vs 11.1.x

It sounds/looks like you overload Grafana using the live

That is possible, however while im sending from about 8x different sources into grafana live, most are under 10-15 Messages/min (with the exception of 3x that are 100-400 msg/min)

if i stop the 100-400/min senders, i still see high grafana cpu usage (ie maybe grafana goes from ~80-160% to ~70% usage)

thank you

kalleep commented 2 weeks ago

@bob454522 https://github.com/grafana/grafana/pull/92598 should fix this issue. I have not found anything else that has changed to cause this. I have backported it to 11.2.x so it will be included in next 11.2 release

bob454522 commented 2 weeks ago

thanks, not sure if this is a valid test, im doing here, but i tried disabling auth fully, ie in my grafana.ini:

[auth.anonymous]
enabled = true
org_name = Main Org.
org_role = Admin

(then restart grafana + apache -- i say maybe not a valid test in the sense that perhaps the hashing you are referring to is still going on inside of grafana, even with no auth in use)

then changed the code for my 3x largest grafana live senders (ie the 3x that send about 200 msg/min each), so that the AUTH header is not sent at all:

echo "$ws_data" | websocat ws://172.17.69.40:3000/api/live/push/sshBW_"$location"

instead of prior:
 echo "$ws_data" | websocat -H="Authorization:Bearer glsa_Vz9qmAATN....

and it did not make any change to the grafana process's cpu usage (see graph, this is of the grafana process itself), the dip is when i stopped all 3x senders, made the changes to their code, then restarted them (which took a few minutes, thus the drop during the time of none being sent from these 3x).

Image

(a longer term graph, fwiw- about 3am is when i upgraded from grafana 11.1.3 to 11.2) Image

bob454522 commented 1 week ago

@bob454522 #92598 should fix this issue. I have not found anything else that has changed to cause this. I have backported it to 11.2.x so it will be included in next 11.2 release

excellent- i tried running the 11.3.0 nightly (11.3.0-75433_75433_linux_amd64.deb) , and did see grafana cpu usage go down by almost half!

(im still not clear on why with 11.2: fully disabling grafana auth + removing the header: Authorization:Bearer gl... from my code that sends data to grafana , did not change the cpu usage at all, but clearly the issue is related to token hashing, as 11.3 cpu looks more reasonable).

(11.2.0 release, vs 11.3.0 nightly): Image

Image

kalleep commented 1 week ago

Great. Yeah that bit confuses me as well. Because without the header we never try to decode and hash the token. Maybe there is something else that is also contributing to cpu usage that was fixed in 11.3. But the the profiles you provided it should be related to token hashing..