graphite-project / graphite-web

A highly scalable real-time graphing system
http://graphite.readthedocs.org/
Apache License 2.0

[Q] Metric dropouts when using GRAPHITE_CLUSTER_SERVERS #2751

Closed kgroshert closed 1 year ago

kgroshert commented 2 years ago

I'm not sure how to debug this, any help would be appreciated.

I have a chain of docker-containers:

Grafana (Host A) -> Graphite-Web (Host B) -> Graphite-Web Cluster Servers (Host C, D, E, ...) with local Go-Graphite

If I graph multiple metrics, sometimes one metric is missing or stops right in the middle of the data (not all datapoints are returned).

I tried to narrow it down by leaving out the Graphite-Web on Host B, in which case the problem never happens. I attached a screenshot: these are exactly the same panels and graphite queries, but the first row uses the graphite-web with CLUSTER_SERVERS and the second one connects Grafana directly to Graphite on Host C:

[screenshot: identical panels, first row rendered via graphite-web with CLUSTER_SERVERS, second row connected directly to Graphite on Host C]

My graphite queries (per panel) look like this:

graphite.hamburg01.Interface_GigabitEthernet1_0_10.out graphite.hamburg01.Interface_GigabitEthernet1_0_10.in

On initial dashboard load, sometimes something is missing; if I press refresh, it usually all works. Therefore I would suspect some kind of caching mechanism in graphite-web.

Here are logs for 6 identical panels. 3 show in+out, 3 show only out (metric 'in' is missing):

==> info.log <==
2022-05-03,13:29:59.889 :: graphite.render.datalib.fetchData :: lookup and merge of "graphite.hamburg01.Interface_GigabitEthernet1_0_10.in" took 0.000227213s
2022-05-03,13:29:59.893 :: graphite.render.datalib.fetchData :: lookup and merge of "graphite.hamburg01.Interface_GigabitEthernet1_0_10.out" took 0.000151157s
2022-05-03,13:29:59.978 :: graphite.render.datalib.fetchData :: lookup and merge of "graphite.hamburg01.Interface_GigabitEthernet1_0_10.in" took 0.000159979s
2022-05-03,13:29:59.981 :: graphite.render.datalib.fetchData :: lookup and merge of "graphite.hamburg01.Interface_GigabitEthernet1_0_10.out" took 0.000142097s
2022-05-03,13:30:00.067 :: graphite.render.datalib.fetchData :: lookup and merge of "graphite.hamburg01.Interface_GigabitEthernet1_0_10.in" took 0.000159025s
2022-05-03,13:30:00.070 :: graphite.render.datalib.fetchData :: lookup and merge of "graphite.hamburg01.Interface_GigabitEthernet1_0_10.out" took 0.000117064s

==> rendering.log <==
2022-05-03,13:29:59.885 :: Fetched data for [graphite.hamburg01.Interface_GigabitEthernet1_0_10.out, graphite.hamburg01.Interface_GigabitEthernet1_0_10.in] in 0.081730s
2022-05-03,13:29:59.894 :: json rendering time 0.000610
2022-05-03,13:29:59.894 :: Total request processing time 0.099860
2022-05-03,13:29:59.973 :: Fetched data for [graphite.hamburg01.Interface_GigabitEthernet1_0_10.out, graphite.hamburg01.Interface_GigabitEthernet1_0_10.in] in 0.071881s
2022-05-03,13:29:59.983 :: json rendering time 0.001315
2022-05-03,13:29:59.983 :: Total request processing time 0.087413
2022-05-03,13:30:00.064 :: Fetched data for [graphite.hamburg01.Interface_GigabitEthernet1_0_10.out, graphite.hamburg01.Interface_GigabitEthernet1_0_10.in] in 0.075265s
2022-05-03,13:30:00.071 :: json rendering time 0.001026
2022-05-03,13:30:00.071 :: Total request processing time 0.086037

==> cache.log <==
2022-05-03,13:29:59.795 :: Request-Cache miss [e73b89076867897730ee78b0c861a8e4]
2022-05-03,13:29:59.795 :: Data-Cache miss [54b2233a1b24554210cfbf27b0b888f7]
2022-05-03,13:29:59.896 :: Request-Cache miss [e73b89076867897730ee78b0c861a8e4]
2022-05-03,13:29:59.896 :: Data-Cache miss [54b2233a1b24554210cfbf27b0b888f7]
2022-05-03,13:29:59.986 :: Request-Cache miss [e73b89076867897730ee78b0c861a8e4]
2022-05-03,13:29:59.986 :: Data-Cache miss [54b2233a1b24554210cfbf27b0b888f7]

My config for the cluster servers looks like this:

GRAPHITE_CLUSTER_SERVERS="http://hostc:3443?format=msgpack,http://hostd:3443?format=msgpack,http://hoste:3443?format=msgpack,http://hostf:3443?format=msgpack,http://hostg:3443?format=msgpack"
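
As far as I understand, the container image maps GRAPHITE_-prefixed environment variables onto the matching settings in local_settings.py, so this should end up as roughly the following (a sketch of the equivalent Python setting, not copied from my actual file):

CLUSTER_SERVERS = [
    "http://hostc:3443?format=msgpack",
    "http://hostd:3443?format=msgpack",
    "http://hoste:3443?format=msgpack",
    "http://hostf:3443?format=msgpack",
    "http://hostg:3443?format=msgpack",
]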

Another thing: it seems to affect only the second metric, never the first one (I never get a completely empty result).

Is there anything I can tune in local_settings to narrow down the problem?
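
For reference, the clustering-related knobs I might experiment with look roughly like this (setting names as I understand them from the graphite-web settings documentation; the values are only illustrative, not what I'm currently running):

USE_WORKER_POOL = True        # fetch from cluster servers via a worker (thread) pool
POOL_MAX_WORKERS = 10         # size of that pool
REMOTE_BUFFER_SIZE = 0        # buffering of remote responses; 0 should disable buffering
REMOTE_RETRY_DELAY = 60.0     # seconds before an unreachable cluster server is retried
FIND_CACHE_DURATION = 300     # how long metric-name find results are cached
STORE_FAIL_ON_ERROR = False   # True should make a render fail instead of silently returning partial data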

Thanks, Kai

kgroshert commented 2 years ago

To try something, I added this option to the frontend graphite-web container:

GRAPHITE_REMOTE_BUFFER_SIZE=0

and this has fixed the problem for now. Is this expected behaviour or a bug?

deniszh commented 2 years ago

Hi @kgroshert

Still no explanation for the behaviour above, but if you're using go-carbon on hosts C, D, E, etc., you can in theory omit graphite-web on those hosts and point the frontend's cluster servers directly at the carbonserver interface of go-carbon.
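
For example, something like this (only a sketch: the port 8000 and format=pickle here are placeholders; use whatever port carbonserver actually listens on in your go-carbon config and a response format it supports, pickle being graphite-web's default for cluster servers):

GRAPHITE_CLUSTER_SERVERS="http://hostc:8000?format=pickle,http://hostd:8000?format=pickle,http://hoste:8000?format=pickle,http://hostf:8000?format=pickle,http://hostg:8000?format=pickle"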

kgroshert commented 2 years ago

After revisiting this, I think I spoke too soon: GRAPHITE_REMOTE_BUFFER_SIZE=0 did not fix it. I will try to reconfigure graphite-web to use carbonserver directly as you suggested and report back.

kgroshert commented 1 year ago

Hi @deniszh,

Sorry for the late answer. I implemented your recommendation to connect the frontend graphite-web directly to the carbonservers on port 8000, and this fixed the problem.

stale[bot] commented 1 year ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.