nadeem1701 opened 1 year ago
Hi @nadeem1701
Different load means that read or write load is skewed somehow, and usually that happens because of the read and write configuration (i.e. your relay and graphite-web) and not go-carbon itself. Are you sure that node 7 is participating in reads coming from graphite-web? Could you please share (anonymized) configs for both your relay and graphite-web?
Ah, I misread the graph. Node 7 is getting almost no traffic and node 2 is overloaded. Well, default graphite sharding is not really uniform; it's better to use jump hash for that. But please note that graphite-web does not support jump hash directly, so you'd need to connect graphite-web to the carbonserver (port 8080) in go-carbon via CLUSTER_SERVERS then.
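A rough sketch of that setup (hostnames are placeholders; carbon-c-relay's jump_fnv1a_ch cluster type, if I read its docs correctly, wants numeric instance labels to fix each server's position in the hash space):

```
# carbon-c-relay (sketch): jump hash instead of fnv1a_ch
cluster carbon
    jump_fnv1a_ch
        node1:2003=0
        node2:2003=1
    ;
match *
    send to carbon
    ;
```

and on the graphite-web side, pointing CLUSTER_SERVERS at go-carbon's carbonserver port instead of at per-node graphite-web:

```python
# local_settings.py (sketch): query go-carbon's carbonserver directly
CLUSTER_SERVERS = ["node1:8080", "node2:8080"]
```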
Thank you @deniszh for your very quick response.
The metric values in the legend are the last values at the given time, so we cannot say that Node#7 is getting the least/no traffic. It gets a relatively fair amount of traffic (cyan-colored line).
We do not use carbonserver to fetch metrics from the cluster. We have graphite-web running on all worker nodes, and a graphite-web with relay configuration on the relay nodes. You could say that we use go-carbon to write metrics and graphite-web to read them. If the Python-based webapp were causing read load on the CPU, that would be understandable; in this case, however, go-carbon itself is stressing the CPU with reads. We use fnv1a for hashing and did not expect this much imbalance.
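For illustration, here is a rough simulation of why a consistent-hash ring gives uneven shares. This is not carbon-c-relay's actual fnv1a_ch implementation, just a generic ring with a configurable number of replica points per node; it shows that a ring with finitely many points never splits the key space perfectly evenly:

```python
# Illustrative only: place each node at `replicas` pseudo-random points on
# a hash ring, route random metric names to the next point clockwise, and
# count how many keys each node receives.
import hashlib
import random
import string
from bisect import bisect
from collections import Counter

def build_ring(nodes, replicas):
    points = []
    for node in nodes:
        for r in range(replicas):
            h = int(hashlib.md5(f"{node}:{r}".encode()).hexdigest(), 16)
            points.append((h, node))
    return sorted(points)

def locate(points, key):
    h = int(hashlib.md5(key.encode()).hexdigest(), 16)
    idx = bisect(points, (h,)) % len(points)  # wrap around the ring
    return points[idx][1]

nodes = [f"node{i}" for i in range(1, 9)]
ring = build_ring(nodes, replicas=100)
keys = ("".join(random.choices(string.ascii_lowercase, k=24)) for _ in range(100_000))
print(Counter(locate(ring, k) for k in keys).most_common())  # visibly unequal shares
```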
relay-configs:
```
####################################################
cluster carbon
    fnv1a_ch dynamic
        172.22.1.1:2003=a
        172.22.1.2:2003=b
        172.22.1.3:2003=c
        172.22.1.4:2003=d
        172.22.1.5:2003=e
        172.22.1.6:2003=f
        172.22.1.7:2003=g
        172.22.1.8:2003=h
    ;

match *
    send to carbon
    ;

statistics
    submit every 60 seconds
    reset counters after interval
    ;
#################################################
```
Graphite-web:
```python
LOG_ROTATION = True
LOG_ROTATION_COUNT = 1
DEFAULT_XFILES_FACTOR = 0
CLUSTER_SERVERS = ["172.22.1.1", "172.22.1.2", "172.22.1.3", "172.22.1.4",
                   "172.22.1.5", "172.22.1.6", "172.22.1.7", "172.22.1.8"]
USE_WORKER_POOL = True
REMOTE_STORE_MERGE_RESULTS = True
CARBONLINK_HASHING_TYPE = 'fnv1a_ch'
FUNCTION_PLUGINS = []
```
Hi @nadeem1701, thanks! Could you please share your go-carbon config as well? TBH I'm a bit confused about how your setup works. What process listens on port 80 on the 172.22.1.x servers?
We have graphite-web running on port 80 of all 172.22.1.x (worker) nodes. That is where the relay server's graphite-web connects to fetch metrics. This might add some context:
and go-carbon configs:
```toml
[common]
user = "carbon"
graph-prefix = "carbon.agents.{host}"
metric-endpoint = "local"
metric-interval = "1m0s"
max-cpu = 3

[whisper]
data-dir =

[cache]
max-size = 1000000
write-strategy = "max"

[udp]
enabled = false

[tcp]
listen = "0.0.0.0:2003"
enabled = true
buffer-size = 0

[pickle]
enabled = false

[carbonlink]
listen = "0.0.0.0:7002"
enabled = true
read-timeout = "30s"

[grpc]
enabled = false

[tags]
enabled = false

[carbonserver]
enabled = false

[pprof]
enabled = false
```
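(Note that [carbonserver] is disabled here. If you later try the carbonserver route suggested above, a minimal sketch of enabling it, assuming the default port 8080, could look like the following; check the go-carbon example config for the exact option names:)

```toml
[carbonserver]
# sketch only: enables go-carbon's built-in read endpoint so graphite-web
# can query it via CLUSTER_SERVERS instead of per-node graphite-web
listen = "0.0.0.0:8080"
enabled = true
scan-frequency = "5m0s"  # how often to rescan the whisper file tree
```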
@nadeem1701: ah, got it. Does the main graphite-web have the same config as you posted above?
Yes, the graphite-web config shared earlier is the relay node's graphite-web. It queries the graphite-web instances running on all 8 worker nodes and returns the collected metrics.
If the local graphite-web instances share the same set of IPs in CLUSTER_SERVERS, I think you need to set REMOTE_EXCLUDE_LOCAL=True to avoid loops, IIRC. The main graphite-web can be excluded then; you can send requests to all graphite-webs to balance load. A sketch of that setting follows.
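A minimal sketch, assuming each worker's local_settings.py lists all eight workers (REMOTE_EXCLUDE_LOCAL is a standard graphite-web setting; the rest mirrors the config posted above):

```python
# local_settings.py on each worker node (sketch)
CLUSTER_SERVERS = ["172.22.1.1", "172.22.1.2", "172.22.1.3", "172.22.1.4",
                   "172.22.1.5", "172.22.1.6", "172.22.1.7", "172.22.1.8"]
# Skip this host's own entry when fanning out remote find/fetch requests,
# so a worker does not query itself over HTTP and create loops.
REMOTE_EXCLUDE_LOCAL = True
```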
But besides that, I see no issues with your config TBH, and I don't know why it would cause an imbalance.
We have a carbon-graphite cluster with 2 carbon-c-relays and 8 go-carbon nodes. Recently, we have been noticing alarms for high CPU load on one of the worker nodes. Upon investigation, we found that go-carbon is generating too much I/O read load there, roughly equivalent to the other 7 nodes combined.
It should be noted that we do not use go-carbon to fetch metrics from the cluster; we use graphite-web (the Python version) for that purpose. It is not causing the I/O issue, as we verified with per-process CPU analysis.
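For reference, one way to attribute cumulative read I/O per process on Linux is /proc/&lt;pid&gt;/io; a small sketch (assuming Linux and sufficient permissions, not the exact tooling we used):

```python
# Sketch: rank processes by cumulative bytes read from storage
# (read_bytes in /proc/<pid>/io counts actual block-device reads).
import os

def read_bytes(pid):
    try:
        with open(f"/proc/{pid}/io") as f:
            for line in f:
                if line.startswith("read_bytes:"):
                    return int(line.split()[1])
    except (FileNotFoundError, PermissionError, ProcessLookupError):
        return None
    return None

stats = []
for pid in filter(str.isdigit, os.listdir("/proc")):
    rb = read_bytes(pid)
    if rb:
        try:
            with open(f"/proc/{pid}/comm") as f:
                name = f.read().strip()
        except OSError:
            continue
        stats.append((rb, pid, name))

for rb, pid, name in sorted(stats, reverse=True)[:10]:
    print(f"{name:20s} pid={pid:>7s} read={rb / 1e9:.2f} GB")
```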
I need help identifying the root cause of this abnormality, where one of the worker nodes behaves differently despite having the same HW and SW configuration as the others.
go-carbon version: 0.14.0
graphite-webapp version: 1.2.0