go-graphite / go-carbon

Golang implementation of Graphite/Carbon server with classic architecture: Agent -> Cache -> Persister
MIT License

[Q] Why is one of the 8 go-carbon nodes in the cluster experiencing too much read load while the others are normal? #507

Open nadeem1701 opened 1 year ago

nadeem1701 commented 1 year ago

We have a carbon-graphite cluster with 2 carbon-c-relays and 8 go-carbon nodes. Recently, we have been noticing alarms for high CPU load on one of the worker nodes. Upon investigation, we found that go-carbon is generating too much I/O read load: its read load is approximately equivalent to that of the other 7 nodes combined.

It should be noted that we do not use go-carbon to fetch metrics from the cluster; we use graphite-webapp (the Python version) for that purpose. It is not the cause of the I/O issue, as we have done per-process CPU analysis.
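
(For anyone wanting to reproduce this kind of per-process attribution, here is a rough sketch using standard Linux tools; pidstat and iotop are my suggestions, not tools named in this thread:)

```
# pidstat (sysstat): per-process disk read/write rates, 1s samples, 5 rounds
pidstat -d 1 5

# iotop: batch mode, per process (not per thread), only tasks doing I/O
iotop -obP -n 3
```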

[Screenshot from 2022-12-08 13-56-24: per-node read load graph]

I need help identifying the root cause (RCA) of this abnormality: one worker node with the same HW and SW configuration as the others behaves differently.

go-carbon version: 0.14.0
graphite-webapp version: 1.2.0

deniszh commented 1 year ago

Hi @nadeem1701

Different load means that read or write load is skewed somehow, and that usually happens because of the read and write configuration (i.e. your relay and graphite-web) and not go-carbon itself. Are you sure that node 7 is participating in the reads coming from graphite-web? Could you please share (anonymized) configs for both your relay and graphite-web?

deniszh commented 1 year ago

Ah, I misread the graph. Node 7 is getting almost no traffic and node 2 is overloaded. Well, default graphite sharding is not really uniform; it's better to use jump hash for that. But please note that graphite-web does not support jump hash directly; you'll need to connect graphite-web to the carbonservers (port 8080) on go-carbon using CLUSTER_SERVERS then.
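
For reference, a rough sketch of that direction (my illustration, not config from this thread; jump_fnv1a_ch is the jump-hash cluster type in recent carbon-c-relay versions, and server order matters with jump hash, so append new nodes rather than reordering; verify details against your versions' docs):

```
# carbon-c-relay sketch: jump hash instead of fnv1a_ch
cluster carbon
    jump_fnv1a_ch
        172.22.1.1:2003
        172.22.1.2:2003
        172.22.1.3:2003
        172.22.1.4:2003
        172.22.1.5:2003
        172.22.1.6:2003
        172.22.1.7:2003
        172.22.1.8:2003
    ;
```

```toml
# go-carbon sketch: enable carbonserver so graphite-web can query it directly
[carbonserver]
listen = "0.0.0.0:8080"
enabled = true
```

```python
# graphite-web sketch: point CLUSTER_SERVERS at the carbonserver ports
CLUSTER_SERVERS = ["172.22.1.1:8080", "172.22.1.2:8080", "172.22.1.3:8080",
                   "172.22.1.4:8080", "172.22.1.5:8080", "172.22.1.6:8080",
                   "172.22.1.7:8080", "172.22.1.8:8080"]
```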

nadeem1701 commented 1 year ago

Thank you @deniszh for your very quick response.

The metric values in the legend are the last values at a given time, so we cannot say that Node#7 is getting the least/no traffic; it gets a relatively fair amount of traffic (cyan-colored line).

We do not use carbonserver to fetch metrics from the cluster. We have graphite-webapp running on all worker nodes, and graphite-webapp with relay configurations on the relay nodes. We can say that we use go-carbon to write metrics and graphite-webapp to read them. If the Python-based webapp were causing the read load on the CPU, that would have been understandable; in this case, go-carbon is stressing the CPU with reads. We use fnv1a for hashing and did not expect this much imbalance.
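
As a rough way to check raw key skew, here is a toy Python sketch (my own illustration, not carbon-c-relay's algorithm: fnv1a_ch places servers on a consistent-hash ring rather than taking a plain modulo, so actual shard placement will differ; the metric names below are hypothetical):

```python
# Toy skew check: spread metric names over 8 buckets with plain FNV-1a.
# NOTE: carbon-c-relay's fnv1a_ch uses a consistent-hash ring, not modulo,
# so this only gives a feel for raw key skew, not actual placement.
from collections import Counter

def fnv1a_32(data: bytes) -> int:
    h = 0x811C9DC5                          # FNV-1a 32-bit offset basis
    for b in data:
        h ^= b
        h = (h * 0x01000193) & 0xFFFFFFFF   # FNV prime, kept to 32 bits
    return h

def bucket_counts(metric_names, n_servers=8):
    counts = Counter(fnv1a_32(m.encode()) % n_servers for m in metric_names)
    return [counts[i] for i in range(n_servers)]

if __name__ == "__main__":
    # Hypothetical names; in practice feed the real metric index here.
    names = [f"servers.host{h:03d}.cpu.core{c}.user"
             for h in range(200) for c in range(8)]
    print(bucket_counts(names))
```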

relay-configs:

```
cluster carbon
    fnv1a_ch dynamic
        172.22.1.1:2003=a
        172.22.1.2:2003=b
        172.22.1.3:2003=c
        172.22.1.4:2003=d
        172.22.1.5:2003=e
        172.22.1.6:2003=f
        172.22.1.7:2003=g
        172.22.1.8:2003=h
    ;

match *
    send to carbon
    ;

statistics
    submit every 60 seconds
    reset counters after interval
    ;
```

Graphite-web:

```python
LOG_ROTATION = True
LOG_ROTATION_COUNT = 1
DEFAULT_XFILES_FACTOR = 0
CLUSTER_SERVERS = ["172.22.1.1", "172.22.1.2", "172.22.1.3", "172.22.1.4",
                   "172.22.1.5", "172.22.1.6", "172.22.1.7", "172.22.1.8"]
USE_WORKER_POOL = True
REMOTE_STORE_MERGE_RESULTS = True
CARBONLINK_HASHING_TYPE = 'fnv1a_ch'
FUNCTION_PLUGINS = []
```

deniszh commented 1 year ago

Hi @nadeem1701, thanks! Could you please share your go-carbon config as well? TBH I'm a bit confused how your setup works. What process listens on port 80 on the 172.22.1.x servers?

nadeem1701 commented 1 year ago

We have graphite-web running on port 80 of all 172.22.1.x (worker nodes). That is where the relay server's graphite-web connects to fetch metrics. This might add some context:

[Image: diagram of the relay/worker setup]

and go-carbon configs:

```toml
[common]
user = "carbon"
graph-prefix = "carbon.agents.{host}"
metric-endpoint = "local"
metric-interval = "1m0s"
max-cpu = 3

[whisper]
data-dir =
schemas-file =
aggregation-file =
workers = 6
max-updates-per-second = 0
max-creates-per-second = 0
hard-max-creates-per-second = false
sparse-create = false
flock = false
enabled = true
hash-filenames = true

[cache]
max-size = 1000000
write-strategy = "max"

[udp]
enabled = false

[tcp]
listen = "0.0.0.0:2003"
enabled = true
buffer-size = 0

[pickle]
enabled = false

[carbonlink]
listen = "0.0.0.0:7002"
enabled = true
read-timeout = "30s"

[grpc]
enabled = false

[tags]
enabled = false

[carbonserver]
enabled = false

[pprof]
enabled = false
```
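
(Side note: the [carbonlink] section above is what lets graphite-web pull not-yet-flushed points from the go-carbon cache. A minimal sketch of the matching settings in each worker's local_settings.py, assuming each local graphite-web queries its own go-carbon; the values here are illustrative, not from the thread:)

```python
# Each worker's graphite-web local_settings.py (sketch):
# fetch not-yet-flushed points from the local go-carbon cache.
CARBONLINK_HOSTS = ["127.0.0.1:7002"]     # go-carbon's [carbonlink] listener
CARBONLINK_HASHING_TYPE = 'fnv1a_ch'      # must match the relay's hash type
```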

deniszh commented 1 year ago

@nadeem1701 : ah, got it. Does the main graphite-web have the same config as you posted above?

nadeem1701 commented 1 year ago

Yes, the graphite-web configs shared earlier are the relay graphite-web's. It queries the graphite-web running on all 8 worker nodes and returns the collected metrics.

deniszh commented 1 year ago

If the local graphite-webs share the same set of IPs, I think you need to set REMOTE_EXCLUDE_LOCAL=True to avoid loops, IIRC. And the main graphite-web can be excluded then - you can send requests to all the graphite-webs to balance load. But besides that I see no issues with your config TBH, and I don't know why it would cause this imbalance.
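
For illustration, a minimal sketch of that change in each worker's local_settings.py (REMOTE_EXCLUDE_LOCAL is a standard graphite-web setting; the server list is the one from this thread):

```python
# Keep the full cluster list, but don't fetch from ourselves -
# avoids query loops when every node lists every node.
CLUSTER_SERVERS = ["172.22.1.1", "172.22.1.2", "172.22.1.3", "172.22.1.4",
                   "172.22.1.5", "172.22.1.6", "172.22.1.7", "172.22.1.8"]
REMOTE_EXCLUDE_LOCAL = True
```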