go-graphite / go-carbon

Golang implementation of Graphite/Carbon server with classic architecture: Agent -> Cache -> Persister
MIT License

[Q] Internal cache continuously increasing although the incoming metrics are constant. #426

Open rickyari opened 3 years ago

rickyari commented 3 years ago

We have a cluster of 9 nodes and the internal cache limit for each node is 20 Mil. Even though the incoming metrics are stable, the internal cache keeps increasing rapidly, to the point where the cluster nodes stop accepting incoming metrics. As an immediate workaround, we increase the internal cache limit on the nodes and restart the go-carbon service. I would like your guidance in troubleshooting this issue and finding the root cause so that it can be closed permanently.

bom-d-van commented 3 years ago

Hi @rickyari what version of go-carbon are you using? Can you also share the configs of the go-carbon instance?

Also, how many metrics are you serving per instance in your cluster? How about disk usage? Are there any errors shown in the logs?

You should also check the I/O utilization and iowait time of the servers.

rickyari commented 3 years ago

@bom-d-van Thanks for your reply. I have blackholed a few metrics that were corrupted, and that has helped us maintain a steady internal cache. The only problem I see now is that the cache is not getting cleared even though the incoming metrics are not that large in number. The internal cache always remains around 38 MIL for each node although the incoming metrics are only 15 MIL. Is there a way to flush the metrics out of the cache? I have already tried restarting the service and the server itself, but the cache does not reduce. Can you help me with that? I don't mind if we don't get the metrics to disk, as the priority is to clear the cache.

bom-d-van commented 3 years ago

Hi @rickyari, can you share some screenshots or data for cache.queueWriteoutTime and cache.overflow, as well as the CPU, memory, iowait, and ioutil of the problematic instance?
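For reference, go-carbon publishes these counters as internal stats. Assuming the default graph-prefix = "carbon.agents.{host}" from the [common] section (the exact prefix and stat names can vary slightly between versions and configs), the series would look roughly like:

carbon.agents.<host>.cache.size              # data points currently buffered in memory
carbon.agents.<host>.cache.metrics           # distinct metric names held in the cache
carbon.agents.<host>.cache.overflow          # points dropped because max-size was reached
carbon.agents.<host>.cache.queueWriteoutTime # seconds taken per pass to flush the cache to disk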

If the memory usage is stable, it might not be an issue; it may just be that the instance is over capacity.

rickyari commented 3 years ago

@bom-d-van Please find the screenshots below. All of the nodes are having issues, so I am posting the screenshots for all of them.

[screenshots: cache.queueWriteoutTime, cache.overflow, CPU, and memory graphs for all nodes]

bom-d-van commented 3 years ago

cache.overflow had a spike, which means it was dropping data. What's your cache.max-size configuration? You can try increasing it if you have enough memory to spare on the instance. If this happens frequently, you might want to consider expanding your cluster.

[cache]
# Limit of in-memory stored points (not metrics)
max-size = 50000000
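For example (the values are only illustrative and assume you have the RAM to back them; the write-strategy options below are the ones documented in the stock go-carbon.conf), an expanded cache section might look like:

[cache]
# Limit of in-memory stored points (not metrics); raise only if memory allows
max-size = 100000000
# Which cached metrics get flushed first: "max" writes the metrics with the
# most unwritten points first; "sorted" and "noop" are the other options
write-strategy = "max"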

So the internal cache always remain around 38 MIL for each node although the incoming metrics are only 15 MIL.

Just to be sure, what metrics are you referring to here?

The peak of cache.queueWriteoutTime is 51, which means it would take 51 seconds for go-carbon to flush the data in the cache to disk. During that time, new data points keep being cached in memory.

If by 38 MIL you meant cache.size, then it's probably normal, because it takes time for go-carbon to flush data to disk. As suggested above, you might have to either expand the cluster or increase the memory cache size.
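As a rough back-of-envelope (my own approximation, not something from the go-carbon docs):

steady-state cache.size ≈ incoming data points per second × cache.queueWriteoutTime (seconds)

so a longer writeout time directly translates into a larger resident cache at the same ingest rate.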


In the CPU usage graph you shared, there is no breakdown of system, user, and iowait time, so I'm not sure of the true capacity of your instance.

rickyari commented 3 years ago

@bom-d-van Our cluster runs on i3.xlarge EC2 instances. Here is what I did on a couple of them:

  1. Stopped the EC2 instances and started them again after a few minutes.
  2. This way we got new NVMe ephemeral disks attached to the instances.
  3. Formatted the disks with an XFS filesystem and started the go-carbon service on the nodes.

What I have found after doing this is that the cache size has decreased considerably: the size for these nodes is now around 4 Mil, while the other nodes are still running with a cache size of 38 Mil.

I just wanted to understand how re-formatting the underlying disk decreased the cache size, whereas restarting the service/server does not flush the cache.


bom-d-van commented 3 years ago

Just wanted to check how re-formatting the underlying disk has decreased the cache size whereas restarting the service/server does not flush out the cache.

Interesting. Is it because it's using XFS while the others are using a different file system?

Did you compare how the disk I/O on each box is performing? I think it might be due to I/O performance.