interfan7 opened 6 months ago
Hi @interfan7,
How many metrics (i.e. whisper files) does this instance serve? OOM doesn't always mean a bug; go-carbon was designed to use memory instead of disk.
PS: you can enable the pprof interface in the config, then you can take heap dumps and investigate them with `go tool pprof`.
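For reference, a minimal sketch of what enabling pprof might look like in go-carbon.conf, assuming the [pprof] section layout from the example config shipped with go-carbon (adjust the listen address to your environment):

```toml
# Illustrative go-carbon.conf excerpt: expose the Go pprof HTTP endpoints
# on localhost only, so heap profiles can be pulled on demand.
[pprof]
listen = "localhost:7007"
enabled = true
```

With that enabled, a heap profile can then be fetched and browsed with something like `go tool pprof http://localhost:7007/debug/pprof/heap`.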
@deniszh The number of WSP files is 1,508,149. At least about 200,000 of them are only occasionally fed with datapoints.
I can count how many were updated or accessed in the last 24 hours if that would help.
@deniszh I'm not familiar with pprof. That will require some ramp-up time for me. I'll try when I can.
That's not much; I can check our prod memory consumption to compare. OTOH we're using the trie index and trigram is disabled, iirc. Pprof is a great tool for live debugging of Go programs; try to enable it on localhost and experiment. You can even try it on a laptop.
@deniszh I've fetched a heap profile from pprof; the SVG file is attached. When opened locally in a browser it's very convenient to zoom and move through it.
Would you mind telling me whether anything interesting/suspicious is observable in it?
Once we configured it to be the target of our whole prod, it takes go-carbon not much longer than a day to reach OOM.
We plan to change the instance type to go from 64GB to 128GB and see whether the memory consumption stops at some point. As you said, OOM doesn't necessarily mean there is a memory leak, but it's notable that the memory occupation grows steadily over quite a long time; that's why we thought it might be a leak.
@interfan7 : that's a memory snapshot, and one snapshot doesn't give you much info. It's more interesting how it changes over time, i.e. what exactly grows. BTW, I checked our prod servers: for example, for 4M metrics I see that go-carbon consumes 20-30GB RAM. Why do you use such huge 'max-metrics-globbed' and 'max-metrics-rendered'? If I'm reading the SVG right, half of your data is glob cache. We're perfectly fine using
max-metrics-rendered = 10001
max-metrics-globbed = 90000
Defaults are less strict, but your numbers are unusually high.
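For context, these two limits live in the [carbonserver] section of go-carbon.conf. A minimal, illustrative sketch using the values quoted above (the listen address and enabled flag are assumptions, not part of the comment):

```toml
# Illustrative [carbonserver] excerpt; only the two limit values come from
# the discussion above, the other fields are assumed typical settings.
[carbonserver]
listen = "127.0.0.1:8080"
enabled = true
# Cap on how many metric names a single find/glob query may expand to.
max-metrics-globbed = 90000
# Cap on how many metrics a single render request may return.
max-metrics-rendered = 10001
```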
@deniszh

> go-carbon consumes 20-30GB RAM

How do you see that? There are various ways to measure a service's/process's memory occupation.

> Why do you use such huge 'max-metrics-globbed' and 'max-metrics-rendered'?

I think when we set up the node, the Grafana users complained that they lacked data or metrics in the results, and raising these values seemed to resolve it. However, we just set a very high value without gradual try-and-see cycles. Having said that, if that's the cause of the high memory usage, then why does the usage rise steadily instead of fluctuating over time? That's why we thought there might be leaks.
I'll get heap profiles at two more points in time between the service's start and its "end" (i.e. somewhat before the OOM). I've read that pprof is capable of comparing profiles.
Hi @interfan7,
have you tried increasing the config attributes max-cpu and workers? If processing can't keep up with the query rate and load, memory consumption could increase.
I presume that your prod machine has more than 4 vCores to handle 128GB of RAM.
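As a reference, a minimal sketch of where those two attributes sit in go-carbon.conf, assuming the layout of the example config (the values shown are placeholders, not recommendations):

```toml
# Illustrative excerpt: max-cpu is under [common], the whisper writer
# worker count is under [whisper]; tune both to the host's core count.
[common]
# Upper bound on CPU cores the Go runtime may use.
max-cpu = 8

[whisper]
# Number of workers writing datapoints to whisper files.
workers = 8
```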
@flucrezia The cores actually seem relaxed, so I hadn't thought they could be an issue.
I decreased the two params mentioned above about 2 days ago and I want to see whether the memory grows to 100GB+ again.
If I conclude that reducing those params doesn't resolve the issue, at least not on a 128GB machine, then I may try your suggestion 🙏🏻
Describe the bug
When the service is killed by the OS due to OOM, systemd automatically starts it again. Then the memory consumption on the machine steadily increases for 8-9 days until the next OOM.
Logs
I've not noticed anything particularly unusual in the logs. The OOM message appears in the system logs (dmesg etc.). I'm happy to provide specific greps/messages; otherwise the log is huge.
Go-carbon Configuration:
go-carbon.conf:
storage-schemas.conf:
storage-aggregation.conf files:
I wonder whether the fields max-size, max-metrics-globbed or max-metrics-rendered have to do with the issue.
Additional context
The carbonapi service also runs on the same server. We have an identical dev server, but its carbonapi is almost never queried. Interestingly, we don't have this issue on the dev server, which suggests the issue has to do with queries. Here is the memory usage graph for prod (left) and dev (right), side by side, for a period of 22 days:
In addition, the systemd status also indicates a considerable difference, although the prod service has been active for only about 1.5 days.
Dev:
Prod:
Although that makes sense, since there are almost no queries on the dev server.