interfan7 opened 6 months ago
Hi @interfan7,
How many metrics (i.e. whisper files) does this instance serve? OOM doesn't always mean a bug; go-carbon was designed to use memory instead of disk.
PS: you can enable the pprof interface in the config, then you can take heap dumps and investigate them with `go tool pprof`.
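For reference, a minimal sketch of what enabling pprof might look like in go-carbon.conf, assuming the [pprof] section layout from the example config shipped with go-carbon (adjust the listen address to your environment):

```toml
# Illustrative go-carbon.conf excerpt: expose the Go pprof HTTP endpoints
# on localhost only, so heap profiles can be pulled on demand.
[pprof]
listen = "localhost:7007"
enabled = true
```

With that enabled, a heap profile can then be fetched and browsed with something like `go tool pprof http://localhost:7007/debug/pprof/heap`.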
@deniszh The number of WSP files is 1,508,149. At least about 200,000 of them are only occasionally fed with datapoints.
I can count how many were updated or accessed in the last 24 hours if that would help.
@deniszh I'm not familiar with pprof. That will require some ramp-up time for me. I'll try when I can.
That's not much; I can check our prod memory consumption to compare. OTOH we're using the trie index and trigram is disabled, iirc. Pprof is a great tool for live debugging of Go programs; try to enable it on localhost and experiment. You can even try it on a laptop.
@deniszh I've fetched a heap profile from pprof; the SVG file is attached. When opened locally in a browser it's very convenient to zoom and move through it.
Would you mind telling me whether anything interesting/suspicious is observable in it?
Once we configured it to be the target of our whole prod, it takes go-carbon not much longer than a day to reach OOM.
We plan to change the instance type to go from 64GB to 128GB and see whether the memory consumption stops at some point. As you said, OOM doesn't necessarily mean there is a memory leak, but it's notable that the memory occupation grows steadily over quite a long time; that's why we thought it might be a leak.
@interfan7 : that's a memory snapshot, and one snapshot doesn't give you much info. It's more interesting how it changes over time, i.e. what exactly grows. BTW, I checked our prod servers: for example, for 4M metrics I see that go-carbon consumes 20-30GB RAM. Why do you use such huge 'max-metrics-globbed' and 'max-metrics-rendered'? If I'm reading the SVG right, half of your data is glob cache. We're perfectly fine using
max-metrics-rendered = 10001
max-metrics-globbed = 90000
Defaults are less strict, but your numbers are unusually high.
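For context, these two limits live in the [carbonserver] section of go-carbon.conf. A minimal, illustrative sketch using the values quoted above (the listen address and enabled flag are assumptions, not part of the comment):

```toml
# Illustrative [carbonserver] excerpt; only the two limit values come from
# the discussion above, the other fields are assumed typical settings.
[carbonserver]
listen = "127.0.0.1:8080"
enabled = true
# Cap on how many metric names a single find/glob query may expand to.
max-metrics-globbed = 90000
# Cap on how many metrics a single render request may return.
max-metrics-rendered = 10001
```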
@deniszh

> go-carbon consumes 20-30GB RAM

How do you see that? There are various ways to measure a service's/process's memory occupation.

> Why do you use such huge 'max-metrics-globbed' and 'max-metrics-rendered'?

I think when we set up the node, the Grafana users complained that they lacked data or metrics in the results, and raising these values seemed to resolve it. However, we just set a very high value without gradual try-and-see cycles. Having said that, if that's the cause of the high memory usage, then why does the usage rise steadily instead of fluctuating over time? That's why we thought there might be leaks.
I'll get heap profiles at two more points in time between the service's start and its "end" (i.e. somewhat before the OOM). I've read that pprof is capable of comparing profiles.
Hi @interfan7,
have you tried increasing the config attributes max-cpu and workers? If processing can't keep up with the query rate and load, memory consumption could increase.
I presume that your prod machine has more than 4 vCores to handle 128GB of RAM.
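As a reference, a minimal sketch of where those two attributes sit in go-carbon.conf, assuming the layout of the example config (the values shown are placeholders, not recommendations):

```toml
# Illustrative excerpt: max-cpu is under [common], the whisper writer
# worker count is under [whisper]; tune both to the host's core count.
[common]
# Upper bound on CPU cores the Go runtime may use.
max-cpu = 8

[whisper]
# Number of workers writing datapoints to whisper files.
workers = 8
```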
@flucrezia The cores actually seem relaxed, so I hadn't thought they could be an issue.
I decreased the two params mentioned above about 2 days ago and I want to see whether the memory grows to 100GB+ again.
If I conclude that reducing those params doesn't resolve the issue, at least not on a 128GB machine, then I may try your suggestion 🙏🏻
Describe the bug
When the service is killed by the OS due to OOM, systemd automatically starts it again. Then the memory consumption on the machine steadily increases for 8-9 days until the next OOM.
Logs
I've not noticed anything particularly unusual in the logs. The OOM message appears in the system logs (dmesg etc.). I'm happy to provide specific greps/messages; otherwise the log is huge.
Go-carbon Configuration:
go-carbon.conf:
storage-schemas.conf:
storage-aggregation.conf files:
I wonder whether the fields max-size, max-metrics-globbed or max-metrics-rendered have to do with the issue.
Additional context
The carbonapi service also runs on the same server. We have an identical dev server, but its carbonapi is almost never queried. Interestingly, we don't have this issue on the dev server, which suggests the issue has to do with queries. Here is the memory usage graph for prod (left) and dev (right), side by side, for a period of 22 days:
In addition, the systemd status also indicates a considerable difference, although the prod service has been active for only about 1.5 days.
Dev:
Prod:
Although that makes sense, since there are almost no queries on the dev server.