Open bhalevy opened 1 year ago
I'm fairly sure scylla-manager-agent is supposed to run in scylla-helper.slice, which should limit its memory usage to 5%, see https://github.com/scylladb/scylla-manager/blob/5420c4d76bc0f40297693ba43aae920472fe35e5/dist/systemd/scylla-manager-agent.service#L20 and https://github.com/scylladb/scylladb/blob/7d35cf8657ab593eebee31c5ae571c43d4a413a0/dist/common/systemd/scylla-helper.slice#L17
But I see in the systemd docs that swap has separate limit knobs. Perhaps swap usage doesn't fall under MemoryMax, but only under MemorySwapMax?
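With cgroup v2, MemoryMax only bounds resident memory; swap consumed by the unit is governed separately by MemorySwapMax. If that is the issue, a drop-in like the following sketch would cap swap as well (the file path and the 512M value are illustrative assumptions, not what the packages ship):

```sh
# Sketch of a drop-in that also caps swap for the helper slice.
# The path and the 512M value are assumptions, not shipped defaults.
sudo mkdir -p /etc/systemd/system/scylla-helper.slice.d
cat <<'EOF' | sudo tee /etc/systemd/system/scylla-helper.slice.d/swap.conf
[Slice]
# MemoryMax bounds RAM for the cgroup; swap is limited separately by MemorySwapMax.
MemorySwapMax=512M
EOF
sudo systemctl daemon-reload
```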
Scylla version (or git commit hash): Scylla Enterprise 2022.2.6
Agent is running in scylla-helper.slice. The MemoryMax parameter is respected, see https://github.com/scylladb/scylla-manager/issues/3298#issuecomment-1438637567. The issue above shows the situation where the manager agent hit the limit and was killed.
Why not set MemorySwapMax, as @michoecho suggested? See https://www.freedesktop.org/software/systemd/man/systemd.resource-control.html#MemorySwapMax=bytes
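To verify what the slice actually enforces and how much the agent's cgroup currently uses, something along these lines should do (standard systemctl properties; cgroup v2 assumed for swap accounting):

```sh
# Show the limits in effect on the slice and the agent's current memory footprint.
systemctl show scylla-helper.slice -p MemoryMax -p MemorySwapMax -p MemoryCurrent
systemctl status scylla-manager-agent.service --no-pager
```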
Maybe as a workaround, but why would the manager agent need over 10GB of memory?
In any case, even after disabling the agent, the node in question kept being hit by oom-kill, apparently due to having too many files; see https://github.com/scylladb/scylla-enterprise/issues/3124#issuecomment-1620045944
@bhalevy, are there agent logs? I see empty files in 13336.zip
I didn't see any agent logs. Maybe @d-helios can help find them
We have logs only for the last ~2-3 days; we will not be able to get logs that are 2 weeks old.
It would still be interesting to see the current footprint of the manager agent and, if it is excessive, also get the respective logs, even if it doesn't trigger the oom-killer at the moment.
Logs for the last 2 days: result.tgz
10.31.188.5/dmesg.txt:kern :warn : [Fri Jun 23 17:25:47 2023] google_guest_ag invoked oom-killer: gfp_mask=0x1100cca(GFP_HIGHUSER_MOVABLE), order=0, oom_score_adj=-999
Yes, it shows that scylla-manager-agent definitely blew up.
```
kern :info : [Fri Jun 23 17:25:47 2023] [ pid ] uid tgid total_vm rss pgtables_bytes swapents oom_score_adj name
kern :info : [Fri Jun 23 17:25:47 2023] [ 353] 0 353 70052 4501 86016 0 -1000 multipathd
kern :info : [Fri Jun 23 17:25:47 2023] [ 537] 0 537 60260 679 110592 291 0 accounts-daemon
kern :info : [Fri Jun 23 17:25:47 2023] [ 547] 103 547 1910 652 49152 128 -900 dbus-daemon
kern :info : [Fri Jun 23 17:25:47 2023] [ 557] 0 557 7480 790 90112 2020 0 networkd-dispat
kern :info : [Fri Jun 23 17:25:47 2023] [ 581] 114 581 179665 2115 184320 648 0 node_exporter
kern :info : [Fri Jun 23 17:25:47 2023] [ 619] 0 619 951 459 45056 50 0 atd
kern :info : [Fri Jun 23 17:25:47 2023] [ 690] 0 690 59107 681 98304 288 0 polkitd
kern :info : [Fri Jun 23 17:25:47 2023] [ 820] 0 820 1840 385 49152 35 0 agetty
kern :info : [Fri Jun 23 17:25:47 2023] [ 835] 0 835 1459 373 49152 33 0 agetty
kern :info : [Fri Jun 23 17:25:47 2023] [ 839] 113 839 3256 483 45056 57 0 chronyd
kern :info : [Fri Jun 23 17:25:47 2023] [ 840] 113 840 1174 0 45056 45 0 chronyd
kern :info : [Fri Jun 23 17:25:47 2023] [ 11637] 0 11637 540423 5303 1376256 5278 0 journalbeat
kern :info : [Fri Jun 23 17:25:47 2023] [ 835809] 0 835809 2216 484 61440 234 -1000 systemd-udevd
kern :info : [Fri Jun 23 17:25:47 2023] [ 835916] 107 835916 2438 29 53248 52 0 uuidd
kern :info : [Fri Jun 23 17:25:47 2023] [ 836231] 0 836231 347268 1057 221184 728 0 google_osconfig
kern :info : [Fri Jun 23 17:25:47 2023] [ 836289] 100 836289 6820 586 81920 220 0 systemd-network
kern :info : [Fri Jun 23 17:25:47 2023] [ 836291] 0 836291 305950 895 241664 793 -999 google_guest_ag
kern :info : [Fri Jun 23 17:25:47 2023] [ 836421] 0 836421 4346 605 77824 210 0 systemd-logind
kern :info : [Fri Jun 23 17:25:47 2023] [ 836424] 0 836424 2137 492 53248 72 0 cron
kern :info : [Fri Jun 23 17:25:47 2023] [ 836434] 101 836434 6106 679 86016 939 0 systemd-resolve
kern :info : [Fri Jun 23 17:25:47 2023] [ 836437] 0 836437 119713 685 921600 182 -250 systemd-journal
kern :info : [Fri Jun 23 17:25:47 2023] [ 843465] 0 843465 3046 543 61440 238 -1000 sshd
kern :info : [Fri Jun 23 17:25:47 2023] [ 848130] 104 848130 90942 1844 282624 23252 0 rsyslogd
kern :info : [Fri Jun 23 17:25:47 2023] [ 863855] 114 863855 4295068671 7660647 61607936 0 -950 scylla
kern :info : [Fri Jun 23 17:25:47 2023] [ 863999] 114 863999 701465 900 454656 24516 0 scylla-jmx
kern :info : [Fri Jun 23 17:25:47 2023] [ 897161] 114 897161 3053345 192644 23207936 2678762 0 scylla-manager-
kern :info : [Fri Jun 23 17:25:47 2023] oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=/,mems_allowed=0,global_oom,task_memcg=/scylla.slice/scylla-helper.slice/scylla-manager-agent.service,task=scylla-manager-,pid=897161,uid=114
```
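For context, the rss and swapents columns above are counted in pages (4 KiB on x86_64), so the scylla-manager- line (pid 897161) works out to roughly:

```sh
# Convert the agent's oom report columns from 4 KiB pages to MiB.
echo "rss:      $(( 192644  * 4 / 1024 )) MiB"   # ~752 MiB resident
echo "swapents: $(( 2678762 * 4 / 1024 )) MiB"   # ≈10463 MiB, i.e. ~10.2 GiB in swap
```

In other words, most of the agent's footprint was sitting in swap rather than RAM at the time of the kill.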
@d-helios Is there a chance to get access to prometheus to query metrics from Fri Jun 23 17:25:47 2023, +/- a few hours? The manager agent emits RClone metrics showing how many bytes are transferred, plus there is a standard set of metrics always available:
@karol-kokoszka all metrics are available in prometheus. You can open https://backoffice.prd.dbaas.scyop.net/, choose the required cluster, and get all these results.
Ok, it's scylla-cloud and cluster 13336. Will check it.
Unfortunately this is a cloud environment and we don't collect metrics for scylla-manager-agents, due to this bug: https://github.com/scylladb/scylla-monitoring/issues/1992
Hard to say what's going on without it, as logs are limited.
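Since the agent exposes a Prometheus endpoint itself, the standard process/Go gauges could still be sampled directly on a node even without the monitoring stack collecting them; a rough sketch (localhost:5090 is a placeholder, take the actual prometheus address from the agent's configuration file):

```sh
# Sketch: scrape the agent's metrics endpoint and pick out the standard
# process/Go memory gauges (localhost:5090 is a placeholder address).
curl -s http://localhost:5090/metrics \
  | grep -E '^(process_resident_memory_bytes|go_memstats_(alloc|sys)_bytes)'
```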
Hello @karol-kokoszka, please see https://github.com/scylladb/scylla-enterprise/issues/3121#issuecomment-1646031459
We saw that in https://github.com/scylladb/scylla-enterprise/issues/3121 when the oom-killer was invoked on some of the cluster nodes.
In the 2 cases for which I have the logs, scylla-manager-agent is consuming almost the whole of the 10GB swap space.
server-60788/dmesg-logs.txt:
server-60792/dmesg-logs.txt:
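A quick way to pull the agent's lines back out of those dmesg captures (a sketch, plain grep over the files listed above):

```sh
# Grep the attached dmesg logs for the agent's oom-killer entries.
grep -H 'scylla-manager-' server-60788/dmesg-logs.txt server-60792/dmesg-logs.txt
```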