Open bearbear781 opened 1 year ago
below is my nohang.conf
nohang -c /etc/nohang/nohang.conf --check Starting nohang with config /etc/nohang/nohang.conf
Check kernel messages for OOM events
@check_kmsg:
Common zram settings zram_checking_enabled: False
Common PSI settings psi_checking_enabled: False psi_path: /proc/pressure/memory psi_metrics: full_avg10 psi_excess_duration: 30.0 sec psi_post_action_delay: 15.0 sec
Poll rate fill_rate_mem: 6000.0 fill_rate_swap: 2000.0 fill_rate_zram: 4000.0 max_sleep: 3.0 sec min_sleep: 0.1 sec
Warnings and notifications
post_action_gui_notifications: False
hide_corrective_action_type: False
low_memory_warnings_enabled: True
warning_exe:
warning_threshold_min_mem: 1536 MiB, 20.0 %
warning_threshold_min_swap: 25 %
warning_threshold_max_zram: 3456 MiB, 45.0 %
warning_threshold_max_psi: 10.0
min_post_warning_delay: 60.0 sec
env_cache_time: 300.0
Soft threshold soft_threshold_min_mem: 384 MiB, 5.0 % soft_threshold_min_swap: 10 % soft_threshold_max_zram: 4224 MiB, 55.0 % soft_threshold_max_psi: 40.0
Hard threshold hard_threshold_min_mem: 154 MiB, 2.0 % hard_threshold_min_swap: 4 % hard_threshold_max_zram: 4608 MiB, 60.0 % hard_threshold_max_psi: 90.0
Customize victim selection: adjusting badness of processes
7.1. Ignore positive oom_score_adj ignore_positive_oom_score_adj: False
7.2. Adjusting badness of processes by matching with regular expressions 7.2.1. Matching process names with RE patterns (not set) 7.2.2. Matching CGroup_v1-line with RE patterns (not set) 7.2.3. Matching CGroup_v2-line with RE patterns (not set) 7.2.4. Matching eUIDs with RE patterns (not set) 7.2.5. Matching realpath with RE patterns (not set) 7.2.6. Matching /proc/[pid]/cwd realpath with RE patterns (not set) 7.2.7. Matching cmdlines with RE patterns (not set) 7.2.8. Matching environ with RE patterns (not set)
Customize soft corrective actions (not set)
Misc max_soft_exit_time: 10.0 sec post_kill_exe: (my_action) min_badness: 1 post_soft_action_delay: 3.0 sec post_zombie_delay: 0.1 sec victim_cache_time: 10.0 sec exe_timeout: 20.0 sec
Verbosity print_config_at_startup: False print_mem_check_results: True min_mem_report_interval: 60.0 sec print_proc_table: False extra_table_info: None print_victim_status: True print_victim_cmdline: False max_victim_ancestry_depth: 3 print_statistics: True debug_gui_notifications: False debug_psi: False debug_sleep: False debug_threading: False separate_log: True
The post_kill_exe
command starts only after the hard corrective action.
Sorry it's not mentioned in the documentation.
You can configure part 8: Customize soft corrective actions
to CHANGE soft corrective action: a command will be executed instead of sending the SIGTERM
signal.
For example:
@SOFT_ACTION_RE_NAME /// your_command...
I don't need to change the soft correction action, my post_kill_exe is to execute curl to send a notification to telegram after hard correction
Here is the log
2023-08-10 16:53:26,460: >>=== STARTING implement_corrective_action() ====>> 2023-08-10 16:53:26,460: Memory status that requires corrective actions: MemAvailable [383 MiB, 5.0 %] <= soft_threshold_min_mem [384 MiB, 5.0 %] SwapFree [104 MiB, 5.1 %] <= soft_threshold_min_swap [205 MiB, 10.0 %] 2023-08-10 16:53:26,462: Found 9 tasks with non-zero oom_score (except init and self) in 2ms 2023-08-10 16:53:26,463: TOP-9 tasks by badness: 2023-08-10 16:53:26,463: Name PID badness 2023-08-10 16:53:26,463: --------------- ------- ------- 2023-08-10 16:53:26,463: mysqld 903309 849 2023-08-10 16:53:26,463: monit 7469 17 2023-08-10 16:53:26,463: vmagent 329736 5 2023-08-10 16:53:26,463: metricbeat 7217 4 2023-08-10 16:53:26,463: pmm-agent 329710 3 2023-08-10 16:53:26,463: ssm-agent-worke 3352 1 2023-08-10 16:53:26,463: wazuh-modulesd 51276 1 2023-08-10 16:53:26,463: mysqld_exporter 329735 1 2023-08-10 16:53:26,463: node_exporter 329737 1 2023-08-10 16:53:26,463: TOP printed in 1ms; process with highest badness: PID: 903309, name: mysqld, badness: 849 2023-08-10 16:53:26,463: Recheck memory levels... 2023-08-10 16:53:26,464: Memory status that requires corrective actions: MemAvailable [383 MiB, 5.0 %] <= soft_threshold_min_mem [384 MiB, 5.0 %] SwapFree [104 MiB, 5.1 %] <= soft_threshold_min_swap [205 MiB, 10.0 %] 2023-08-10 16:53:26,464: Victim status (found in 0ms): PID: 903309, name: mysqld, state: S (sleeping), EUID: 27, SID: 902753 (), lifetime: 4d 3h 32min 2s badness: 849, oom_score: 849, oom_score_adj: 0 Vm, MiB: Size: 10762, RSS: 6507 (Anon: 6507, File: 0, Shmem: 0), Swap: 1736 cgroup_v1: /system.slice/mysqld.service cgroup_v2: ancestry: PID 902781 (mysqld_safe) <= PID 1 (systemd) exe realpath: /usr/sbin/mysqld cwd realpath: /var/lib/mysql 2023-08-10 16:53:26,464: Memory info, MiB: 2023-08-10 16:53:26,464: total=7680, used=7029, free=118, available=383, shared=0, buffers=0, cache=534, 2023-08-10 16:53:26,464: swap_total=2048, swap_used=1944, swapfree=104 2023-08-10 16:53:26,464: Implementing a corrective action: Sending SIGTERM to the victim 2023-08-10 16:53:26,464: OK; total response time: 4ms 2023-08-10 16:53:26,464: What happened in the last 14d 10h 11min 19s: [ OK ] Sending SIGTERM to mysqld: 7 [ OK ] Sending SIGKILL to mysqld: 1 2023-08-10 16:53:26,495: The victim doesn't respond on corrective action in 0.031s_ 2023-08-10 16:53:26,495: Memory status after implementing a corrective action: MemAvailable: 382.8 MiB, SwapFree: 104.5 MiB 2023-08-10 16:53:26,595: <<=== FINISHING implement_corrective_action() ===<<
Here is the log
2023-08-10 16:53:26,460: >>=== STARTING implement_corrective_action() ====>> 2023-08-10 16:53:26,460: Memory status that requires corrective actions: MemAvailable [383 MiB, 5.0 %] <= soft_threshold_min_mem [384 MiB, 5.0 %] SwapFree [104 MiB, 5.1 %] <= soft_threshold_min_swap [205 MiB, 10.0 %] 2023-08-10 16:53:26,462: Found 9 tasks with non-zero oom_score (except init and self) in 2ms 2023-08-10 16:53:26,463: TOP-9 tasks by badness: 2023-08-10 16:53:26,463: Name PID badness 2023-08-10 16:53:26,463: --------------- ------- ------- 2023-08-10 16:53:26,463: mysqld 903309 849 2023-08-10 16:53:26,463: monit 7469 17 2023-08-10 16:53:26,463: vmagent 329736 5 2023-08-10 16:53:26,463: metricbeat 7217 4 2023-08-10 16:53:26,463: pmm-agent 329710 3 2023-08-10 16:53:26,463: ssm-agent-worke 3352 1 2023-08-10 16:53:26,463: wazuh-modulesd 51276 1 2023-08-10 16:53:26,463: mysqld_exporter 329735 1 2023-08-10 16:53:26,463: node_exporter 329737 1 2023-08-10 16:53:26,463: TOP printed in 1ms; process with highest badness: PID: 903309, name: mysqld, badness: 849 2023-08-10 16:53:26,463: Recheck memory levels... 2023-08-10 16:53:26,464: Memory status that requires corrective actions: MemAvailable [383 MiB, 5.0 %] <= soft_threshold_min_mem [384 MiB, 5.0 %] SwapFree [104 MiB, 5.1 %] <= soft_threshold_min_swap [205 MiB, 10.0 %] 2023-08-10 16:53:26,464: Victim status (found in 0ms): PID: 903309, name: mysqld, state: S (sleeping), EUID: 27, SID: 902753 (), lifetime: 4d 3h 32min 2s badness: 849, oom_score: 849, oom_score_adj: 0 Vm, MiB: Size: 10762, RSS: 6507 (Anon: 6507, File: 0, Shmem: 0), Swap: 1736 cgroup_v1: /system.slice/mysqld.service cgroup_v2: ancestry: PID 902781 (mysqld_safe) <= PID 1 (systemd) exe realpath: /usr/sbin/mysqld cwd realpath: /var/lib/mysql 2023-08-10 16:53:26,464: Memory info, MiB: 2023-08-10 16:53:26,464: total=7680, used=7029, free=118, available=383, shared=0, buffers=0, cache=534, 2023-08-10 16:53:26,464: swap_total=2048, swap_used=1944, swapfree=104 2023-08-10 16:53:26,464: Implementing a corrective action: Sending SIGTERM to the victim 2023-08-10 16:53:26,464: OK; total response time: 4ms 2023-08-10 16:53:26,464: What happened in the last 14d 10h 11min 19s: [ OK ] Sending SIGTERM to mysqld: 7 [ OK ] Sending SIGKILL to mysqld: 1 2023-08-10 16:53:26,495: The victim doesn't respond on corrective action in 0.031s_ 2023-08-10 16:53:26,495: Memory status after implementing a corrective action: MemAvailable: 382.8 MiB, SwapFree: 104.5 MiB 2023-08-10 16:53:26,595: <<=== FINISHING implement_corrective_action() ===<<
I compared and found that when post_kill_exe is executed normally, there will be this line in the log, but the log I provided does not have this line, so post_kill_exe is not executed max_soft_exit_time (value=10.0s) is exceeded for the victim, hard corrective action will be applied
does not have this line, because it is soft corrective action:
MemAvailable [383 MiB, 5.0 %] <= soft_threshold_min_mem [384 MiB, 5.0 %] Sending SIGTERM to the victim
The victim doesn't respond on corrective action in 0.031s
The process takes some time to complete. The hard threshold was not exceeded (in this log).
Could you please provide logs with HARD corrective action and post_kill_exe
was not executed?
Is there a trigger like post_kill_exe that can be used in soft corrective action?
trigger like post_kill_exe that can be used in soft corrective action
@SOFT_ACTION_RE_NAME /// sh -c "kill -TERM $PID && your_command..."
trigger like post_kill_exe that can be used in soft corrective action
@SOFT_ACTION_RE_NAME /// sh -c "kill -TERM $PID && your_command..."
I will try for it, any problem will update again, thank you
trigger like post_kill_exe that can be used in soft corrective action
@SOFT_ACTION_RE_NAME /// sh -c "kill -TERM $PID && your_command..."
Can this be placed at the end of the config file? Or does it have to be in part 8?
Can be placed anywhere in the configuration file.
I have configured post_kill_exe in nohang.conf to execute certain actions after killing a process. However, I've noticed that post_kill_exe is not triggered consistently for every kill event. It seems that it is triggered when the message "The victim became a zombie in 0.01s" is displayed, but not when "The victim doesn't respond on corrective action in 0.031s" is displayed.