hakavlad / nohang

A sophisticated low memory handler for Linux
MIT License
1.04k stars 48 forks source link

Nohang with post_kill_exe #132

Open bearbear781 opened 1 year ago

bearbear781 commented 1 year ago

I have configured post_kill_exe in nohang.conf to execute certain actions after killing a process. However, I've noticed that post_kill_exe is not triggered consistently for every kill event. It seems that it is triggered when the message "The victim became a zombie in 0.01s" is displayed, but not when "The victim doesn't respond on corrective action in 0.031s" is displayed.

bearbear781 commented 1 year ago

below is my nohang.conf

nohang -c /etc/nohang/nohang.conf --check Starting nohang with config /etc/nohang/nohang.conf

  1. Check kernel messages for OOM events @check_kmsg: @debug_kmsg:

  2. Common zram settings zram_checking_enabled: False

  3. Common PSI settings psi_checking_enabled: False psi_path: /proc/pressure/memory psi_metrics: full_avg10 psi_excess_duration: 30.0 sec psi_post_action_delay: 15.0 sec

  4. Poll rate fill_rate_mem: 6000.0 fill_rate_swap: 2000.0 fill_rate_zram: 4000.0 max_sleep: 3.0 sec min_sleep: 0.1 sec

  5. Warnings and notifications post_action_gui_notifications: False hide_corrective_action_type: False low_memory_warnings_enabled: True warning_exe:
    warning_threshold_min_mem: 1536 MiB, 20.0 % warning_threshold_min_swap: 25 % warning_threshold_max_zram: 3456 MiB, 45.0 % warning_threshold_max_psi: 10.0 min_post_warning_delay: 60.0 sec env_cache_time: 300.0

  6. Soft threshold soft_threshold_min_mem: 384 MiB, 5.0 % soft_threshold_min_swap: 10 % soft_threshold_max_zram: 4224 MiB, 55.0 % soft_threshold_max_psi: 40.0

  7. Hard threshold hard_threshold_min_mem: 154 MiB, 2.0 % hard_threshold_min_swap: 4 % hard_threshold_max_zram: 4608 MiB, 60.0 % hard_threshold_max_psi: 90.0

  8. Customize victim selection: adjusting badness of processes

7.1. Ignore positive oom_score_adj ignore_positive_oom_score_adj: False

7.2. Adjusting badness of processes by matching with regular expressions 7.2.1. Matching process names with RE patterns (not set) 7.2.2. Matching CGroup_v1-line with RE patterns (not set) 7.2.3. Matching CGroup_v2-line with RE patterns (not set) 7.2.4. Matching eUIDs with RE patterns (not set) 7.2.5. Matching realpath with RE patterns (not set) 7.2.6. Matching /proc/[pid]/cwd realpath with RE patterns (not set) 7.2.7. Matching cmdlines with RE patterns (not set) 7.2.8. Matching environ with RE patterns (not set)

  1. Customize soft corrective actions (not set)

  2. Misc max_soft_exit_time: 10.0 sec post_kill_exe: (my_action) min_badness: 1 post_soft_action_delay: 3.0 sec post_zombie_delay: 0.1 sec victim_cache_time: 10.0 sec exe_timeout: 20.0 sec

  3. Verbosity print_config_at_startup: False print_mem_check_results: True min_mem_report_interval: 60.0 sec print_proc_table: False extra_table_info: None print_victim_status: True print_victim_cmdline: False max_victim_ancestry_depth: 3 print_statistics: True debug_gui_notifications: False debug_psi: False debug_sleep: False debug_threading: False separate_log: True

hakavlad commented 1 year ago

The post_kill_exe command starts only after the hard corrective action.

Sorry it's not mentioned in the documentation.

You can configure part 8: Customize soft corrective actions to CHANGE soft corrective action: a command will be executed instead of sending the SIGTERM signal.

For example:

@SOFT_ACTION_RE_NAME  /// your_command...
bearbear781 commented 1 year ago

I don't need to change the soft correction action, my post_kill_exe is to execute curl to send a notification to telegram after hard correction

bearbear781 commented 1 year ago

Here is the log

2023-08-10 16:53:26,460: >>=== STARTING implement_corrective_action() ====>> 2023-08-10 16:53:26,460: Memory status that requires corrective actions: MemAvailable [383 MiB, 5.0 %] <= soft_threshold_min_mem [384 MiB, 5.0 %] SwapFree [104 MiB, 5.1 %] <= soft_threshold_min_swap [205 MiB, 10.0 %] 2023-08-10 16:53:26,462: Found 9 tasks with non-zero oom_score (except init and self) in 2ms 2023-08-10 16:53:26,463: TOP-9 tasks by badness: 2023-08-10 16:53:26,463: Name PID badness 2023-08-10 16:53:26,463: --------------- ------- ------- 2023-08-10 16:53:26,463: mysqld 903309 849 2023-08-10 16:53:26,463: monit 7469 17 2023-08-10 16:53:26,463: vmagent 329736 5 2023-08-10 16:53:26,463: metricbeat 7217 4 2023-08-10 16:53:26,463: pmm-agent 329710 3 2023-08-10 16:53:26,463: ssm-agent-worke 3352 1 2023-08-10 16:53:26,463: wazuh-modulesd 51276 1 2023-08-10 16:53:26,463: mysqld_exporter 329735 1 2023-08-10 16:53:26,463: node_exporter 329737 1 2023-08-10 16:53:26,463: TOP printed in 1ms; process with highest badness: PID: 903309, name: mysqld, badness: 849 2023-08-10 16:53:26,463: Recheck memory levels... 2023-08-10 16:53:26,464: Memory status that requires corrective actions: MemAvailable [383 MiB, 5.0 %] <= soft_threshold_min_mem [384 MiB, 5.0 %] SwapFree [104 MiB, 5.1 %] <= soft_threshold_min_swap [205 MiB, 10.0 %] 2023-08-10 16:53:26,464: Victim status (found in 0ms): PID: 903309, name: mysqld, state: S (sleeping), EUID: 27, SID: 902753 (), lifetime: 4d 3h 32min 2s badness: 849, oom_score: 849, oom_score_adj: 0 Vm, MiB: Size: 10762, RSS: 6507 (Anon: 6507, File: 0, Shmem: 0), Swap: 1736 cgroup_v1: /system.slice/mysqld.service cgroup_v2: ancestry: PID 902781 (mysqld_safe) <= PID 1 (systemd) exe realpath: /usr/sbin/mysqld cwd realpath: /var/lib/mysql 2023-08-10 16:53:26,464: Memory info, MiB: 2023-08-10 16:53:26,464: total=7680, used=7029, free=118, available=383, shared=0, buffers=0, cache=534, 2023-08-10 16:53:26,464: swap_total=2048, swap_used=1944, swapfree=104 2023-08-10 16:53:26,464: Implementing a corrective action: Sending SIGTERM to the victim 2023-08-10 16:53:26,464: OK; total response time: 4ms 2023-08-10 16:53:26,464: What happened in the last 14d 10h 11min 19s: [ OK ] Sending SIGTERM to mysqld: 7 [ OK ] Sending SIGKILL to mysqld: 1 2023-08-10 16:53:26,495: The victim doesn't respond on corrective action in 0.031s_ 2023-08-10 16:53:26,495: Memory status after implementing a corrective action: MemAvailable: 382.8 MiB, SwapFree: 104.5 MiB 2023-08-10 16:53:26,595: <<=== FINISHING implement_corrective_action() ===<<

bearbear781 commented 1 year ago

Here is the log

2023-08-10 16:53:26,460: >>=== STARTING implement_corrective_action() ====>> 2023-08-10 16:53:26,460: Memory status that requires corrective actions: MemAvailable [383 MiB, 5.0 %] <= soft_threshold_min_mem [384 MiB, 5.0 %] SwapFree [104 MiB, 5.1 %] <= soft_threshold_min_swap [205 MiB, 10.0 %] 2023-08-10 16:53:26,462: Found 9 tasks with non-zero oom_score (except init and self) in 2ms 2023-08-10 16:53:26,463: TOP-9 tasks by badness: 2023-08-10 16:53:26,463: Name PID badness 2023-08-10 16:53:26,463: --------------- ------- ------- 2023-08-10 16:53:26,463: mysqld 903309 849 2023-08-10 16:53:26,463: monit 7469 17 2023-08-10 16:53:26,463: vmagent 329736 5 2023-08-10 16:53:26,463: metricbeat 7217 4 2023-08-10 16:53:26,463: pmm-agent 329710 3 2023-08-10 16:53:26,463: ssm-agent-worke 3352 1 2023-08-10 16:53:26,463: wazuh-modulesd 51276 1 2023-08-10 16:53:26,463: mysqld_exporter 329735 1 2023-08-10 16:53:26,463: node_exporter 329737 1 2023-08-10 16:53:26,463: TOP printed in 1ms; process with highest badness: PID: 903309, name: mysqld, badness: 849 2023-08-10 16:53:26,463: Recheck memory levels... 2023-08-10 16:53:26,464: Memory status that requires corrective actions: MemAvailable [383 MiB, 5.0 %] <= soft_threshold_min_mem [384 MiB, 5.0 %] SwapFree [104 MiB, 5.1 %] <= soft_threshold_min_swap [205 MiB, 10.0 %] 2023-08-10 16:53:26,464: Victim status (found in 0ms): PID: 903309, name: mysqld, state: S (sleeping), EUID: 27, SID: 902753 (), lifetime: 4d 3h 32min 2s badness: 849, oom_score: 849, oom_score_adj: 0 Vm, MiB: Size: 10762, RSS: 6507 (Anon: 6507, File: 0, Shmem: 0), Swap: 1736 cgroup_v1: /system.slice/mysqld.service cgroup_v2: ancestry: PID 902781 (mysqld_safe) <= PID 1 (systemd) exe realpath: /usr/sbin/mysqld cwd realpath: /var/lib/mysql 2023-08-10 16:53:26,464: Memory info, MiB: 2023-08-10 16:53:26,464: total=7680, used=7029, free=118, available=383, shared=0, buffers=0, cache=534, 2023-08-10 16:53:26,464: swap_total=2048, swap_used=1944, swapfree=104 2023-08-10 16:53:26,464: Implementing a corrective action: Sending SIGTERM to the victim 2023-08-10 16:53:26,464: OK; total response time: 4ms 2023-08-10 16:53:26,464: What happened in the last 14d 10h 11min 19s: [ OK ] Sending SIGTERM to mysqld: 7 [ OK ] Sending SIGKILL to mysqld: 1 2023-08-10 16:53:26,495: The victim doesn't respond on corrective action in 0.031s_ 2023-08-10 16:53:26,495: Memory status after implementing a corrective action: MemAvailable: 382.8 MiB, SwapFree: 104.5 MiB 2023-08-10 16:53:26,595: <<=== FINISHING implement_corrective_action() ===<<

I compared and found that when post_kill_exe is executed normally, there will be this line in the log, but the log I provided does not have this line, so post_kill_exe is not executed max_soft_exit_time (value=10.0s) is exceeded for the victim, hard corrective action will be applied

hakavlad commented 1 year ago

does not have this line, because it is soft corrective action:

MemAvailable [383 MiB, 5.0 %] <= soft_threshold_min_mem [384 MiB, 5.0 %] Sending SIGTERM to the victim

hakavlad commented 1 year ago

The victim doesn't respond on corrective action in 0.031s

The process takes some time to complete. The hard threshold was not exceeded (in this log).

hakavlad commented 1 year ago

Could you please provide logs with HARD corrective action and post_kill_exe was not executed?

bearbear781 commented 1 year ago

Is there a trigger like post_kill_exe that can be used in soft corrective action?

hakavlad commented 1 year ago

trigger like post_kill_exe that can be used in soft corrective action

@SOFT_ACTION_RE_NAME  /// sh -c "kill -TERM $PID && your_command..."
bearbear781 commented 1 year ago

trigger like post_kill_exe that can be used in soft corrective action

@SOFT_ACTION_RE_NAME  /// sh -c "kill -TERM $PID && your_command..."

I will try for it, any problem will update again, thank you

bearbear781 commented 1 year ago

trigger like post_kill_exe that can be used in soft corrective action

@SOFT_ACTION_RE_NAME  /// sh -c "kill -TERM $PID && your_command..."

Can this be placed at the end of the config file? Or does it have to be in part 8?

hakavlad commented 1 year ago

Can be placed anywhere in the configuration file.