Open liyi-ibm opened 5 years ago
[root@datanode11 /cgroup/cpu/hadoop-yarn]# cat cpu.cfs_period_us
100000
[root@datanode11 /cgroup/cpu/hadoop-yarn]# cat cpu.cfs_quota_us
17600000
[root@datanode11 /cgroup/cpu/hadoop-yarn]#
Message from syslogd@datanode11 at Dec 11 03:10:35 ... kernel:Watchdog CPU:126 Hard LOCKUP
Message from syslogd@datanode11 at Dec 11 03:10:35 ... kernel:Watchdog CPU:126 became unstuck
Message from syslogd@datanode11 at Dec 11 03:10:49 ... kernel:Watchdog CPU:122 Hard LOCKUP
Message from syslogd@datanode11 at Dec 11 03:10:49 ... kernel:Watchdog CPU:122 became unstuck
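For reference, the two cgroup values quoted above can be turned into a CPU ceiling by dividing the quota by the period. The sketch below only does that arithmetic. The helper name `cpu_ceiling` is hypothetical, and treating a negative quota as "no limit" is an assumption based on the usual CFS bandwidth convention.

```python
def cpu_ceiling(quota_us: int, period_us: int) -> float:
    """Return how many CPUs' worth of runtime a cgroup may use per period.

    Hypothetical helper for illustration only. A negative quota is assumed
    to mean "no limit", following the usual cpu.cfs_quota_us convention.
    """
    if period_us <= 0:
        raise ValueError("period_us must be positive")
    if quota_us < 0:
        return float("inf")
    return quota_us / period_us


# Values taken from the hadoop-yarn cgroup shown in the report.
print(cpu_ceiling(17_600_000, 100_000))  # 176.0
```

With the reported settings, the hadoop-yarn group is capped at 176 CPUs' worth of runtime in every 100 ms period.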
The dmesg output shows a memory-related lockup, so you need the patched kernel. A CPU-related lockup will have `unthrottle_cfs_rq` and other cfs functions in the backtrace; a memory-related lockup will have `alloc_pages` and `mem_cgroup` in the backtrace. They are different scenarios. You need to mitigate the memory cgroup lockup with the patched kernel before running a heavy workload, and then use my cpu.cfs_xx settings to avoid the CPU scheduler (cfs) related lockup.
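The rule of thumb above can be written as a small triage helper. This is a rough sketch, not a diagnostic tool: the function name `classify_lockup` is hypothetical, the `_cfs_` substring is an assumed stand-in for "other cfs functions", and the sample symbol lists are illustrative input, not taken from the reporter's dmesg.

```python
# Substrings suggested by the comment above; "_cfs_" is an assumption
# meant to catch other CFS scheduler functions besides unthrottle_cfs_rq.
CFS_MARKERS = ("unthrottle_cfs_rq", "_cfs_")
MEM_MARKERS = ("alloc_pages", "mem_cgroup")


def classify_lockup(backtrace: str) -> str:
    """Guess the lockup type from function names in a backtrace.

    Returns "cpu", "memory", "mixed", or "unknown".
    """
    has_cfs = any(marker in backtrace for marker in CFS_MARKERS)
    has_mem = any(marker in backtrace for marker in MEM_MARKERS)
    if has_cfs and has_mem:
        return "mixed"
    if has_cfs:
        return "cpu"
    if has_mem:
        return "memory"
    return "unknown"


# Illustrative symbol lists only, not real log output.
print(classify_lockup("unthrottle_cfs_rq\ndistribute_cfs_runtime"))
print(classify_lockup("__alloc_pages_nodemask\nmem_cgroup_try_charge"))
```

A "memory" result points at the patched kernel, while a "cpu" result points at the cpu.cfs_xx settings.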
On 4.14.49-3 kernel (without the memory control patch)