kubernetes / node-problem-detector

This is a place for various problem detectors running on the Kubernetes nodes.
Apache License 2.0

after upgrading npd from v0.8.13 to v0.8.15, containers with unready status: [node-problem-detector] #865

Closed pacoxu closed 3 months ago

pacoxu commented 4 months ago

With v0.8.15:

https://github.com/kubernetes/kubernetes/pull/123114#issuecomment-1963319709

F0216 07:10:52.272154       1 custom_plugin_monitor.go:77] Failed to validate custom plugin config {Plugin:custom PluginGlobalConfig:{InvokeIntervalString:0xc0004f9fc0 TimeoutString:0xc0004f9fd0 InvokeInterval:5m0s Timeout:1m0s MaxOutputLength:0xc00056bb90 Concurrency:0xc00056bba0 EnableMessageChangeBasedConditionUpdate:0x2d0a80e SkipInitialStatus:0x2d0a80f} Source:kernel-monitor DefaultConditions:[{Type:FrequentUnregisterNetDevice Status: Transition:0001-01-01 00:00:00 +0000 UTC Reason:NoFrequentUnregisterNetDevice Message:node is functioning properly}] Rules:[0xc0004ded90] EnableMetricsReporting:0xc00056bba8}: rule path "/home/kubernetes/bin/log-counter" does not exist. Rule: &{Type:permanent Condition:FrequentUnregisterNetDevice Reason:UnregisterNetDevice Path:/home/kubernetes/bin/log-counter Args:[--journald-source=kernel --log-path=/var/log/journal --lookback=20m --count=3 --pattern=unregister_netdevice: waiting for \w+ to become free. Usage count = \d+] TimeoutString:0xc0004f9ff0 Timeout:1m0s}
I0216 07:10:53.766460       1 log_monitor.go:78] Finish parsing log monitor config file /config/kernel-monitor.json: {WatcherConfig:{Plugin:kmsg PluginConfig:map[] LogPath:/dev/kmsg Lookback:5m Delay:} BufferSize:10 Source:kernel-monitor DefaultConditions:[{Type:KernelDeadlock Status: Transition:0001-01-01 00:00:00 +0000 UTC Reason:KernelHasNoDeadlock Message:kernel has no deadlock} {Type:ReadonlyFilesystem Status: Transition:0001-01-01 00:00:00 +0000 UTC Reason:FilesystemIsNotReadOnly Message:Filesystem is not read-only}] Rules:[{Type:temporary Condition: Reason:OOMKilling Pattern:Killed process \d+ (.+) total-vm:\d+kB, anon-rss:\d+kB, file-rss:\d+kB.*} {Type:temporary Condition: Reason:TaskHung Pattern:task [\S ]+:\w+ blocked for more than \w+ seconds\.} {Type:temporary Condition: Reason:UnregisterNetDevice Pattern:unregister_netdevice: waiting for \w+ to become free. Usage count = \d+} {Type:temporary Condition: Reason:KernelOops Pattern:BUG: unable to handle kernel NULL pointer dereference at .*} {Type:temporary Condition: Reason:KernelOops Pattern:divide error: 0000 \[#\d+\] SMP} {Type:temporary Condition: Reason:Ext4Error Pattern:EXT4-fs error .*} {Type:temporary Condition: Reason:Ext4Warning Pattern:EXT4-fs warning .*} {Type:temporary Condition: Reason:IOError Pattern:Buffer I/O error .*} {Type:temporary Condition: Reason:MemoryReadError Pattern:CE memory read error .*} {Type:permanent Condition:KernelDeadlock Reason:DockerHung Pattern:task docker:\w+ blocked for more than \w+ seconds\.} {Type:permanent Condition:ReadonlyFilesystem Reason:FilesystemIsReadOnly Pattern:Remounting filesystem read-only}] EnableMetricsReporting:0xc0006e858e}
I0216 07:10:53.766678       1 log_watchers.go:40] Use log watcher of plugin "kmsg"
I0216 07:10:53.767239       1 log_monitor.go:78] Finish parsing log monitor config file /config/systemd-monitor.json: {WatcherConfig:{Plugin:journald PluginConfig:map[source:systemd] LogPath:/var/log/journal Lookback:5m Delay:} BufferSize:10 Source:systemd-monitor DefaultConditions:[] Rules:[{Type:temporary Condition: Reason:KubeletStart Pattern:Started Kubernetes kubelet.} {Type:temporary Condition: Reason:DockerStart Pattern:Starting Docker Application Container Engine...} {Type:temporary Condition: Reason:ContainerdStart Pattern:Starting containerd container runtime...}] EnableMetricsReporting:0xc0006e8cea}
I0216 07:10:53.767297       1 log_watchers.go:40] Use log watcher of plugin "journald"
F0216 07:10:53.770149       1 custom_plugin_monitor.go:77] Failed to validate custom plugin config {Plugin:custom PluginGlobalConfig:{InvokeIntervalString:0xc00035c8f0 TimeoutString:0xc00035c900 InvokeInterval:5m0s Timeout:1m0s MaxOutputLength:0xc00013af60 Concurrency:0xc00013af70 EnableMessageChangeBasedConditionUpdate:0x2d0a80e SkipInitialStatus:0x2d0a80f} Source:kernel-monitor DefaultConditions:[{Type:FrequentUnregisterNetDevice Status: Transition:0001-01-01 00:00:00 +0000 UTC Reason:NoFrequentUnregisterNetDevice Message:node is functioning properly}] Rules:[0xc00039c000] EnableMetricsReporting:0xc00013af78}: rule path "/home/kubernetes/bin/log-counter" does not exist. Rule: &{Type:permanent Condition:FrequentUnregisterNetDevice Reason:UnregisterNetDevice Path:/home/kubernetes/bin/log-counter Args:[--journald-source=kernel --log-path=/var/log/journal --lookback=20m --count=3 --pattern=unregister_netdevice: waiting for \w+ to become free. Usage count = \d+] TimeoutString:0xc00035c920 Timeout:1m0s}
F0216 07:11:07.771971       1 custom_plugin_monitor.go:77] Failed to validate custom plugin config {Plugin:custom PluginGlobalConfig:{InvokeIntervalString:0xc0006ec490 TimeoutString:0xc0006ec4a0 InvokeInterval:5m0s Timeout:1m0s MaxOutputLength:0xc000146cd0 Concurrency:0xc000146ce0 EnableMessageChangeBasedConditionUpdate:0x2d0a80e SkipInitialStatus:0x2d0a80f} Source:kernel-monitor DefaultConditions:[{Type:FrequentUnregisterNetDevice Status: Transition:0001-01-01 00:00:00 +0000 UTC Reason:NoFrequentUnregisterNetDevice Message:node is functioning properly}] Rules:[0xc000418cb0] EnableMetricsReporting:0xc000146ce8}: rule path "/home/kubernetes/bin/log-counter" does not exist. Rule: &{Type:permanent Condition:FrequentUnregisterNetDevice Reason:UnregisterNetDevice Path:/home/kubernetes/bin/log-counter Args:[--journald-source=kernel --log-path=/var/log/journal --lookback=20m --count=3 --pattern=unregister_netdevice: waiting for \w+ to become free. Usage count = \d+] TimeoutString:0xc0006ec4c0 Timeout:1m0s}
F0216 07:11:35.668334       1 custom_plugin_monitor.go:77] Failed to validate custom plugin config {Plugin:custom PluginGlobalConfig:{InvokeIntervalString:0xc0006b2c70 TimeoutString:0xc0006b2c80 InvokeInterval:5m0s Timeout:1m0s MaxOutputLength:0xc0006cc9d0 Concurrency:0xc0006cc9e0 EnableMessageChangeBasedConditionUpdate:0x2d0a80e SkipInitialStatus:0x2d0a80f} Source:kernel-monitor DefaultConditions:[{Type:FrequentUnregisterNetDevice Status: Transition:0001-01-01 00:00:00 +0000 UTC Reason:NoFrequentUnregisterNetDevice Message:node is functioning properly}] Rules:[0xc000418d20] EnableMetricsReporting:0xc0006cc9e8}: rule path "/home/kubernetes/bin/log-counter" does not exist. Rule: &{Type:permanent Condition:FrequentUnregisterNetDevice Reason:UnregisterNetDevice Path:/home/kubernetes/bin/log-counter Args:[--journald-source=kernel --log-path=/var/log/journal --lookback=20m --count=3 --pattern=unregister_netdevice: waiting for \w+ to become free. Usage count = \d+] TimeoutString:0xc0006b2ca0 Timeout:1m0s}
F0216 07:12:26.664969       1 custom_plugin_monitor.go:77] Failed to validate custom plugin config {Plugin:custom PluginGlobalConfig:{InvokeIntervalString:0xc0004ae2a0 TimeoutString:0xc0004ae2b0 InvokeInterval:5m0s Timeout:1m0s MaxOutputLength:0xc000015f50 Concurrency:0xc000015f60 EnableMessageChangeBasedConditionUpdate:0x2d0a80e SkipInitialStatus:0x2d0a80f} Source:kernel-monitor DefaultConditions:[{Type:FrequentUnregisterNetDevice Status: Transition:0001-01-01 00:00:00 +0000 UTC Reason:NoFrequentUnregisterNetDevice Message:node is functioning properly}] Rules:[0xc0000cc5b0] EnableMetricsReporting:0xc000015f68}: rule path "/home/kubernetes/bin/log-counter" does not exist. Rule: &{Type:permanent Condition:FrequentUnregisterNetDevice Reason:UnregisterNetDevice Path:/home/kubernetes/bin/log-counter Args:[--journald-source=kernel --log-path=/var/log/journal --lookback=20m --count=3 --pattern=unregister_netdevice: waiting for \w+ to become free. Usage count = \d+] TimeoutString:0xc0004ae2d0 Timeout:1m0s}
F0216 07:13:57.671249       1 custom_plugin_monitor.go:77] Failed to validate custom plugin config {Plugin:custom PluginGlobalConfig:{InvokeIntervalString:0xc000660240 TimeoutString:0xc000660250 InvokeInterval:5m0s Timeout:1m0s MaxOutputLength:0xc000146510 Concurrency:0xc000146520 EnableMessageChangeBasedConditionUpdate:0x2d0a80e SkipInitialStatus:0x2d0a80f} Source:kernel-monitor DefaultConditions:[{Type:FrequentUnregisterNetDevice Status: Transition:0001-01-01 00:00:00 +0000 UTC Reason:NoFrequentUnregisterNetDevice Message:node is functioning properly}] Rules:[0xc000418000] EnableMetricsReporting:0xc000146528}: rule path "/home/kubernetes/bin/log-counter" does not exist. Rule: &{Type:permanent Condition:FrequentUnregisterNetDevice Reason:UnregisterNetDevice Path:/home/kubernetes/bin/log-counter Args:[--journald-source=kernel --log-path=/var/log/journal --lookback=20m --count=3 --pattern=unregister_netdevice: waiting for \w+ to become free. Usage count = \d+] TimeoutString:0xc000660270 Timeout:1m0s}
F0216 07:16:42.672808       1 custom_plugin_monitor.go:77] Failed to validate custom plugin config {Plugin:custom PluginGlobalConfig:{InvokeIntervalString:0xc000572d60 TimeoutString:0xc000572d70 InvokeInterval:5m0s Timeout:1m0s MaxOutputLength:0xc00051dc20 Concurrency:0xc00051dc30 EnableMessageChangeBasedConditionUpdate:0x2d0a80e SkipInitialStatus:0x2d0a80f} Source:kernel-monitor DefaultConditions:[{Type:FrequentUnregisterNetDevice Status: Transition:0001-01-01 00:00:00 +0000 UTC Reason:NoFrequentUnregisterNetDevice Message:node is functioning properly}] Rules:[0xc0001bd810] EnableMetricsReporting:0xc00051dc38}: rule path "/home/kubernetes/bin/log-counter" does not exist. Rule: &{Type:permanent Condition:FrequentUnregisterNetDevice Reason:UnregisterNetDevice Path:/home/kubernetes/bin/log-counter Args:[--journald-source=kernel --log-path=/var/log/journal --lookback=20m --count=3 --pattern=unregister_netdevice: waiting for \w+ to become free. Usage count = \d+] TimeoutString:0xc000572d90 Timeout:1m0s}
F0216 07:21:50.960997       1 custom_plugin_monitor.go:77] Failed to validate custom plugin config {Plugin:custom PluginGlobalConfig:{InvokeIntervalString:0xc0001179f0 TimeoutString:0xc000117a00 InvokeInterval:5m0s Timeout:1m0s MaxOutputLength:0xc000533c70 Concurrency:0xc000533c80 EnableMessageChangeBasedConditionUpdate:0x2d0a80e SkipInitialStatus:0x2d0a80f} Source:kernel-monitor DefaultConditions:[{Type:FrequentUnregisterNetDevice Status: Transition:0001-01-01 00:00:00 +0000 UTC Reason:NoFrequentUnregisterNetDevice Message:node is functioning properly}] Rules:[0xc00050b650] EnableMetricsReporting:0xc000533c88}: rule path "/home/kubernetes/bin/log-counter" does not exist. Rule: &{Type:permanent Condition:FrequentUnregisterNetDevice Reason:UnregisterNetDevice Path:/home/kubernetes/bin/log-counter Args:[--journald-source=kernel --log-path=/var/log/journal --lookback=20m --count=3 --pattern=unregister_netdevice: waiting for \w+ to become free. Usage count = \d+] TimeoutString:0xc000117a20 Timeout:1m0s}

pacoxu commented 4 months ago

Failed to validate custom plugin config 

{
  Plugin:custom 
  PluginGlobalConfig:{
    InvokeIntervalString:0xc0004711c0 
    TimeoutString:0xc0004711d0 
    InvokeInterval:5m0s Timeout:1m0s MaxOutputLength:0xc00056f4c0 Concurrency:0xc00056f4d0 EnableMessageChangeBasedConditionUpdate:0x2d0a80e SkipInitialStatus:0x2d0a80f
  } 
  Source:kernel-monitor 
  DefaultConditions:[
    {
      Type:FrequentUnregisterNetDevice Status: Transition:0001-01-01 00:00:00 +0000 UTC Reason:NoFrequentUnregisterNetDevice Message:node is functioning properly
    }
  ] 
  Rules:[0xc0005ece00] 
  EnableMetricsReporting:0xc00056f4d8
}:

rule path "/home/kubernetes/bin/log-counter" does not exist. 
Rule: &{
  Type:permanent 
  Condition:FrequentUnregisterNetDevice 
  Reason:UnregisterNetDevice 
  Path:/home/kubernetes/bin/log-counter 
  Args:[
    --journald-source=kernel 
    --log-path=/var/log/journal 
    --lookback=20m
    --count=3 
    --pattern=
    unregister_netdevice: waiting for \w+ to become free. Usage count = \d+
    ]
  TimeoutString:0xc0004711f0 Timeout:1m0s}

https://github.com/kubernetes/node-problem-detector/blob/d1166d3495cb5bf8cc340dc7ee6a3aff3f1452c1/config/kernel-monitor-counter.json#L23
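
The fatal error above comes down to a rule whose `Path` does not exist in the image. A minimal sketch of that kind of existence check, assuming a hypothetical helper named `check_rule_paths` (this is not node-problem-detector's actual validation code, it only reuses the wording of the log message):

```shell
#!/bin/sh
# Hypothetical helper, not part of node-problem-detector.
# Prints one line per missing rule path, in the wording of the log above,
# and returns non-zero if at least one path is missing.
check_rule_paths() {
  missing=0
  for rule_path in "$@"; do
    if [ ! -e "$rule_path" ]; then
      printf 'rule path "%s" does not exist\n' "$rule_path"
      missing=1
    fi
  done
  return "$missing"
}
```

Run somewhere the plugin paths are visible, for example `check_rule_paths /home/kubernetes/bin/log-counter /home/kubernetes/bin/health-checker`. A non-zero exit status means at least one configured plugin binary is absent.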

pacoxu commented 4 months ago

v0.8.15 lost the log-counter binary after https://github.com/kubernetes/node-problem-detector/pull/801 @hakman @vteratipally

➜  ~ docker run -it --rm --entrypoint=ls registry.k8s.io/node-problem-detector/node-problem-detector:v0.8.15 /home/kubernetes/bin/
health-checker
➜  ~ docker run -it --rm --entrypoint=ls registry.k8s.io/node-problem-detector/node-problem-detector:v0.8.13 /home/kubernetes/bin/
health-checker  log-counter
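
To make the comparison between two image tags repeatable, the two `ls` listings can be diffed. A rough sketch, assuming a hypothetical helper named `lost_binaries` that takes the older listing and the newer listing as two strings:

```shell
#!/bin/sh
# Hypothetical helper: print every name that appears in the first listing
# but not in the second. Written for POSIX sh or bash; it relies on word
# splitting of $old_names, so it will not behave the same under zsh.
lost_binaries() {
  # Collapse any whitespace (spaces, newlines, carriage returns) to single spaces.
  old_names=$(printf '%s' "$1" | tr -s '[:space:]' ' ')
  new_names=" $(printf '%s' "$2" | tr -s '[:space:]' ' ') "
  for name in $old_names; do
    case "$new_names" in
      *" $name "*) ;;               # still present in the newer listing
      *) printf '%s\n' "$name" ;;   # present before, missing now
    esac
  done
}
```

With the two listings quoted above, `lost_binaries "health-checker  log-counter" "health-checker"` prints `log-counter`. The arguments could equally be captured from the two `docker run ... --entrypoint=ls` commands shown in this comment.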

pacoxu commented 4 months ago

A local build shows that log-counter is skipped because the build has no journald support:

WARNING: No output specified with docker-container driver. Build result will only remain in the build cache. To push result image into registry use --push or to load image into docker use --load
CGO_ENABLED=0 GOOS=linux GOARCH=amd64 go build \
        -o bin/node-problem-detector \
        -ldflags '-X k8s.io/node-problem-detector/pkg/version.version=v0.8.15-20-gd1166d34' \
        -tags "" \
        ./cmd/nodeproblemdetector
echo "Warning: log-counter requires journald, skipping."
Warning: log-counter requires journald, skipping.
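
The last two lines of that output suggest the log-counter build step is gated on journald support. A dry-run sketch of such a gate, assuming it is controlled by a variable called `ENABLE_JOURNALD` (the variable name and the helper `build_log_counter` are assumptions for illustration, not confirmed in this thread):

```shell
#!/bin/sh
# Hypothetical dry-run of a journald-gated build step.
# ENABLE_JOURNALD is an assumed variable name; in this sketch an unset
# value is treated the same as "journald not enabled".
build_log_counter() {
  if [ "${ENABLE_JOURNALD:-}" = "1" ]; then
    echo "would build bin/log-counter"
  else
    # Same warning text as in the build output above.
    echo "Warning: log-counter requires journald, skipping."
  fi
}
```

If the real build is gated this way, an image built with the gate off would ship without `/home/kubernetes/bin/log-counter`, which matches the listing for v0.8.15.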

hakman commented 4 months ago

My guess is that it happens because of the way CloudBuild runs: https://github.com/kubernetes/node-problem-detector/blob/d1166d3495cb5bf8cc340dc7ee6a3aff3f1452c1/cloudbuild.yaml#L21

pacoxu commented 4 months ago

This seems to be intentional, per https://github.com/kubernetes/test-infra/issues/23202#issuecomment-1060219883?

pacoxu commented 4 months ago

/cc @SergeyKanzhelev

@hakman do you have some proposals to fix this?

hakman commented 4 months ago

@pacoxu @SergeyKanzhelev Let's give https://github.com/kubernetes/node-problem-detector/pull/867 a try.

hakman commented 4 months ago

@pacoxu could you give gcr.io/k8s-staging-npd/node-problem-detector:master a try? If all is OK, we could do a release.

wangzhen127 commented 3 months ago

It looks like the issue was resolved in https://github.com/kubernetes/kubernetes/pull/123114.

/close

k8s-ci-robot commented 3 months ago

@wangzhen127: Closing this issue.

In response to [this](https://github.com/kubernetes/node-problem-detector/issues/865#issuecomment-2040265806):

>It looks like the issue was resolved in https://github.com/kubernetes/kubernetes/pull/123114.
>
>/close

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository.