kubernetes / node-problem-detector

This is a place for various problem detectors running on the Kubernetes nodes.
Apache License 2.0
2.97k stars 628 forks source link

pull-npd-e2e-test failing ssh handshake #970

Open wangzhen127 opened 1 week ago

wangzhen127 commented 1 week ago

https://testgrid.k8s.io/presubmits-node-problem-detector#pull-npd-e2e-test starts to fail recently.

[1] NPD should export Prometheus metrics. When OOM kills and docker hung happen 
[1]   NPD should update problem_counter and problem_gauge
[1]   /home/prow/go/src/k8s.io/node-problem-detector/test/e2e/metriconly/metrics_test.go:158
[2] error dialing prow@35.184.209.153:22: 'ssh: handshake failed: read tcp 10.32.2.7:54804->35.184.209.153:22: read: connection reset by peer', retrying
[2] error dialing prow@35.184.209.153:22: 'ssh: handshake failed: read tcp 10.32.2.7:52980->35.184.209.153:22: read: connection reset by peer', retrying
[2] error dialing prow@35.184.209.153:22: 'ssh: handshake failed: read tcp 10.32.2.7:53002->35.184.209.153:22: read: connection reset by peer', retrying
[2] error dialing prow@35.184.209.153:22: 'ssh: handshake failed: read tcp 10.32.2.7:44696->35.184.209.153:22: read: connection reset by peer', retrying
[2] Error storing debugging data to test artifacts: [Error running command: {prow 35.184.209.153 curl http://localhost:20257/metrics   0 error getting SSH client to prow@35.184.209.153:22: 'ssh: handshake failed: read tcp 10.32.2.7:52990->35.184.209.153:22: read: connection reset by peer'}
[2]  Error running command: {prow 35.184.209.153 sudo journalctl -u node-problem-detector.service   0 error getting SSH client to prow@35.184.209.153:22: 'ssh: handshake failed: read tcp 10.32.2.7:44688->35.184.209.153:22: read: connection reset by peer'}
[2]  Error running command: {prow 35.184.209.153 sudo journalctl -k   0 error getting SSH client to prow@35.184.209.153:22: 'ssh: handshake failed: read tcp 10.32.2.7:44708->35.184.209.153:22: read: connection reset by peer'}
[2] ]

This is affecting several different PRs: https://github.com/kubernetes/node-problem-detector/pull/955, https://github.com/kubernetes/node-problem-detector/pull/961, https://github.com/kubernetes/node-problem-detector/pull/969.

wangzhen127 commented 1 week ago

This looks like an infra issue. @BenTheElder Do you know who should we talk to?

CC @hakman

BenTheElder commented 1 week ago

It's a problem with the jobs. SIG K8S infra does not create your test VMs. The test is attempting to SSH to a disposable test VM created by your job.

seems like the VM is not serving SSH or something similar

wangzhen127 commented 1 week ago

CC @DigitalVeer

BenTheElder commented 1 week ago

If these are like node e2e tests, folks in SIG node might be familiar

SIG Testing strongly discourages ssh usage in cluster e2e tests, relying instead on hostexec pods when necessary, but for some node style testing that's not sufficient, and mostly folks in SIG Node work with this.

ameukam commented 1 week ago

It's possible there is with an issue with the GCP projects rented by this test. It's unclear to me why the SSH connection is not working but I'll try to debug with @hakman.

hakman commented 5 days ago

This is an issue with cos-stable-117. SSH works pretty well in all other tests (which are similar). I tried to reproduce what happens with the ext4 test and found out that the command used in the test is:

echo "fake filesystem error from problem-maker" > /sys/fs/ext4/sda1/trigger_fs_error

Once this runs, the filesystem is mounted as read-only and SSH stops working with Connection reset by peer:

[  169.101160] EXT4-fs error (device sda1): trigger_test_error:127: comm bash: fake filesystem error from problem-maker
[  169.108852] Aborting journal on device sda1-8.
[  169.115130] EXT4-fs (sda1): Remounting filesystem read-only

There may be some recent changes that affect the behaviour of trigger_fs_error. https://lore.kernel.org/all/20241006033805.GB158527@mit.edu/t/#u