wangzhen127 opened 1 week ago
This looks like an infra issue. @BenTheElder Do you know who we should talk to?
CC @hakman
It's a problem with the jobs. SIG K8S infra does not create your test VMs. The test is attempting to SSH to a disposable test VM created by your job.
Seems like the VM is not serving SSH, or something similar.
CC @DigitalVeer
If these are like node e2e tests, folks in SIG node might be familiar
SIG Testing strongly discourages SSH usage in cluster e2e tests, relying instead on hostexec pods when necessary. For some node-style testing that's not sufficient, though, and mostly folks in SIG Node work with this.
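For context, a "hostexec" pod is roughly a privileged pod that shares the host's namespaces, so node-level commands can be run via `kubectl exec` instead of SSHing to the node. A minimal sketch of the idea (the pod name and image are illustrative, not the e2e framework's actual helper):

```yaml
# Illustrative hostexec-style pod: privileged, sharing the host's network
# and PID namespaces, parked on "sleep" so commands can be exec'd into it.
apiVersion: v1
kind: Pod
metadata:
  name: hostexec-sketch          # hypothetical name
spec:
  hostNetwork: true
  hostPID: true
  containers:
  - name: hostexec
    image: busybox:1.36          # image choice is an assumption
    command: ["sleep", "3600"]
    securityContext:
      privileged: true
```

Node-level state can then be inspected with, e.g., `kubectl exec hostexec-sketch -- cat /proc/mounts`.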
It's possible there is an issue with the GCP projects rented by this test. It's unclear to me why the SSH connection is not working, but I'll try to debug with @hakman.
This is an issue with `cos-stable-117`. SSH works fine in all other tests (which are similar).
I tried to reproduce the ext4 test failure and found that the command used by the test is:

```
echo "fake filesystem error from problem-maker" > /sys/fs/ext4/sda1/trigger_fs_error
```

Once this runs, the filesystem is remounted read-only and SSH stops working with `Connection reset by peer`:
```
[ 169.101160] EXT4-fs error (device sda1): trigger_test_error:127: comm bash: fake filesystem error from problem-maker
[ 169.108852] Aborting journal on device sda1-8.
[ 169.115130] EXT4-fs (sda1): Remounting filesystem read-only
```
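For anyone debugging this locally, one quick way to confirm the remount is to check the mount options in `/proc/mounts`. A small sketch (the `is_readonly` helper is illustrative, not part of the test; the optional second argument just makes it easy to point at a saved copy of the mounts file):

```shell
# Check whether a device is mounted read-only by inspecting /proc/mounts.
is_readonly() {
  # Field 4 of /proc/mounts is the comma-separated option list, e.g. "ro,relatime".
  awk -v dev="$1" '
    $1 == dev {
      n = split($4, opts, ",")
      for (i = 1; i <= n; i++)
        if (opts[i] == "ro") { print "readonly"; exit }
    }' "${2:-/proc/mounts}"
}

# After trigger_fs_error fires, this should print "readonly" for the root device:
# is_readonly /dev/sda1
```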
There may be some recent changes that affect the behaviour of `trigger_fs_error`:
https://lore.kernel.org/all/20241006033805.GB158527@mit.edu/t/#u
https://testgrid.k8s.io/presubmits-node-problem-detector#pull-npd-e2e-test started failing recently.
This is affecting several different PRs: https://github.com/kubernetes/node-problem-detector/pull/955, https://github.com/kubernetes/node-problem-detector/pull/961, https://github.com/kubernetes/node-problem-detector/pull/969.