Open trozet opened 8 months ago
one more here: https://github.com/ovn-org/ovn-kubernetes/actions/runs/7789676840/job/21242664047?pr=4100 :/
https://github.com/ovn-org/ovn-kubernetes/pull/4100
@maiqueb or @qinqon : can one of you please fix this?
@tssurya we are preparing a fix for it, it will be ready to merge today.
one more: https://github.com/ovn-org/ovn-kubernetes/actions/runs/7813480320/job/21313563225?pr=4100#logs
ack thanks @qinqon !!
@tssurya we are going to start small https://github.com/ovn-org/ovn-kubernetes/pull/4140
We have some other ideas but we will go step by step.
Thanks @qinqon
seen again here: https://github.com/ovn-org/ovn-kubernetes/actions/runs/7814875289/job/21318048354?pr=4061
Adding the error so I can track whether it's always the same.
[FAILED] Timed out after 60.001s.
worker1: after live migration to node owning the subnet: Check tcp connection is not broken
Expected success, but got an error:
<*errors.errorString | 0xc000f52b70>:
http connection to virtual machine was broken
{
s: "http connection to virtual machine was broken",
}
In [It] at: /home/runner/work/ovn-kubernetes/ovn-kubernetes/test/e2e/kubevirt.go:330 @ 02/07/24 13:22:07.982
I am reproducing this error at my fork, running 12 jobs in parallel with a simpler TCP server so it is easier to debug: https://github.com/qinqon/ovn-kubernetes/pull/10
https://github.com/qinqon/ovn-kubernetes/pull/10#issuecomment-1934147363
After around 25 runs, this failure shows up with the stabilization PR: https://github.com/ovn-org/ovn-kubernetes/actions/runs/7840556162/job/21395793644?pr=4145
2024-02-09T07:20:48.1135328Z Latency metrics for node ovn-worker3
2024-02-09T07:20:48.1136685Z STEP: Destroying namespace "kv-live-migration-923" for this suite. @ 02/09/24 07:20:48.113
2024-02-09T07:20:48.1181879Z • [FAILED] [419.375 seconds]
2024-02-09T07:20:48.1184612Z Kubevirt Virtual Machines when live migration [It] with pre-copy succeeds, should keep connectivity
2024-02-09T07:20:48.1187469Z /home/runner/work/ovn-kubernetes/ovn-kubernetes/test/e2e/kubevirt.go:836
2024-02-09T07:20:48.1188240Z
2024-02-09T07:20:48.1190075Z [FAILED] worker1: after live migration for the second time to node not owning subnet: Check connectivity is restored after delete deny all network policy
2024-02-09T07:20:48.1191653Z Expected success, but got an error:
2024-02-09T07:20:48.1192369Z <*fmt.wrapError | 0xc000763d20>:
2024-02-09T07:20:48.1193746Z failed Write to server: write tcp 172.18.0.1:41446->172.18.0.3:31702: write: broken pipe
2024-02-09T07:20:48.1194723Z {
2024-02-09T07:20:48.1196036Z msg: "failed Write to server: write tcp 172.18.0.1:41446->172.18.0.3:31702: write: broken pipe",
2024-02-09T07:20:48.1197218Z err: <*net.OpError | 0xc00123f9f0>{
2024-02-09T07:20:48.1197908Z Op: "write",
2024-02-09T07:20:48.1198471Z Net: "tcp",
2024-02-09T07:20:48.1199748Z Source: <*net.TCPAddr | 0xc0006ed9b0>{IP: [172, 18, 0, 1], Port: 41446, Zone: ""},
2024-02-09T07:20:48.1201354Z Addr: <*net.TCPAddr | 0xc0006edaa0>{IP: [172, 18, 0, 3], Port: 31702, Zone: ""},
2024-02-09T07:20:48.1202601Z Err: <*os.SyscallError | 0xc000763d00>{
2024-02-09T07:20:48.1203335Z Syscall: "write",
2024-02-09T07:20:48.1203895Z Err: <syscall.Errno>0x20,
2024-02-09T07:20:48.1204307Z },
2024-02-09T07:20:48.1204642Z },
2024-02-09T07:20:48.1204937Z }
2024-02-09T07:20:48.1206276Z In [It] at: /home/runner/work/ovn-kubernetes/ovn-kubernetes/test/e2e/kubevirt.go:401 @ 02/09/24 07:20:46.845
I will try to reproduce this too with the testing PR.
I have disabled interconnect and removed the network policy part of the test to make it simpler, and I am also printing all the echoes. This is the result:
Server logs
Server is running on: :9900
2024/02/12 12:26:15 Handling connection 100.64.0.5:32870
2024/02/12 12:26:15 Handling connection [fd98::5]:58614
2024/02/12 12:28:33 failed copying data: readfrom tcp 10.244.1.8:9900->100.64.0.5:32870: splice: connection reset by peer
2024/02/12 12:28:33 Closing connection 100.64.0.5:32870
Client logs
STEP: worker1: after live migration for the second time to node not owning subnet: Check tcp connection is not broken @ 02/12/24 13:07:39.455
STEP: Writing 'Halo' @ 02/12/24 13:07:39.455
STEP: Reading 'Halo' @ 02/12/24 13:07:39.455
STEP: failed reading: read tcp 172.18.0.1:41882->172.18.0.4:30247: read: connection reset by peer @ 02/12/24 13:07:39.465
STEP: Writing 'Halo' @ 02/12/24 13:07:54.465
STEP: failed writing: write tcp 172.18.0.1:41882->172.18.0.4:30247: write: broken pipe @ 02/12/24 13:07:54.466
STEP: Writing 'Halo' @ 02/12/24 13:08:09.466
STEP: failed writing: write tcp 172.18.0.1:41882->172.18.0.4:30247: write: broken pipe @ 02/12/24 13:08:09.467
STEP: Writing 'Halo' @ 02/12/24 13:08:24.467
STEP: failed writing: write tcp 172.18.0.1:41882->172.18.0.4:30247: write: broken pipe @ 02/12/24 13:08:24.467
The test should not retry if the first error is "connection reset by peer", since retrying on the same connection will not fix anything.
I am going to attach a tcpdump to the client to see if we are receiving an RST packet that triggers the "connection reset by peer" at the first echo.
This should appear way less often after https://github.com/ovn-org/ovn-kubernetes/pull/4174
https://github.com/ovn-org/ovn-kubernetes/actions/runs/8478907825/job/23232584804?pr=4246
@trozet adding this link to https://github.com/ovn-org/ovn-kubernetes/issues/4237 since it is the specific error related to the nameserver
Also seen in https://github.com/ovn-org/ovn-kubernetes/pull/4468
Seeing failures with this test still:
https://github.com/ovn-org/ovn-kubernetes/actions/runs/6646915605/job/18062050736?pr=3979