Open trozet opened 8 months ago
one more here: https://github.com/ovn-org/ovn-kubernetes/actions/runs/7789676840/job/21242664047?pr=4100 :/
https://github.com/ovn-org/ovn-kubernetes/pull/4100
@maiqueb or @qinqon : can one of you please fix this?
@tssurya we are preparing a fix for it, it will be ready to merge today.
one more: https://github.com/ovn-org/ovn-kubernetes/actions/runs/7813480320/job/21313563225?pr=4100#logs
ack thanks @qinqon !!
@tssurya we are going to start small https://github.com/ovn-org/ovn-kubernetes/pull/4140
We have some other ideas but we will go step by step.
Thanks @qinqon
seen again here: https://github.com/ovn-org/ovn-kubernetes/actions/runs/7814875289/job/21318048354?pr=4061
Adding the error so I can track whether it's always the same.
[FAILED] Timed out after 60.001s.
worker1: after live migration to node owning the subnet: Check tcp connection is not broken
Expected success, but got an error:
<*errors.errorString | 0xc000f52b70>:
http connection to virtual machine was broken
{
s: "http connection to virtual machine was broken",
}
In [It] at: /home/runner/work/ovn-kubernetes/ovn-kubernetes/test/e2e/kubevirt.go:330 @ 02/07/24 13:22:07.982
I am reproducing this error at my fork, running 12 jobs in parallel with a simpler TCP server so it is easier to debug: https://github.com/qinqon/ovn-kubernetes/pull/10
https://github.com/qinqon/ovn-kubernetes/pull/10#issuecomment-1934147363
After around 25 runs, this failure shows up with the stabilization PR: https://github.com/ovn-org/ovn-kubernetes/actions/runs/7840556162/job/21395793644?pr=4145
2024-02-09T07:20:48.1135328Z Latency metrics for node ovn-worker3
2024-02-09T07:20:48.1136685Z STEP: Destroying namespace "kv-live-migration-923" for this suite. @ 02/09/24 07:20:48.113
2024-02-09T07:20:48.1181879Z • [FAILED] [419.375 seconds]
2024-02-09T07:20:48.1184612Z Kubevirt Virtual Machines when live migration [It] with pre-copy succeeds, should keep connectivity
2024-02-09T07:20:48.1187469Z /home/runner/work/ovn-kubernetes/ovn-kubernetes/test/e2e/kubevirt.go:836
2024-02-09T07:20:48.1188240Z
2024-02-09T07:20:48.1190075Z [FAILED] worker1: after live migration for the second time to node not owning subnet: Check connectivity is restored after delete deny all network policy
2024-02-09T07:20:48.1191653Z Expected success, but got an error:
2024-02-09T07:20:48.1192369Z <*fmt.wrapError | 0xc000763d20>:
2024-02-09T07:20:48.1193746Z failed Write to server: write tcp 172.18.0.1:41446->172.18.0.3:31702: write: broken pipe
2024-02-09T07:20:48.1194723Z {
2024-02-09T07:20:48.1196036Z msg: "failed Write to server: write tcp 172.18.0.1:41446->172.18.0.3:31702: write: broken pipe",
2024-02-09T07:20:48.1197218Z err: <*net.OpError | 0xc00123f9f0>{
2024-02-09T07:20:48.1197908Z Op: "write",
2024-02-09T07:20:48.1198471Z Net: "tcp",
2024-02-09T07:20:48.1199748Z Source: <*net.TCPAddr | 0xc0006ed9b0>{IP: [172, 18, 0, 1], Port: 41446, Zone: ""},
2024-02-09T07:20:48.1201354Z Addr: <*net.TCPAddr | 0xc0006edaa0>{IP: [172, 18, 0, 3], Port: 31702, Zone: ""},
2024-02-09T07:20:48.1202601Z Err: <*os.SyscallError | 0xc000763d00>{
2024-02-09T07:20:48.1203335Z Syscall: "write",
2024-02-09T07:20:48.1203895Z Err: <syscall.Errno>0x20,
2024-02-09T07:20:48.1204307Z },
2024-02-09T07:20:48.1204642Z },
2024-02-09T07:20:48.1204937Z }
2024-02-09T07:20:48.1206276Z In [It] at: /home/runner/work/ovn-kubernetes/ovn-kubernetes/test/e2e/kubevirt.go:401 @ 02/09/24 07:20:46.845
I will try to reproduce this too with the testing PR.
I have disabled interconnect and removed the network policy part of the test to make it simpler, and I am also printing all the echoes. This is the result:
Server logs
Server is running on: :9900
2024/02/12 12:26:15 Handling connection 100.64.0.5:32870
2024/02/12 12:26:15 Handling connection [fd98::5]:58614
2024/02/12 12:28:33 failed copying data: readfrom tcp 10.244.1.8:9900->100.64.0.5:32870: splice: connection reset by peer
2024/02/12 12:28:33 Closing connection 100.64.0.5:32870
Client logs
STEP: worker1: after live migration for the second time to node not owning subnet: Check tcp connection is not broken @ 02/12/24 13:07:39.455
STEP: Writing 'Halo' @ 02/12/24 13:07:39.455
STEP: Reading 'Halo' @ 02/12/24 13:07:39.455
STEP: failed reading: read tcp 172.18.0.1:41882->172.18.0.4:30247: read: connection reset by peer @ 02/12/24 13:07:39.465
STEP: Writing 'Halo' @ 02/12/24 13:07:54.465
STEP: failed writing: write tcp 172.18.0.1:41882->172.18.0.4:30247: write: broken pipe @ 02/12/24 13:07:54.466
STEP: Writing 'Halo' @ 02/12/24 13:08:09.466
STEP: failed writing: write tcp 172.18.0.1:41882->172.18.0.4:30247: write: broken pipe @ 02/12/24 13:08:09.467
STEP: Writing 'Halo' @ 02/12/24 13:08:24.467
STEP: failed writing: write tcp 172.18.0.1:41882->172.18.0.4:30247: write: broken pipe @ 02/12/24 13:08:24.467
The test should not retry if the first error is "connection reset by peer", since retrying on the same connection will not fix anything.
I am going to attach a tcpdump to the client to see if we are receiving an RST packet that triggers the "connection reset by peer" at the first echo.
This should appear way less often after https://github.com/ovn-org/ovn-kubernetes/pull/4174
https://github.com/ovn-org/ovn-kubernetes/actions/runs/8478907825/job/23232584804?pr=4246
@trozet adding this link to https://github.com/ovn-org/ovn-kubernetes/issues/4237 since it is the specific error related to the nameserver
Also seen in https://github.com/ovn-org/ovn-kubernetes/pull/4468
Seeing failures with this test still:
https://github.com/ovn-org/ovn-kubernetes/actions/runs/6646915605/job/18062050736?pr=3979