ovn-org / ovn-kubernetes

A robust Kubernetes networking platform
https://ovn-kubernetes.io/
Apache License 2.0

Bump OVN to ovn-24.03.2-19 to fix multicast bug #4457

Status: Closed (ricky-rav closed 3 days ago)

ricky-rav commented 1 week ago

Bumps OVN to 24.03.2-19, which reverts the multicast-related commits that introduced a regression. Extends the test to cover the scenario that was broken: an additional receiver on the same node as the sender.

https://issues.redhat.com/browse/OCPBUGS-34778
https://issues.redhat.com/browse/FDP-656

coveralls commented 1 week ago

coverage: 52.76% (+0.01%) from 52.749% when pulling 8594725b2c13bc9b9b646078e31accc38ff712f9 on ricky-rav:OCPBUGS-34778_upstream into 9f1f3f2866fc566ffbe582ae9adf77d60d838484 on ovn-org:master.

coveralls commented 1 week ago

Changes unknown when pulling 9c4179a28325012603baf0c5ef21436f3d94f2de on ricky-rav:OCPBUGS-34778_upstream into on ovn-org:master.

tssurya commented 1 week ago

@qinqon both kv live migration jobs failed... looks like we need to look into this.

tssurya commented 1 week ago

shard conformance: https://github.com/ovn-org/ovn-kubernetes/actions/runs/9597526159/job/26468141166?pr=4457 timed out :/

tssurya commented 1 week ago
2024-06-20T13:47:05.8175952Z Multicast should be able to send multicast UDP traffic between nodes
2024-06-20T13:47:05.8177373Z /home/runner/work/ovn-kubernetes/ovn-kubernetes/test/e2e/multicast.go:79
2024-06-20T13:47:05.8178804Z   STEP: Creating a kubernetes client @ 06/20/24 13:47:05.817
2024-06-20T13:47:05.8179917Z   Jun 20 13:47:05.817: INFO: >>> kubeConfig: /home/runner/ovn.conf
2024-06-20T13:47:05.8183639Z   STEP: Building a namespace api object, basename multicast @ 06/20/24 13:47:05.818
2024-06-20T13:47:05.8221785Z   Jun 20 13:47:05.821: INFO: Skipping waiting for service account
2024-06-20T13:47:05.8421587Z   STEP: creating a pod as a multicast source in node ovn-worker @ 06/20/24 13:47:05.841
2024-06-20T13:47:05.8487617Z   W0620 13:47:05.848044   74981 warnings.go:70] would violate PodSecurity "restricted:latest": allowPrivilegeEscalation != false (container "pod-client" must set securityContext.allowPrivilegeEscalation=false), unrestricted capabilities (container "pod-client" must set securityContext.capabilities.drop=["ALL"]), runAsNonRoot != true (pod or container "pod-client" must set securityContext.runAsNonRoot=true), seccompProfile (pod or container "pod-client" must set securityContext.seccompProfile.type to "RuntimeDefault" or "Localhost")
2024-06-20T13:47:07.8552911Z   STEP: creating first multicast listener pod in node ovn-worker2 @ 06/20/24 13:47:07.854
2024-06-20T13:47:07.8605049Z   W0620 13:47:07.859650   74981 warnings.go:70] would violate PodSecurity "restricted:latest": allowPrivilegeEscalation != false (container "pod-server1" must set securityContext.allowPrivilegeEscalation=false), unrestricted capabilities (container "pod-server1" must set securityContext.capabilities.drop=["ALL"]), runAsNonRoot != true (pod or container "pod-server1" must set securityContext.runAsNonRoot=true), seccompProfile (pod or container "pod-server1" must set securityContext.seccompProfile.type to "RuntimeDefault" or "Localhost")
2024-06-20T13:47:09.8687194Z   STEP: creating second multicast listener pod in node ovn-worker2 @ 06/20/24 13:47:09.868
2024-06-20T13:47:09.8735850Z   W0620 13:47:09.872877   74981 warnings.go:70] would violate PodSecurity "restricted:latest": allowPrivilegeEscalation != false (container "pod-server2" must set securityContext.allowPrivilegeEscalation=false), unrestricted capabilities (container "pod-server2" must set securityContext.capabilities.drop=["ALL"]), runAsNonRoot != true (pod or container "pod-server2" must set securityContext.runAsNonRoot=true), seccompProfile (pod or container "pod-server2" must set securityContext.seccompProfile.type to "RuntimeDefault" or "Localhost")
2024-06-20T13:47:11.8815733Z   STEP: creating first multicast listener pod in node ovn-worker @ 06/20/24 13:47:11.881
2024-06-20T13:47:11.8867196Z   W0620 13:47:11.886000   74981 warnings.go:70] would violate PodSecurity "restricted:latest": allowPrivilegeEscalation != false (container "pod-server3" must set securityContext.allowPrivilegeEscalation=false), unrestricted capabilities (container "pod-server3" must set securityContext.capabilities.drop=["ALL"]), runAsNonRoot != true (pod or container "pod-server3" must set securityContext.runAsNonRoot=true), seccompProfile (pod or container "pod-server3" must set securityContext.seccompProfile.type to "RuntimeDefault" or "Localhost")
2024-06-20T13:47:13.8938814Z   STEP: checking if pod server1 received multicast traffic @ 06/20/24 13:47:13.893
2024-06-20T13:47:13.9057278Z   STEP: checking if pod server2 does not received multicast traffic @ 06/20/24 13:47:13.905
2024-06-20T13:47:13.9089851Z   STEP: checking if pod server3 received multicast traffic @ 06/20/24 13:47:13.908
2024-06-20T13:47:13.9194687Z   STEP: Destroying namespace "multicast-8182" for this suite. @ 06/20/24 13:47:13.919
2024-06-20T13:47:13.9221258Z • [8.105 seconds]

The multicast test passes.

tssurya commented 1 week ago

external gateway lane: https://github.com/ovn-org/ovn-kubernetes/actions/runs/9597526159/job/26468146207?pr=4457 is a known flake: https://github.com/ovn-org/ovn-kubernetes/issues/4432

tssurya commented 1 week ago

Given it's an OVN bump and both live migration jobs failed, I cannot merge with a red CI:

2024-06-20T13:19:20.8836054Z   Latency metrics for node ovn-worker3
2024-06-20T13:19:20.8837410Z   STEP: Destroying namespace "kv-live-migration-1853" for this suite. @ 06/20/24 13:19:20.883
2024-06-20T13:19:20.8884104Z • [FAILED] [214.253 seconds]
2024-06-20T13:19:20.8885911Z Kubevirt Virtual Machines with default pod network when live migration [It] with pre-copy succeeds, should keep connectivity
2024-06-20T13:19:20.8887975Z /home/runner/work/ovn-kubernetes/ovn-kubernetes/test/e2e/kubevirt.go:1093
2024-06-20T13:19:20.8888816Z 
2024-06-20T13:19:20.8889269Z   [FAILED] worker1: Expose tcpServer as a service
2024-06-20T13:19:20.8889948Z   Unexpected error:
2024-06-20T13:19:20.8890539Z       <*fmt.wrapError | 0xc000e8e160>: 
2024-06-20T13:19:20.8891567Z       failed DialTCP: dial tcp 172.18.0.2:32485: connect: connection refused
2024-06-20T13:19:20.8892420Z       {
2024-06-20T13:19:20.8893456Z           msg: "failed DialTCP: dial tcp 172.18.0.2:32485: connect: connection refused",
2024-06-20T13:19:20.8894626Z           err: <*net.OpError | 0xc000d9bf90>{
2024-06-20T13:19:20.8895290Z               Op: "dial",
2024-06-20T13:19:20.8896407Z               Net: "tcp",
2024-06-20T13:19:20.8897012Z               Source: nil,
2024-06-20T13:19:20.8897757Z               Addr: <*net.TCPAddr | 0xc001210000>{
2024-06-20T13:19:20.8898718Z                   IP: [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 255, 255, 172, 18, 0, 2],
2024-06-20T13:19:20.8899401Z                   Port: 32485,
2024-06-20T13:19:20.8900010Z                   Zone: "",
2024-06-20T13:19:20.8900430Z               },
2024-06-20T13:19:20.8901008Z               Err: <*os.SyscallError | 0xc000e8e140>{
2024-06-20T13:19:20.8901670Z                   Syscall: "connect",
2024-06-20T13:19:20.8902173Z                   Err: <syscall.Errno>0x6f,
2024-06-20T13:19:20.8902787Z               },
2024-06-20T13:19:20.8903209Z           },
2024-06-20T13:19:20.8903610Z       }
2024-06-20T13:19:20.8904012Z   occurred
2024-06-20T13:19:20.8905263Z   In [It] at: /opt/hostedtoolcache/go/1.21.11/x64/src/runtime/asm_amd64.s:1650 @ 06/20/24 13:19:19.612
2024-06-20T13:19:20.8906000Z 

This failure might not be related, but I can't risk a regression. My gut tells me to see at least one lane green, so I triggered a re-run of the failed lanes.

tssurya commented 1 week ago

Live migration has failed again. @ricky-rav FYI, we need to investigate the CI failure before I can merge this.

qinqon commented 1 week ago

@tssurya, a few weeks ago we tested some OVN changes related to arp_proxy and they were working all right. Maybe they are still problematic and we didn't test them well:

https://github.com/ovn-org/ovn/commit/cc4187b4b49e25bc60c94aff493ac22ffe0a418c

coveralls commented 4 days ago

coverage: 52.734% (+0.03%) from 52.707% when pulling 935951b9fd5cd63b82998dc9a3ead5455d65eb3c on ricky-rav:OCPBUGS-34778_upstream into ebf2c6849cbd2f5164b9f062892a7b8483892ea4 on ovn-org:master.