kubernetes-sigs / cloud-provider-azure

Cloud provider for Azure
https://cloud-provider-azure.sigs.k8s.io/
Apache License 2.0

[k/k][e2e] Services should be able to up and down services #6293

Open lzhecheng opened 1 month ago

lzhecheng commented 1 month ago

Which jobs are failing:

cloud-provider-azure-conformance-multiple-zones-vmss-capz

Which test(s) are failing:

Services should be able to up and down services

Since when has it been failing:

2024.5.11

Testgrid link:

https://prow.k8s.io/view/gs/kubernetes-jenkins/logs/cloud-provider-azure-conformance-multiple-zones-vmss-capz/1793356241532096512

Reason for failure:

Anything else we need to know:

This test fails at a simple step: a Pod with hostNetwork: true runs wget against a Service, with a command like this: wget -q -O - -T 1 http://<svc-ip>:<port> 2>&1

When the Service's endpoints are on a node in another zone, the wget timeout is not long enough: 1s fails but 2s works.

I did a few experiments:

Target    Timeout  Works?
Svc       1s       N
Svc       2s       Y
Endpoint  1s       Y
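
To repeat the experiment above without wget, here is a minimal Python sketch that times a TCP connect against each target with each timeout. The `probe` and `compare` helpers and the placeholder addresses are my own illustration, not part of the e2e test, and the sketch only covers the TCP connect, not the HTTP request that wget makes afterwards.

```python
import socket
import time


def probe(host, port, timeout_s):
    """Attempt a TCP connect; return (connected, elapsed_seconds)."""
    start = time.monotonic()
    try:
        with socket.create_connection((host, port), timeout=timeout_s):
            return True, time.monotonic() - start
    except OSError:
        # Covers timeouts as well as refused or unreachable connections.
        return False, time.monotonic() - start


def compare(targets, timeouts=(1.0, 2.0)):
    """Probe every target with every timeout; return result rows."""
    rows = []
    for name, (host, port) in targets.items():
        for timeout_s in timeouts:
            ok, elapsed = probe(host, port, timeout_s)
            rows.append((name, timeout_s, ok, elapsed))
    return rows


if __name__ == "__main__":
    # Hypothetical placeholders: substitute the real Service and endpoint
    # addresses from the cluster under test.
    targets = {
        "Svc": ("10.0.0.10", 80),
        "Endpoint": ("192.168.198.91", 9376),
    }
    for name, timeout_s, ok, elapsed in compare(targets):
        print(f"{name:<9} {timeout_s:.0f}s  {'Y' if ok else 'N'}  ({elapsed:.3f}s)")
```

Run it from a hostNetwork Pod or directly on the source node so that the traffic takes the same path as in the failing test.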

Since 2024.5.11, the test has been failing, with no obvious change anywhere.

It may be related to Azure (multiple zones) or to the CNI, I guess.

cc @feiskyer

lzhecheng commented 1 month ago

Here is what I found; it seems to be related to Calico:

The traffic goes from a node to a Service whose endpoint is on another node (in a different zone).

I captured packets on the destination node with different Calico versions.

Calico 3.28.0:

06:32:43.407457 IP zhecheng-522-1-mp-0000000.internal.cloudapp.net.58595 > zhecheng-522-1-mp-0000001.internal.cloudapp.net.4789: VXLAN, flags [I] (0x08), vni 4096
IP 192.168.64.192.11362 > 192.168.198.91.9376: Flags [S], seq 3224522817, win 64240, options [mss 1460,sackOK,TS val 789304834 ecr 0,nop,wscale 7], length 0
06:32:44.417142 IP zhecheng-522-1-mp-0000000.internal.cloudapp.net.55170 > zhecheng-522-1-mp-0000001.internal.cloudapp.net.4789: VXLAN, flags [I] (0x08), vni 4096
IP 192.168.64.192.11362 > 192.168.198.91.9376: Flags [S], seq 3224522817, win 64240, options [mss 1460,sackOK,TS val 789305844 ecr 0,nop,wscale 7], length 0
06:32:44.417334 IP zhecheng-522-1-mp-0000001.internal.cloudapp.net.54585 > zhecheng-522-1-mp-0000000.internal.cloudapp.net.4789: VXLAN, flags [I] (0x08), vni 4096
IP 192.168.198.91.9376 > 192.168.64.192.11362: Flags [S.], seq 2917715798, ack 3224522818, win 64900, options [mss 1310,sackOK,TS val 2643650167 ecr 789305844,nop,wscale 7], length 0

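The first SYN arrived at the destination node at 06:32:43.407457, but there was no response. About 1 second later, at 06:32:44.417142, a second SYN with the same sequence number (a retransmission) arrived, and this one got a SYN-ACK. Somehow, the first packet is dropped, which is consistent with wget failing with a 1s timeout and working with 2s. There is no such issue with Calico 3.27.3.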

Calico 3.27.3:

03:04:31.057520 IP zhecheng-528-mp-0000000.internal.cloudapp.net.34057 > zhecheng-528-mp-0000001.internal.cloudapp.net.4789: VXLAN, flags [I] (0x08), vni 4096
IP zhecheng-528-mp-0000000.49981 > 192.168.121.208.9376: Flags [S], seq 1708607805, win 64240, options [mss 1460,sackOK,TS val 752390286 ecr 0,nop,wscale 7], length 0
03:04:31.059985 IP zhecheng-528-mp-0000001.internal.cloudapp.net.60462 > zhecheng-528-mp-0000000.internal.cloudapp.net.4789: VXLAN, flags [I] (0x08), vni 4096
IP 192.168.121.208.9376 > zhecheng-528-mp-0000000.49981: Flags [S.], seq 3450770987, ack 1708607806, win 64900, options [mss 1310,sackOK,TS val 446376357 ecr 752390286,nop,wscale 7], length 0
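
To compare the two captures mechanically, here is a rough Python sketch that parses tcpdump text output like the above, counts retransmitted SYNs (same source, destination and sequence number), and measures the time from the first SYN to the first SYN-ACK. The `parse` and `summarize` helpers are hypothetical, the regexes only assume the line layout seen in these two captures, and the embedded samples are abbreviated from them (hostnames shortened, TCP options dropped).

```python
import re
from datetime import datetime

# Outer VXLAN line carries the timestamp; inner line carries the TCP header.
OUTER = re.compile(r"(\d{2}:\d{2}:\d{2}\.\d{6}) IP ")
INNER = re.compile(r"^IP (\S+) > (\S+): Flags \[([^\]]+)\], seq (\d+)")

# Abbreviated from the Calico 3.28.0 capture.
CALICO_3_28_0 = """\
06:32:43.407457 IP mp-0000000.58595 > mp-0000001.4789: VXLAN, flags [I] (0x08), vni 4096
IP 192.168.64.192.11362 > 192.168.198.91.9376: Flags [S], seq 3224522817, win 64240, length 0
06:32:44.417142 IP mp-0000000.55170 > mp-0000001.4789: VXLAN, flags [I] (0x08), vni 4096
IP 192.168.64.192.11362 > 192.168.198.91.9376: Flags [S], seq 3224522817, win 64240, length 0
06:32:44.417334 IP mp-0000001.54585 > mp-0000000.4789: VXLAN, flags [I] (0x08), vni 4096
IP 192.168.198.91.9376 > 192.168.64.192.11362: Flags [S.], seq 2917715798, ack 3224522818, length 0
"""

# Abbreviated from the Calico 3.27.3 capture.
CALICO_3_27_3 = """\
03:04:31.057520 IP mp-0000000.34057 > mp-0000001.4789: VXLAN, flags [I] (0x08), vni 4096
IP zhecheng-528-mp-0000000.49981 > 192.168.121.208.9376: Flags [S], seq 1708607805, win 64240, length 0
03:04:31.059985 IP mp-0000001.60462 > mp-0000000.4789: VXLAN, flags [I] (0x08), vni 4096
IP 192.168.121.208.9376 > zhecheng-528-mp-0000000.49981: Flags [S.], seq 3450770987, ack 1708607806, length 0
"""


def parse(capture):
    """Return (timestamp, src, dst, flags, seq) for each inner TCP packet."""
    packets = []
    ts = None
    for line in capture.splitlines():
        line = line.strip()
        inner = INNER.match(line)
        if inner and ts is not None:
            src, dst, flags, seq = inner.groups()
            packets.append((ts, src, dst, flags, int(seq)))
            continue
        outer = OUTER.search(line)
        if outer:
            ts = datetime.strptime(outer.group(1), "%H:%M:%S.%f")
    return packets


def summarize(packets):
    """Return (retransmitted SYN count, seconds from first SYN to first SYN-ACK)."""
    syns = [p for p in packets if p[3] == "S"]
    synacks = [p for p in packets if p[3] == "S."]
    seen = set()
    retransmits = 0
    for _, src, dst, _, seq in syns:
        key = (src, dst, seq)
        if key in seen:
            retransmits += 1
        seen.add(key)
    latency = None
    if syns and synacks:
        latency = (synacks[0][0] - syns[0][0]).total_seconds()
    return retransmits, latency


if __name__ == "__main__":
    for name, capture in (("3.28.0", CALICO_3_28_0), ("3.27.3", CALICO_3_27_3)):
        retransmits, latency = summarize(parse(capture))
        print(f"Calico {name}: {retransmits} retransmitted SYN(s), "
              f"{latency:.6f}s from first SYN to SYN-ACK")
```

On these samples it reports one retransmitted SYN and about 1.01s to the SYN-ACK for 3.28.0, versus no retransmission and about 2.5ms for 3.27.3. The sketch assumes a single connection per capture; a capture with several flows would need grouping by source and destination first.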