Open lzhecheng opened 1 month ago
Here is what I found; it seems to be related to Calico:
Traffic goes from a node to a service whose endpoint is on another node (in a different zone).
I captured packets on the destination node with different Calico versions.
Calico 3.28.0:
06:32:43.407457 IP zhecheng-522-1-mp-0000000.internal.cloudapp.net.58595 > zhecheng-522-1-mp-0000001.internal.cloudapp.net.4789: VXLAN, flags [I] (0x08), vni 4096
IP 192.168.64.192.11362 > 192.168.198.91.9376: Flags [S], seq 3224522817, win 64240, options [mss 1460,sackOK,TS val 789304834 ecr 0,nop,wscale 7], length 0
06:32:44.417142 IP zhecheng-522-1-mp-0000000.internal.cloudapp.net.55170 > zhecheng-522-1-mp-0000001.internal.cloudapp.net.4789: VXLAN, flags [I] (0x08), vni 4096
IP 192.168.64.192.11362 > 192.168.198.91.9376: Flags [S], seq 3224522817, win 64240, options [mss 1460,sackOK,TS val 789305844 ecr 0,nop,wscale 7], length 0
06:32:44.417334 IP zhecheng-522-1-mp-0000001.internal.cloudapp.net.54585 > zhecheng-522-1-mp-0000000.internal.cloudapp.net.4789: VXLAN, flags [I] (0x08), vni 4096
IP 192.168.198.91.9376 > 192.168.64.192.11362: Flags [S.], seq 2917715798, ack 3224522818, win 64900, options [mss 1310,sackOK,TS val 2643650167 ecr 789305844,nop,wscale 7], length 0
The first SYN arrived at the destination node at 06:32:43.407457, but there was no response. About 1 second later, the retransmitted SYN (same seq 3224522817) arrived at 06:32:44.417142, and this time there was a response.
Somehow, the first packet is dropped. There is no such issue with Calico 3.27.3.
Calico 3.27.3:
03:04:31.057520 IP zhecheng-528-mp-0000000.internal.cloudapp.net.34057 > zhecheng-528-mp-0000001.internal.cloudapp.net.4789: VXLAN, flags [I] (0x08), vni 4096
IP zhecheng-528-mp-0000000.49981 > 192.168.121.208.9376: Flags [S], seq 1708607805, win 64240, options [mss 1460,sackOK,TS val 752390286 ecr 0,nop,wscale 7], length 0
03:04:31.059985 IP zhecheng-528-mp-0000001.internal.cloudapp.net.60462 > zhecheng-528-mp-0000000.internal.cloudapp.net.4789: VXLAN, flags [I] (0x08), vni 4096
IP 192.168.121.208.9376 > zhecheng-528-mp-0000000.49981: Flags [S.], seq 3450770987, ack 1708607806, win 64900, options [mss 1310,sackOK,TS val 446376357 ecr 752390286,nop,wscale 7], length 0
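To make the difference between the two captures explicit, here is a minimal Python sketch that pairs each outer VXLAN timestamp with the inner TCP flags and reports how many SYNs were seen and how long the first SYN waited for a SYN-ACK. The capture strings are trimmed copies of the lines above (hostnames and TCP options shortened), and the names `parse_events` and `handshake_summary` are hypothetical helpers, not part of any tool.

```python
import re
from datetime import datetime

# Trimmed copies of the captures above: hostnames and TCP options are shortened,
# timestamps, flags and sequence numbers are unchanged.
CAPTURE_3_28_0 = """\
06:32:43.407457 IP mp-0000000.58595 > mp-0000001.4789: VXLAN, flags [I] (0x08), vni 4096
IP 192.168.64.192.11362 > 192.168.198.91.9376: Flags [S], seq 3224522817
06:32:44.417142 IP mp-0000000.55170 > mp-0000001.4789: VXLAN, flags [I] (0x08), vni 4096
IP 192.168.64.192.11362 > 192.168.198.91.9376: Flags [S], seq 3224522817
06:32:44.417334 IP mp-0000001.54585 > mp-0000000.4789: VXLAN, flags [I] (0x08), vni 4096
IP 192.168.198.91.9376 > 192.168.64.192.11362: Flags [S.], seq 2917715798, ack 3224522818
"""

CAPTURE_3_27_3 = """\
03:04:31.057520 IP mp-0000000.34057 > mp-0000001.4789: VXLAN, flags [I] (0x08), vni 4096
IP mp-0000000.49981 > 192.168.121.208.9376: Flags [S], seq 1708607805
03:04:31.059985 IP mp-0000001.60462 > mp-0000000.4789: VXLAN, flags [I] (0x08), vni 4096
IP 192.168.121.208.9376 > mp-0000000.49981: Flags [S.], seq 3450770987, ack 1708607806
"""

# Outer line: timestamp followed by the VXLAN header (a leading "^C" is tolerated).
OUTER = re.compile(r"^(?:\^C)?(\d{2}:\d{2}:\d{2}\.\d{6}) IP .* VXLAN")
# Inner line: TCP flags of the encapsulated packet.
FLAGS = re.compile(r"Flags \[([^\]]+)\]")


def parse_events(capture):
    """Return a list of (timestamp, tcp_flags) for each encapsulated packet."""
    events = []
    ts = None
    for line in capture.splitlines():
        outer = OUTER.match(line)
        if outer:
            ts = datetime.strptime(outer.group(1), "%H:%M:%S.%f")
            continue
        flags = FLAGS.search(line)
        if flags and ts is not None:
            events.append((ts, flags.group(1)))
            ts = None
    return events


def handshake_summary(capture):
    """Count SYNs and measure the delay from the first SYN to the first SYN-ACK."""
    events = parse_events(capture)
    syns = [ts for ts, flags in events if flags == "S"]
    synacks = [ts for ts, flags in events if flags == "S."]
    return {
        "syn_count": len(syns),
        "first_syn_to_synack_s": (synacks[0] - syns[0]).total_seconds(),
    }


if __name__ == "__main__":
    print("Calico 3.28.0:", handshake_summary(CAPTURE_3_28_0))
    print("Calico 3.27.3:", handshake_summary(CAPTURE_3_27_3))
```

With the timestamps above, the 3.28.0 capture shows two SYNs and roughly 1.01 s from the first SYN to the SYN-ACK, while the 3.27.3 capture shows a single SYN answered in about 2.5 ms.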
Which jobs are failing:
cloud-provider-azure-conformance-multiple-zones-vmss-capz
Which test(s) are failing:
Services should be able to up and down services
Since when has it been failing:
2024.5.11
Testgrid link:
https://prow.k8s.io/view/gs/kubernetes-jenkins/logs/cloud-provider-azure-conformance-multiple-zones-vmss-capz/1793356241532096512
Reason for failure:
Anything else we need to know:
This test fails at a simple step: a hostNetwork=true Pod wgets a service with a command like this:
wget -q -O - -T 1 http://<svc-ip>:<port> 2>&1
When the service's endpoints are on a node in another zone, the 1-second wget timeout is not long enough: -T 1 fails but -T 2 works.
Since 5.11, the test has been failing without any obvious change anywhere.
I guess it may be related to Azure (multiple zones) or to the CNI.
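For anyone who wants to reproduce the timing outside the e2e test, here is a rough Python sketch that times a plain TCP connect under a deadline. It only approximates `wget -T`, which also covers DNS and read timeouts, and `time_connect` is a hypothetical helper; the host and port come from the command line. If the first SYN is dropped and the retry only arrives about 1 second later, as in the 3.28.0 capture, a 1-second deadline would be expected to expire just before the handshake completes, while a 2-second deadline would pass.

```python
import socket
import sys
import time


def time_connect(host, port, timeout_s):
    """Try a TCP connect with a deadline; return (succeeded, elapsed_seconds)."""
    start = time.monotonic()
    try:
        # create_connection raises OSError (including timeouts) on failure.
        with socket.create_connection((host, port), timeout=timeout_s):
            ok = True
    except OSError:
        ok = False
    return ok, time.monotonic() - start


# Usage (run from a hostNetwork Pod or the node): python3 probe.py <svc-ip> <port>
if __name__ == "__main__" and len(sys.argv) == 3:
    host, port = sys.argv[1], int(sys.argv[2])
    for timeout_s in (1, 2):
        ok, elapsed = time_connect(host, port, timeout_s)
        print(f"timeout={timeout_s}s ok={ok} elapsed={elapsed:.3f}s")
```

Running it several times against a service whose endpoint is in another zone should show whether the connect time clusters just above 1 second, which would point at a dropped first SYN rather than at general latency.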
cc @feiskyer