
LGW mode: Should ensure load balancer service works with 0 node ports when ETP=local #3490

Closed — tssurya closed this issue 1 year ago

tssurya commented 1 year ago
2023-03-15T20:07:19.5985431Z Mar 15 20:07:19.598: INFO: Waiting for amount of service:lbservice-test endpoints to be 4
2023-03-15T20:07:20.5985550Z Mar 15 20:07:20.598: INFO: Waiting for amount of service:lbservice-test endpoints to be 4
2023-03-15T20:07:21.5984753Z Mar 15 20:07:21.598: INFO: Waiting for amount of service:lbservice-test endpoints to be 4
2023-03-15T20:07:22.5994268Z Mar 15 20:07:22.598: INFO: Waiting for amount of service:lbservice-test endpoints to be 4
2023-03-15T20:07:27.8457471Z STEP: by sending a TCP packet to service lbservice-test with type=LoadBalancer in namespace default from backend pod lb-backend-pod
2023-03-15T20:07:28.3666996Z STEP: patching service lbservice-test to allocateLoadBalancerNodePorts=false and externalTrafficPolicy=local
2023-03-15T20:07:28.3753587Z Mar 15 20:07:28.374: INFO: Running '/usr/local/bin/kubectl --server=https://127.0.0.1:40685 --kubeconfig=/home/runner/ovn.conf --namespace=default get svc lbservice-test -o=jsonpath='{.spec.allocateLoadBalancerNodePorts}''
2023-03-15T20:07:28.5519822Z Mar 15 20:07:28.551: INFO: stderr: ""
2023-03-15T20:07:28.5521023Z Mar 15 20:07:28.551: INFO: stdout: "'false'"
2023-03-15T20:07:28.6231913Z Mar 15 20:07:28.622: INFO: Running '/usr/local/bin/kubectl --server=https://127.0.0.1:40685 --kubeconfig=/home/runner/ovn.conf --namespace=default get svc lbservice-test -o=jsonpath='{.spec.externalTrafficPolicy}''
2023-03-15T20:07:28.7691438Z Mar 15 20:07:28.768: INFO: stderr: ""
2023-03-15T20:07:28.7692670Z Mar 15 20:07:28.768: INFO: stdout: "'Local'"
2023-03-15T20:07:33.9597919Z STEP: by sending a TCP packet to service lbservice-test with type=LoadBalancer in namespace default from backend pod lb-backend-pod
2023-03-15T20:07:34.7236627Z STEP: Scale down endpoints of service: lbservice-test to ensure iptable rules are also getting recreated correctly
2023-03-15T20:07:34.7238638Z Mar 15 20:07:34.723: INFO: Running '/usr/local/bin/kubectl --server=https://127.0.0.1:40685 --kubeconfig=/home/runner/ovn.conf --namespace=default scale deployment lb-backend-pod --replicas=3'
2023-03-15T20:07:34.8249074Z Mar 15 20:07:34.823: INFO: stderr: ""
2023-03-15T20:07:34.8267554Z Mar 15 20:07:34.823: INFO: stdout: "deployment.apps/lb-backend-pod scaled\n"
2023-03-15T20:08:20.0748199Z Mar 15 20:08:20.073: FAIL: Couldn't fetch the correct number of iptable rules, err: timed out waiting for the condition
2023-03-15T20:08:20.0748668Z Unexpected error:
2023-03-15T20:08:20.0749034Z     <*errors.errorString | 0xc000204260>: {
2023-03-15T20:08:20.0749498Z         s: "timed out waiting for the condition",
2023-03-15T20:08:20.0750096Z     }
2023-03-15T20:08:20.0750403Z     timed out waiting for the condition
2023-03-15T20:08:20.0750675Z occurred
2023-03-15T20:08:20.0750816Z 
2023-03-15T20:08:20.0750926Z Full Stack Trace
2023-03-15T20:08:20.0751368Z github.com/ovn-org/ovn-kubernetes/test/e2e.glob..func26.4()
2023-03-15T20:08:20.0752011Z    /home/runner/work/ovn-kubernetes/ovn-kubernetes/test/e2e/service.go:831 +0xb68
2023-03-15T20:08:20.0752484Z github.com/onsi/ginkgo/internal/leafnodes.(*runner).runSync(0x300000002?)
2023-03-15T20:08:20.0753021Z    /home/runner/go/pkg/mod/github.com/onsi/ginkgo@v1.16.5/internal/leafnodes/runner.go:113 +0xb1
2023-03-15T20:08:20.0753604Z github.com/onsi/ginkgo/internal/leafnodes.(*runner).run(0xc000d495a0?)
2023-03-15T20:08:20.0754132Z    /home/runner/go/pkg/mod/github.com/onsi/ginkgo@v1.16.5/internal/leafnodes/runner.go:64 +0x125
2023-03-15T20:08:20.0754605Z github.com/onsi/ginkgo/internal/leafnodes.(*ItNode).Run(0x300000002?)
2023-03-15T20:08:20.0755122Z    /home/runner/go/pkg/mod/github.com/onsi/ginkgo@v1.16.5/internal/leafnodes/it_node.go:26 +0x7b
2023-03-15T20:08:20.0755634Z github.com/onsi/ginkgo/internal/spec.(*Spec).runSample(0xc00081aff0, 0xc000d49968?, {0x1ecf300, 0xc000106900})
2023-03-15T20:08:20.0756175Z    /home/runner/go/pkg/mod/github.com/onsi/ginkgo@v1.16.5/internal/spec/spec.go:215 +0x28a
2023-03-15T20:08:20.0756777Z github.com/onsi/ginkgo/internal/spec.(*Spec).Run(0xc00081aff0, {0x1ecf300, 0xc000106900})
2023-03-15T20:08:20.0757285Z    /home/runner/go/pkg/mod/github.com/onsi/ginkgo@v1.16.5/internal/spec/spec.go:138 +0xe7
2023-03-15T20:08:20.0757784Z github.com/onsi/ginkgo/internal/specrunner.(*SpecRunner).runSpec(0xc000712160, 0xc00081aff0)
2023-03-15T20:08:20.0758347Z    /home/runner/go/pkg/mod/github.com/onsi/ginkgo@v1.16.5/internal/specrunner/spec_runner.go:200 +0xe8
2023-03-15T20:08:20.0758868Z github.com/onsi/ginkgo/internal/specrunner.(*SpecRunner).runSpecs(0xc000712160)
2023-03-15T20:08:20.0759406Z    /home/runner/go/pkg/mod/github.com/onsi/ginkgo@v1.16.5/internal/specrunner/spec_runner.go:170 +0x1a5
2023-03-15T20:08:20.0759896Z github.com/onsi/ginkgo/internal/specrunner.(*SpecRunner).Run(0xc000712160)
2023-03-15T20:08:20.0760432Z    /home/runner/go/pkg/mod/github.com/onsi/ginkgo@v1.16.5/internal/specrunner/spec_runner.go:66 +0xc5
2023-03-15T20:08:20.0761132Z github.com/onsi/ginkgo/internal/suite.(*Suite).Run(0xc00018ea10, {0x7f3c34042098, 0xc0003df1e0}, {0x1c0e7e5, 0x9}, {0xc000707560, 0x2, 0x2}, {0x1ee8c70, 0xc000106900}, ...)
2023-03-15T20:08:20.0761894Z    /home/runner/go/pkg/mod/github.com/onsi/ginkgo@v1.16.5/internal/suite/suite.go:79 +0x4d2
2023-03-15T20:08:20.0762474Z github.com/onsi/ginkgo.runSpecsWithCustomReporters({0x1ed1b00?, 0xc0003df1e0}, {0x1c0e7e5, 0x9}, {0xc000707540, 0x2, 0x1c26fb5?})
2023-03-15T20:08:20.0763046Z    /home/runner/go/pkg/mod/github.com/onsi/ginkgo@v1.16.5/ginkgo_dsl.go:245 +0x189
2023-03-15T20:08:20.0763642Z github.com/onsi/ginkgo.RunSpecsWithDefaultAndCustomReporters({0x1ed1b00, 0xc0003df1e0}, {0x1c0e7e5, 0x9}, {0xc000448a70, 0x1, 0x1})
2023-03-15T20:08:20.0764224Z    /home/runner/go/pkg/mod/github.com/onsi/ginkgo@v1.16.5/ginkgo_dsl.go:228 +0x1b6
2023-03-15T20:08:20.0764753Z github.com/ovn-org/ovn-kubernetes/test/e2e.TestE2e(0x0?)
2023-03-15T20:08:20.0765346Z    /home/runner/work/ovn-kubernetes/ovn-kubernetes/test/e2e/e2e_suite_test.go:71 +0x305
2023-03-15T20:08:20.0765732Z testing.tRunner(0xc0003df1e0, 0x1d28508)
2023-03-15T20:08:20.0766156Z    /opt/hostedtoolcache/go/1.18.4/x64/src/testing/testing.go:1439 +0x102
2023-03-15T20:08:20.0766507Z created by testing.(*T).Run
2023-03-15T20:08:20.0766910Z    /opt/hostedtoolcache/go/1.18.4/x64/src/testing/testing.go:1486 +0x35f
tssurya commented 1 year ago

https://github.com/ovn-org/ovn-kubernetes/pull/3494#issuecomment-1478483284 https://github.com/ovn-org/ovn-kubernetes/pull/3329#issuecomment-1474301470

trozet commented 1 year ago

Seeing it here too: https://github.com/ovn-org/ovn-kubernetes/actions/runs/4467757061/jobs/7848231662?pr=3334

tssurya commented 1 year ago

Yeah, it seems our counting of iptables rules has changed somehow?

2023-03-20T13:24:48.9764792Z Mar 20 13:24:48.976: FAIL: Couldn't fetch the correct number of iptable rules, err: timed out waiting for the condition
2023-03-20T13:24:48.9765167Z Unexpected error:
2023-03-20T13:24:48.9765477Z     <*errors.errorString | 0xc000206240>: {
2023-03-20T13:24:48.9765857Z         s: "timed out waiting for the condition",
2023-03-20T13:24:48.9766108Z     }
2023-03-20T13:24:48.9766605Z     timed out waiting for the condition
2023-03-20T13:24:48.9766824Z occurred
2023-03-20T13:24:48.9766936Z 

We really haven't changed any code around this; the only changes going in are the ready/serving/terminating status updates for EndpointSlices. I will do some runs locally to see what the issue is, because we also count rules in unit tests, and those are passing.

tssurya commented 1 year ago

Investigated a failure: https://github.com/ovn-org/ovn-kubernetes/actions/runs/4536905491/jobs/7994555791?pr=3504 It always seems to happen when we scale the deployment down to 3 pods; perhaps we need to wait until the scale-down actually completes.

Failure:

2023-03-27T23:14:37.3984839Z STEP: by sending a TCP packet to service lbservice-test with type=LoadBalancer in namespace default from backend pod lb-backend-pod
2023-03-27T23:14:38.0023886Z STEP: patching service lbservice-test to allocateLoadBalancerNodePorts=false and externalTrafficPolicy=local
2023-03-27T23:14:38.0620772Z Mar 27 23:14:38.061: INFO: Running '/usr/local/bin/kubectl --server=https://127.0.0.1:42351 --kubeconfig=/home/runner/ovn.conf --namespace=default get svc lbservice-test -o=jsonpath='{.spec.allocateLoadBalancerNodePorts}''
2023-03-27T23:14:38.2689282Z Mar 27 23:14:38.268: INFO: stderr: ""
2023-03-27T23:14:38.2690311Z Mar 27 23:14:38.268: INFO: stdout: "'false'"
2023-03-27T23:14:38.3107608Z Mar 27 23:14:38.310: INFO: Running '/usr/local/bin/kubectl --server=https://127.0.0.1:42351 --kubeconfig=/home/runner/ovn.conf --namespace=default get svc lbservice-test -o=jsonpath='{.spec.externalTrafficPolicy}''
2023-03-27T23:14:38.4885314Z Mar 27 23:14:38.488: INFO: stderr: ""
2023-03-27T23:14:38.4886650Z Mar 27 23:14:38.488: INFO: stdout: "'Local'"
2023-03-27T23:14:43.6556192Z STEP: by sending a TCP packet to service lbservice-test with type=LoadBalancer in namespace default from backend pod lb-backend-pod
2023-03-27T23:14:44.1910980Z STEP: Scale down endpoints of service: lbservice-test to ensure iptable rules are also getting recreated correctly
2023-03-27T23:14:44.1911942Z Mar 27 23:14:44.190: INFO: Running '/usr/local/bin/kubectl --server=https://127.0.0.1:42351 --kubeconfig=/home/runner/ovn.conf --namespace=default scale deployment lb-backend-pod --replicas=3'
2023-03-27T23:14:44.2766891Z Mar 27 23:14:44.275: INFO: stderr: ""
2023-03-27T23:14:44.2768358Z Mar 27 23:14:44.275: INFO: stdout: "deployment.apps/lb-backend-pod scaled\n"
2023-03-27T23:15:29.5296422Z Mar 27 23:15:29.523: FAIL: Couldn't fetch the correct number of iptable rules, err: timed out waiting for the condition
2023-03-27T23:15:29.5299087Z Unexpected error:
2023-03-27T23:15:29.5299827Z     <*errors.errorString | 0xc00020c260>: {
2023-03-27T23:15:29.5300282Z         s: "timed out waiting for the condition",
2023-03-27T23:15:29.5533832Z     }
2023-03-27T23:15:29.5535597Z     timed out waiting for the condition
2023-03-27T23:15:29.5536959Z occurred
2023-03-27T23:15:29.5538259Z 
2023-03-27T23:15:29.5539528Z Full Stack Trace

NOTE that the test already waits 5 seconds before trying to fetch iptables rules. So we scale down at 23:14:44, try to curl at 23:14:49, then poll for 40 seconds for the right number of rules, and still time out?

The logs show the rules were already present by then! It takes only a couple of seconds to create them:

After the scale-down, the update event arrives here:

2023-03-27T23:14:44.942098994Z stderr F I0327 23:14:44.941735   29451 obj_retry.go:508] Update event received for resource *factory.endpointSliceForGateway, old object is equal to new: false
2023-03-27T23:14:44.942111194Z stderr F I0327 23:14:44.941778   29451 obj_retry.go:560] Update event received for *factory.endpointSliceForGateway default/lbservice-test-ndfvk
2023-03-27T23:14:44.942114894Z stderr F I0327 23:14:44.941807   29451 healthcheck.go:224] Reporting 3 endpoints for healthcheck "default/lbservice-test"
2023-03-27T23:14:44.942118194Z stderr F I0327 23:14:44.941828   29451 gateway_shared_intf.go:941] Updating endpointslice lbservice-test-ndfvk in namespace default

Then the iptables rule deletion happens at:

2023-03-27T23:14:44.942128194Z stderr F I0327 23:14:44.941920   29451 gateway_iptables.go:118] Deleting rule in table: nat, chain: OVN-KUBE-EXTERNALIP with args: "-p TCP -d 192.168.10.0 --dport 80 -j DNAT --to-destination 10.96.105.102:80" for protocol: 0 
2023-03-27T23:14:44.942630599Z stderr F I0327 23:14:44.942475   29451 ovs.go:200] Exec(180): /usr/bin/ovs-ofctl -O OpenFlow13 --bundle replace-flows breth0 -
2023-03-27T23:14:44.976985848Z stderr F I0327 23:14:44.976872   29451 gateway_iptables.go:118] Deleting rule in table: nat, chain: OVN-KUBE-ETP with args: "-p TCP -d 192.168.10.0 --dport 80 -j DNAT --to-destination 10.244.1.12:80 -m statistic --mode random --probability 1.0000000000" for protocol: 0 
2023-03-27T23:14:44.977797256Z stderr F I0327 23:14:44.977721   29451 informer.go:316] Successfully synced 'default/lb-backend-pod-66998d8b47-6x9q7'
2023-03-27T23:14:44.990194082Z stderr F I0327 23:14:44.990118   29451 gateway_iptables.go:118] Deleting rule in table: nat, chain: OVN-KUBE-SNAT-MGMTPORT with args: "-p TCP -d 10.244.1.12 --dport 80 -j RETURN" for protocol: 0 
2023-03-27T23:14:45.006222744Z stderr F I0327 23:14:45.006148   29451 gateway_iptables.go:118] Deleting rule in table: nat, chain: OVN-KUBE-ETP with args: "-p TCP -d 192.168.10.0 --dport 80 -j DNAT --to-destination 10.244.1.14:80 -m statistic --mode random --probability 0.5000000000" for protocol: 0 
2023-03-27T23:14:45.016933653Z stderr F I0327 23:14:45.016862   29451 gateway_iptables.go:118] Deleting rule in table: nat, chain: OVN-KUBE-SNAT-MGMTPORT with args: "-p TCP -d 10.244.1.14 --dport 80 -j RETURN" for protocol: 0 
2023-03-27T23:14:45.036947956Z stderr F I0327 23:14:45.036846   29451 gateway_iptables.go:118] Deleting rule in table: nat, chain: OVN-KUBE-ETP with args: "-p TCP -d 192.168.10.0 --dport 80 -j DNAT --to-destination 10.244.1.15:80 -m statistic --mode random --probability 0.3333333333" for protocol: 0 
2023-03-27T23:14:45.041991907Z stderr F I0327 23:14:45.041920   29451 gateway_iptables.go:118] Deleting rule in table: nat, chain: OVN-KUBE-SNAT-MGMTPORT with args: "-p TCP -d 10.244.1.15 --dport 80 -j RETURN" for protocol: 0 
2023-03-27T23:14:45.043897526Z stderr F I0327 23:14:45.042092   29451 ovs.go:203] Exec(180): stdout: ""
2023-03-27T23:14:45.043908727Z stderr F I0327 23:14:45.043849   29451 ovs.go:204] Exec(180): stderr: ""
2023-03-27T23:14:45.054871138Z stderr F I0327 23:14:45.054770   29451 gateway_iptables.go:118] Deleting rule in table: nat, chain: OVN-KUBE-ETP with args: "-p TCP -d 192.168.10.0 --dport 80 -j DNAT --to-destination 10.244.1.13:80 -m statistic --mode random --probability 0.2500000000" for protocol: 0 
2023-03-27T23:14:45.058603276Z stderr F I0327 23:14:45.058525   29451 gateway_iptables.go:118] Deleting rule in table: nat, chain: OVN-KUBE-SNAT-MGMTPORT with args: "-p TCP -d 10.244.1.13 --dport 80 -j RETURN" for protocol: 0 
2023-03-27T23:14:45.085705451Z stderr F I0327 23:14:45.085624   29451 gateway_iptables.go:118] Deleting rule in table: nat, chain: OVN-KUBE-EXTERNALIP with args: "-p TCP -d 192.168.10.0 --dport 80 -j DNAT --to-destination 10.96.105.102:80" for protocol: 0 

The old rules are removed, and new ones are added at:

2023-03-27T23:14:45.097069766Z stderr F I0327 23:14:45.097019   29451 gateway_iptables.go:71] Adding rule in table: nat, chain: OVN-KUBE-ETP with args: "-p TCP -d 192.168.10.0 --dport 80 -j DNAT --to-destination 10.244.1.14:80 -m statistic --mode random --probability 1.0000000000" for protocol: 0 
2023-03-27T23:14:45.097196567Z stderr F I0327 23:14:45.096896   29451 ovs.go:200] Exec(182): /usr/bin/ovs-ofctl -O OpenFlow13 --bundle replace-flows breth0 -
2023-03-27T23:14:45.100216798Z stderr F I0327 23:14:45.100125   29451 gateway_iptables.go:79] Chain: "OVN-KUBE-ETP" in table: "nat" already exists, skipping creation: running [/usr/sbin/iptables -t nat -N OVN-KUBE-ETP --wait]: exit status 1: iptables: Chain already exists.
2023-03-27T23:14:45.105511051Z stderr F I0327 23:14:45.105431   29451 gateway_iptables.go:71] Adding rule in table: nat, chain: OVN-KUBE-SNAT-MGMTPORT with args: "-p TCP -d 10.244.1.14 --dport 80 -j RETURN" for protocol: 0 
2023-03-27T23:14:45.108068877Z stderr F I0327 23:14:45.107984   29451 gateway_iptables.go:79] Chain: "OVN-KUBE-SNAT-MGMTPORT" in table: "nat" already exists, skipping creation: running [/usr/sbin/iptables -t nat -N OVN-KUBE-SNAT-MGMTPORT --wait]: exit status 1: iptables: Chain already exists.
2023-03-27T23:14:45.119371292Z stderr F I0327 23:14:45.119290   29451 gateway_iptables.go:71] Adding rule in table: nat, chain: OVN-KUBE-ETP with args: "-p TCP -d 192.168.10.0 --dport 80 -j DNAT --to-destination 10.244.1.15:80 -m statistic --mode random --probability 0.5000000000" for protocol: 0 
2023-03-27T23:14:45.124025639Z stderr F I0327 23:14:45.123941   29451 gateway_iptables.go:79] Chain: "OVN-KUBE-ETP" in table: "nat" already exists, skipping creation: running [/usr/sbin/iptables -t nat -N OVN-KUBE-ETP --wait]: exit status 1: iptables: Chain already exists.
2023-03-27T23:14:45.125774957Z stderr F I0327 23:14:45.125699   29451 ovs.go:203] Exec(182): stdout: ""
2023-03-27T23:14:45.125887758Z stderr F I0327 23:14:45.125835   29451 ovs.go:204] Exec(182): stderr: ""
2023-03-27T23:14:45.128729087Z stderr F I0327 23:14:45.128665   29451 gateway_iptables.go:71] Adding rule in table: nat, chain: OVN-KUBE-SNAT-MGMTPORT with args: "-p TCP -d 10.244.1.15 --dport 80 -j RETURN" for protocol: 0 
2023-03-27T23:14:45.131326313Z stderr F I0327 23:14:45.131227   29451 gateway_iptables.go:79] Chain: "OVN-KUBE-SNAT-MGMTPORT" in table: "nat" already exists, skipping creation: running [/usr/sbin/iptables -t nat -N OVN-KUBE-SNAT-MGMTPORT --wait]: exit status 1: iptables: Chain already exists.
2023-03-27T23:14:45.137787079Z stderr F I0327 23:14:45.137712   29451 gateway_iptables.go:71] Adding rule in table: nat, chain: OVN-KUBE-ETP with args: "-p TCP -d 192.168.10.0 --dport 80 -j DNAT --to-destination 10.244.1.12:80 -m statistic --mode random --probability 0.3333333333" for protocol: 0 
2023-03-27T23:14:45.140301304Z stderr F I0327 23:14:45.140199   29451 gateway_iptables.go:79] Chain: "OVN-KUBE-ETP" in table: "nat" already exists, skipping creation: running [/usr/sbin/iptables -t nat -N OVN-KUBE-ETP --wait]: exit status 1: iptables: Chain already exists.
2023-03-27T23:14:45.147173574Z stderr F I0327 23:14:45.147092   29451 gateway_iptables.go:71] Adding rule in table: nat, chain: OVN-KUBE-SNAT-MGMTPORT with args: "-p TCP -d 10.244.1.12 --dport 80 -j RETURN" for protocol: 0 
2023-03-27T23:14:45.149623599Z stderr F I0327 23:14:45.149541   29451 gateway_iptables.go:79] Chain: "OVN-KUBE-SNAT-MGMTPORT" in table: "nat" already exists, skipping creation: running [/usr/sbin/iptables -t nat -N OVN-KUBE-SNAT-MGMTPORT --wait]: exit status 1: iptables: Chain already exists.
2023-03-27T23:14:45.155957363Z stderr F I0327 23:14:45.155877   29451 gateway_iptables.go:71] Adding rule in table: nat, chain: OVN-KUBE-EXTERNALIP with args: "-p TCP -d 192.168.10.0 --dport 80 -j DNAT --to-destination 10.96.105.102:80" for protocol: 0 

The rules are already present at this point! I wonder if my iptables rule-counting logic changed somehow :/

tssurya commented 1 year ago

Investigated another failure: https://github.com/ovn-org/ovn-kubernetes/actions/runs/4513426760?pr=3430

Failure:

2023-03-24T19:14:48.9030228Z Mar 24 19:14:48.902: INFO: stdout: "'Local'"
2023-03-24T19:14:54.0441544Z STEP: by sending a TCP packet to service lbservice-test with type=LoadBalancer in namespace default from backend pod lb-backend-pod
2023-03-24T19:14:54.5158042Z STEP: Scale down endpoints of service: lbservice-test to ensure iptable rules are also getting recreated correctly
2023-03-24T19:14:54.5159298Z Mar 24 19:14:54.515: INFO: Running '/usr/local/bin/kubectl --server=https://127.0.0.1:37945 --kubeconfig=/home/runner/ovn.conf --namespace=default scale deployment lb-backend-pod --replicas=3'
2023-03-24T19:14:54.6186601Z Mar 24 19:14:54.618: INFO: stderr: ""
2023-03-24T19:14:54.6187721Z Mar 24 19:14:54.618: INFO: stdout: "deployment.apps/lb-backend-pod scaled\n"
2023-03-24T19:15:39.8338143Z Mar 24 19:15:39.832: FAIL: Couldn't fetch the correct number of iptable rules, err: timed out waiting for the condition
2023-03-24T19:15:39.8338942Z Unexpected error:
2023-03-24T19:15:39.8340609Z     <*errors.errorString | 0xc00018a280>: {
2023-03-24T19:15:39.8341198Z         s: "timed out waiting for the condition",
2023-03-24T19:15:39.8341574Z     }
2023-03-24T19:15:39.8419901Z     timed out waiting for the condition
2023-03-24T19:15:39.8420156Z occurred

NOTE that the test already waits 5 seconds before trying to fetch iptables rules. So we scale down at 19:14:54, try to curl at 19:14:59, then poll for 40 seconds for the right number of rules, and still time out?

The logs show the rules were already present by then! It takes only a couple of seconds to create them:

The update event after the scale-down:

2023-03-24T19:14:55.363025103Z stderr F I0324 19:14:55.362982   34855 obj_retry.go:560] Update event received for *factory.endpointSliceForGateway default/lbservice-test-khdfk
2023-03-24T19:14:55.363104303Z stderr F I0324 19:14:55.363064   34855 healthcheck.go:224] Reporting 3 endpoints for healthcheck "default/lbservice-test"
2023-03-24T19:14:55.365332698Z stderr F I0324 19:14:55.363143   34855 gateway_shared_intf.go:941] Updating endpointslice lbservice-test-khdfk in namespace default

After this:

2023-03-24T19:14:55.366097997Z stderr F I0324 19:14:55.366066   34855 gateway_iptables.go:118] Deleting rule in table: nat, chain: OVN-KUBE-EXTERNALIP with args: "-p TCP -d 192.168.10.0 --dport 80 -j DNAT --to-destination 10.96.209.170:80" for protocol: 0 
2023-03-24T19:14:55.366495496Z stderr F I0324 19:14:55.366443   34855 ovs.go:200] Exec(200): /usr/bin/ovs-ofctl -O OpenFlow13 --bundle replace-flows breth0 -
2023-03-24T19:14:55.381001768Z stderr F I0324 19:14:55.380014   34855 gateway_iptables.go:118] Deleting rule in table: nat, chain: OVN-KUBE-ETP with args: "-p TCP -d 192.168.10.0 --dport 80 -j DNAT --to-destination 10.244.2.17:80 -m statistic --mode random --probability 1.0000000000" for protocol: 0 
2023-03-24T19:14:55.405328322Z stderr F I0324 19:14:55.404592   34855 informer.go:316] Successfully synced 'default/lb-backend-pod-66998d8b47-2jkbh'
2023-03-24T19:14:55.408072617Z stderr F I0324 19:14:55.406680   34855 gateway_iptables.go:118] Deleting rule in table: nat, chain: OVN-KUBE-SNAT-MGMTPORT with args: "-p TCP -d 10.244.2.17 --dport 80 -j RETURN" for protocol: 0 
2023-03-24T19:14:55.438442859Z stderr F I0324 19:14:55.438327   34855 gateway_iptables.go:118] Deleting rule in table: nat, chain: OVN-KUBE-ETP with args: "-p TCP -d 192.168.10.0 --dport 80 -j DNAT --to-destination 10.244.2.16:80 -m statistic --mode random --probability 0.5000000000" for protocol: 0 
2023-03-24T19:14:55.450648636Z stderr F I0324 19:14:55.450560   34855 gateway_iptables.go:118] Deleting rule in table: nat, chain: OVN-KUBE-SNAT-MGMTPORT with args: "-p TCP -d 10.244.2.16 --dport 80 -j RETURN" for protocol: 0 
2023-03-24T19:14:55.455197127Z stderr F I0324 19:14:55.455111   34855 ovs.go:203] Exec(200): stdout: ""
2023-03-24T19:14:55.455287527Z stderr F I0324 19:14:55.455253   34855 ovs.go:204] Exec(200): stderr: ""
2023-03-24T19:14:55.455346927Z stderr F I0324 19:14:55.455320   34855 ovs.go:200] Exec(201): /usr/bin/ovs-ofctl -O OpenFlow13 --bundle replace-flows breth1 -
2023-03-24T19:14:55.460526217Z stderr F I0324 19:14:55.460449   34855 gateway_iptables.go:118] Deleting rule in table: nat, chain: OVN-KUBE-ETP with args: "-p TCP -d 192.168.10.0 --dport 80 -j DNAT --to-destination 10.244.2.19:80 -m statistic --mode random --probability 0.3333333333" for protocol: 0 
2023-03-24T19:14:55.467810903Z stderr F I0324 19:14:55.467725   34855 gateway_iptables.go:118] Deleting rule in table: nat, chain: OVN-KUBE-SNAT-MGMTPORT with args: "-p TCP -d 10.244.2.19 --dport 80 -j RETURN" for protocol: 0 
2023-03-24T19:14:55.477499784Z stderr F I0324 19:14:55.477402   34855 gateway_iptables.go:118] Deleting rule in table: nat, chain: OVN-KUBE-ETP with args: "-p TCP -d 192.168.10.0 --dport 80 -j DNAT --to-destination 10.244.2.18:80 -m statistic --mode random --probability 0.2500000000" for protocol: 0 
2023-03-24T19:14:55.482828374Z stderr F I0324 19:14:55.482703   34855 gateway_iptables.go:118] Deleting rule in table: nat, chain: OVN-KUBE-SNAT-MGMTPORT with args: "-p TCP -d 10.244.2.18 --dport 80 -j RETURN" for protocol: 0 
2023-03-24T19:14:55.483155974Z stderr F I0324 19:14:55.483087   34855 ovs.go:203] Exec(201): stdout: ""
2023-03-24T19:14:55.483233273Z stderr F I0324 19:14:55.483192   34855 ovs.go:204] Exec(201): stderr: ""
2023-03-24T19:14:55.48997956Z stderr F I0324 19:14:55.489895   34855 gateway_iptables.go:118] Deleting rule in table: nat, chain: OVN-KUBE-EXTERNALIP with args: "-p TCP -d 192.168.10.0 --dport 80 -j DNAT --to-destination 10.96.209.170:80" for protocol: 0 
2023-03-24T19:14:55.493634854Z stderr F I0324 19:14:55.493566   34855 ovs.go:200] Exec(202): /usr/bin/ovs-ofctl show breth0
2023-03-24T19:14:55.50085844Z stderr F I0324 19:14:55.500732   34855 ovs.go:203] Exec(202): stdout: "OFPT_FEATURES_REPLY (xid=0x2): dpid:00000242ac120003\nn_tables:254, n_buffers:0\ncapabilities: FLOW_STATS TABLE_STATS PORT_STATS QUEUE_STATS ARP_MATCH_IP\nactions: output enqueue set_vlan_vid set_vlan_pcp strip_vlan mod_dl_src mod_dl_dst mod_nw_src mod_nw_dst mod_nw_tos mod_tp_src mod_tp_dst\n 1(eth0): addr:02:42:ac:12:00:03\n     config:     0\n     state:      0\n     current:    10GB-FD COPPER\n     speed: 10000 Mbps now, 0 Mbps max\n 2(patch-breth0_ov): addr:fa:c6:86:4e:75:17\n     config:     0\n     state:      0\n     speed: 0 Mbps now, 0 Mbps max\n LOCAL(breth0): addr:02:42:ac:12:00:03\n     config:     0\n     state:      0\n     speed: 0 Mbps now, 0 Mbps max\nOFPT_GET_CONFIG_REPLY (xid=0x4): frags=normal miss_send_len=0\n"
2023-03-24T19:14:55.50094934Z stderr F I0324 19:14:55.500909   34855 ovs.go:204] Exec(202): stderr: ""
2023-03-24T19:14:55.501037839Z stderr F I0324 19:14:55.501005   34855 openflow_manager.go:47] Gateway OpenFlow sync requested
2023-03-24T19:14:55.501176339Z stderr F I0324 19:14:55.501138   34855 gateway_iptables.go:71] Adding rule in table: nat, chain: OVN-KUBE-ETP with args: "-p TCP -d 192.168.10.0 --dport 80 -j DNAT --to-destination 10.244.2.19:80 -m statistic --mode random --probability 1.0000000000" for protocol: 0 
2023-03-24T19:14:55.501414439Z stderr F I0324 19:14:55.501086   34855 ovs.go:200] Exec(203): /usr/bin/ovs-ofctl -O OpenFlow13 --bundle replace-flows breth0 -
2023-03-24T19:14:55.504522433Z stderr F I0324 19:14:55.504447   34855 gateway_iptables.go:79] Chain: "OVN-KUBE-ETP" in table: "nat" already exists, skipping creation: running [/usr/sbin/iptables -t nat -N OVN-KUBE-ETP --wait]: exit status 1: iptables: Chain already exists.
2023-03-24T19:14:55.512008618Z stderr F I0324 19:14:55.511937   34855 gateway_iptables.go:71] Adding rule in table: nat, chain: OVN-KUBE-SNAT-MGMTPORT with args: "-p TCP -d 10.244.2.19 --dport 80 -j RETURN" for protocol: 0 
2023-03-24T19:14:55.515067813Z stderr F I0324 19:14:55.514997   34855 gateway_iptables.go:79] Chain: "OVN-KUBE-SNAT-MGMTPORT" in table: "nat" already exists, skipping creation: running [/usr/sbin/iptables -t nat -N OVN-KUBE-SNAT-MGMTPORT --wait]: exit status 1: iptables: Chain already exists.
2023-03-24T19:14:55.522284799Z stderr F I0324 19:14:55.522200   34855 gateway_iptables.go:71] Adding rule in table: nat, chain: OVN-KUBE-ETP with args: "-p TCP -d 192.168.10.0 --dport 80 -j DNAT --to-destination 10.244.2.17:80 -m statistic --mode random --probability 0.5000000000" for protocol: 0 
2023-03-24T19:14:55.525406593Z stderr F I0324 19:14:55.525335   34855 gateway_iptables.go:79] Chain: "OVN-KUBE-ETP" in table: "nat" already exists, skipping creation: running [/usr/sbin/iptables -t nat -N OVN-KUBE-ETP --wait]: exit status 1: iptables: Chain already exists.
2023-03-24T19:14:55.539885565Z stderr F I0324 19:14:55.539768   34855 gateway_iptables.go:71] Adding rule in table: nat, chain: OVN-KUBE-SNAT-MGMTPORT with args: "-p TCP -d 10.244.2.17 --dport 80 -j RETURN" for protocol: 0 
2023-03-24T19:14:55.543391859Z stderr F I0324 19:14:55.543293   34855 gateway_iptables.go:79] Chain: "OVN-KUBE-SNAT-MGMTPORT" in table: "nat" already exists, skipping creation: running [/usr/sbin/iptables -t nat -N OVN-KUBE-SNAT-MGMTPORT --wait]: exit status 1: iptables: Chain already exists.
2023-03-24T19:14:55.546847452Z stderr F I0324 19:14:55.546764   34855 ovs.go:203] Exec(203): stdout: ""
2023-03-24T19:14:55.546926852Z stderr F I0324 19:14:55.546891   34855 ovs.go:204] Exec(203): stderr: ""
2023-03-24T19:14:55.547167951Z stderr F I0324 19:14:55.547124   34855 ovs.go:200] Exec(204): /usr/bin/ovs-ofctl -O OpenFlow13 --bundle replace-flows breth1 -
2023-03-24T19:14:55.554118938Z stderr F I0324 19:14:55.554044   34855 gateway_iptables.go:71] Adding rule in table: nat, chain: OVN-KUBE-ETP with args: "-p TCP -d 192.168.10.0 --dport 80 -j DNAT --to-destination 10.244.2.16:80 -m statistic --mode random --probability 0.3333333333" for protocol: 0 
2023-03-24T19:14:55.557199932Z stderr F I0324 19:14:55.557112   34855 gateway_iptables.go:79] Chain: "OVN-KUBE-ETP" in table: "nat" already exists, skipping creation: running [/usr/sbin/iptables -t nat -N OVN-KUBE-ETP --wait]: exit status 1: iptables: Chain already exists.
2023-03-24T19:14:55.565269317Z stderr F I0324 19:14:55.565183   34855 gateway_iptables.go:71] Adding rule in table: nat, chain: OVN-KUBE-SNAT-MGMTPORT with args: "-p TCP -d 10.244.2.16 --dport 80 -j RETURN" for protocol: 0 
2023-03-24T19:14:55.56877151Z stderr F I0324 19:14:55.568681   34855 gateway_iptables.go:79] Chain: "OVN-KUBE-SNAT-MGMTPORT" in table: "nat" already exists, skipping creation: running [/usr/sbin/iptables -t nat -N OVN-KUBE-SNAT-MGMTPORT --wait]: exit status 1: iptables: Chain already exists.
2023-03-24T19:14:55.573159702Z stderr F I0324 19:14:55.573076   34855 ovs.go:203] Exec(204): stdout: ""
2023-03-24T19:14:55.573231202Z stderr F I0324 19:14:55.573201   34855 ovs.go:204] Exec(204): stderr: ""
2023-03-24T19:14:55.575711497Z stderr F I0324 19:14:55.575625   34855 gateway_iptables.go:71] Adding rule in table: nat, chain: OVN-KUBE-EXTERNALIP with args: "-p TCP -d 192.168.10.0 --dport 80 -j DNAT --to-destination 10.96.209.170:80" for protocol: 0 

We are actually done within a second! So why did the test fail?

tssurya commented 1 year ago

Running this test locally always seems to pass.

tssurya commented 1 year ago

LOL, the unit tests are also failing in this area: https://github.com/ovn-org/ovn-kubernetes/actions/runs/4563583609/jobs/8052256701?pr=3512

tssurya commented 1 year ago

Something around how we process endpoints has definitely changed:

2023-03-30T11:09:22.0027083Z • Failure [0.130 seconds]
2023-03-30T11:09:22.0027339Z Node Operations
2023-03-30T11:09:22.0027866Z /home/runner/work/ovn-kubernetes/ovn-kubernetes/go-controller/pkg/node/gateway_localnet_linux_test.go:219
2023-03-30T11:09:22.0028191Z   on add
2023-03-30T11:09:22.0028753Z   /home/runner/work/ovn-kubernetes/ovn-kubernetes/go-controller/pkg/node/gateway_localnet_linux_test.go:377
2023-03-30T11:09:22.0030585Z     inits iptables rules and openflows with LoadBalancer where AllocateLoadBalancerNodePorts=False, ETP=local, LGW mode [It]
2023-03-30T11:09:22.0032168Z     /home/runner/work/ovn-kubernetes/ovn-kubernetes/go-controller/pkg/node/gateway_localnet_linux_test.go:837
2023-03-30T11:09:22.0032457Z 
2023-03-30T11:09:22.0032705Z     Expected success, but got an error:
2023-03-30T11:09:22.0033363Z         <*errors.errorString | 0xc0004bddd0>: {
2023-03-30T11:09:22.0035620Z             s: "expected rule \"-p TCP -d 10.244.0.3 --dport 8080 -j RETURN\" at pos 0 in chain nat/OVN-KUBE-SNAT-MGMTPORT, got \"-p TCP -d 10.244.0.4 --dport 8080 -j RETURN\"",
2023-03-30T11:09:22.0036059Z         }
2023-03-30T11:09:22.0037241Z         expected rule "-p TCP -d 10.244.0.3 --dport 8080 -j RETURN" at pos 0 in chain nat/OVN-KUBE-SNAT-MGMTPORT, got "-p TCP -d 10.244.0.4 --dport 8080 -j RETURN"
2023-03-30T11:09:22.0037546Z 
2023-03-30T11:09:22.0038299Z     /home/runner/work/ovn-kubernetes/ovn-kubernetes/go-controller/pkg/node/gateway_localnet_linux_test.go:957
tssurya commented 1 year ago

https://github.com/ovn-org/ovn-kubernetes/actions/runs/4565571320/jobs/8056873798?pr=3512