ovn-org / ovn-kubernetes

A robust Kubernetes networking platform
https://ovn-kubernetes.io/

flaky test: NetworkPolicy between server and client [It] should enforce egress policy allowing traffic to a server in a different namespace based on PodSelector and NamespaceSelector #1321

Open dave-tucker opened 4 years ago

dave-tucker commented 4 years ago

[Fail] [sig-network] NetworkPolicy [LinuxOnly] NetworkPolicy between server and client [It] should enforce egress policy allowing traffic to a server in a different namespace based on PodSelector and NamespaceSelector [Feature:NetworkPolicy]

Seen on:

alexanderConstantinescu commented 4 years ago

FYI:

The flake is due to the client-a pod (which tests connectivity to another namespace) not being created properly.

I ran a couple of test runs on my machine and attached journal logs from a successful and an unsuccessful run (covering all services from around the time client-a gets created until it stops getting updates).

When anybody gets a chance to dig into them, please have a look at what differs. From my initial analysis, the difference seems to be:

Run that's OK

May 07 10:10:17 ovn-control-plane2 containerd[28673]: time="2020-05-07T10:10:17.055172326Z" level=debug msg="Finish redirecting log file \"/var/log/pods/network-policy-6170_client-a-g98bk_e479afd4-a46e-4352-90fc-6a252fcc1bf5/client-a-container/0.log\", closing it"
May 07 10:10:17 ovn-control-plane2 containerd[28673]: time="2020-05-07T10:10:17.060118028Z" level=debug msg="event forwarded" ns=k8s.io topic=/tasks/exit type=containerd.events.TaskExit
May 07 10:10:17 ovn-control-plane2 containerd[28673]: time="2020-05-07T10:10:17.060115545Z" level=debug msg="Received exit event &TaskExit{ContainerID:805a58e102a63ea9f2de02c7149e12373c0558ff06b8127a9044a779a418e829,ID:805a58e102a63ea9f2de02c7149e12373c0558ff06b8127a9044a779a418e829,Pid:52206,ExitStatus:0,ExitedAt:2020-05-07 10:10:17.059549355 +0000 UTC,XXX_unrecognized:[],}"

Run that's NOT OK

May 07 09:36:31 ovn-control-plane2 containerd[28673]: time="2020-05-07T09:36:31.152802185Z" level=debug msg="Finish redirecting log file \"/var/log/pods/network-policy-2565_client-a-phfgt_a1404e3f-f0e4-4fa0-8be1-1a9f4ae7bc5f/client-a-container/0.log\", closing it"
May 07 09:36:31 ovn-control-plane2 containerd[28673]: time="2020-05-07T09:36:31.153834362Z" level=debug msg="event forwarded" ns=k8s.io topic=/tasks/exit type=containerd.events.TaskExit
May 07 09:36:31 ovn-control-plane2 containerd[28673]: time="2020-05-07T09:36:31.153846441Z" level=debug msg="Received exit event &TaskExit{ContainerID:20eb99499bba812d48db91f821eaf1323d4b49d6a51c4ab531ecba625016f039,ID:20eb99499bba812d48db91f821eaf1323d4b49d6a51c4ab531ecba625016f039,Pid:33185,ExitStatus:1,ExitedAt:2020-05-07 09:36:31.153708792 +0000 UTC,XXX_unrecognized:[],}"

I have not been able to figure out what affects this ExitStatus. Attachments: notworking.log, working.log

trozet commented 4 years ago

Thanks @alexanderConstantinescu for looking at it. I'll ping you offline and also take a look.

trozet commented 4 years ago

It looks like a bug in the test itself. I pushed a fix: https://github.com/kubernetes/kubernetes/pull/91025

We were probably "passing" the first connectivity test due to a race (it should have always failed):

  1. The test configures a policy that allows DNS.
  2. The test creates a client that connects to the server on TCP port 80 and expects this to succeed (it should not, since only DNS is allowed at this point).
  3. The test tries for a few seconds to see if it can connect; if it can, it proceeds, otherwise it fails (sometimes this passes, because there is a window of a few seconds before we get OVN and OVS configured with the policy ACL).
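
One way to see this race directly when reproducing is to check whether the policy ACLs have actually been written to the OVN NB database at the moment the client starts probing. A rough sketch, not a verified recipe: the namespace below is just an example, and the pod name for reaching ovn-nbctl is a placeholder that depends on the deployment.

# List the egress ACLs ovn-kubernetes created for the test namespace; if the allow
# ACL is missing while the client pod is already probing, the race is confirmed.
# <ovnkube-master-pod> is a placeholder; you may also need to point ovn-nbctl at the
# NB DB with --db=<nb-db-address>, depending on how the deployment is wired up.
kubectl -n ovn-kubernetes exec <ovnkube-master-pod> -- \
  ovn-nbctl --columns=priority,direction,match,action \
  find acl external_ids:namespace=network-policy-4956 external_ids:policy_type=Egress
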
trozet commented 4 years ago

So after fixing the test and running it again, we now genuinely fail the test, and at the right spot:

  1. Create client-a in ns A, with a policy that applies to client-a: egress, allow traffic matching namespace selector ns B and pod selector ns-b-server-a.
  2. Create the nsBserviceA and nsBserviceB services; create nsBpodServerA and nsBpodServerB.
  3. Test client-a -> nsb-service-a (now works).
  4. Test client-a -> nsb-service-b (works, but should fail).

Looking at our egress policy ACL:

_uuid               : 80e0357a-d4fe-4bce-9b8b-cf748629c681
action              : allow
direction           : to-lport
external_ids        : {Egress_num="0", ipblock_cidr="false", l4Match="tcp && tcp.dst==80", namespace=network-policy-4956, policy=allow-to-ns-b-server-a-via-namespace-selector, policy_type=Egress}
log                 : false
match               : "ip4 && tcp && tcp.dst==80 && inport == @a5571284678637270999"
meter               : []
name                : []
priority            : 1001
severity            : []

This ACL is not right; we should also have an L3 match there for the address set that includes the ns-b-server-a address. @dcbw @danwinship FYI

trozet commented 4 years ago

Never mind, I was reading the test wrong. There are actually 2 policies (one to allow DNS, and one to match on namespace and pod selectors). So the test is correct and so is our policy:

match : "ip4.dst == {$a4006516561677970445} && inport == @a7563830714843023450"

So I'm wondering if this is a race where the default ACL is applied but the allow ACL hasn't been added yet.
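
A related thing worth checking when this reproduces: the allow ACL matches on an address set (the $a... reference above), so even if both ACLs are present, traffic will still hit the default deny if the server pod's IP has not yet been added to that set. A rough sketch, with the address-set name copied from the match above purely as an example:

# Dump the address set referenced by the allow ACL's match and confirm the
# ns-b server pod's IP is actually in it (substitute the name from your own ACL).
ovn-nbctl list address_set a4006516561677970445

# Compare against the pod IPs Kubernetes reports for the ns B namespace
# (<ns-b-namespace> is a placeholder for whatever namespace the test created).
kubectl -n <ns-b-namespace> get pods -o wide
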

trozet commented 4 years ago

It looks like the ACL update has already gone through ovn-nbctl before the pod starts trying to connect:

I0505 15:48:59.282716 51 ovs.go:246] exec(1095): /usr/bin/ovn-nbctl --timeout=15 --id=@acl create acl priority=1001 direction=to-lport match="ip4.dst == {$a8633917019715949912} && inport == @a9177897342901698129" action=allow external-ids:l4Match="None" external-ids:ipblock_cidr=false external-ids:namespace=network-policy-6549 external-ids:policy=allow-to-ns-b-server-a-via-namespace-selector external-ids:Egress_num=1 external-ids:policy_type=Egress -- add port_group e690f88b-2088-4130-9da9-dcb25f7f6a1f acls @acl

I0505 15:48:59.373779 51 ovs.go:246] exec(1101): /usr/bin/ovn-nbctl --timeout=15 --if-exists remove port_group e690f88b-2088-4130-9da9-dcb25f7f6a1f ports c73c685a-e75c-4716-b608-cb65f762c14c -- add port_group e690f88b-2088-4130-9da9-dcb25f7f6a1f ports c73c685a-e75c-4716-b608-cb65f762c14c

2020-05-05T15:49:07.3007683Z May 5 15:49:07.300: INFO: Waiting up to 5m0s for pod "client-a-sph8m" in namespace "network-policy-6549" to be "success or failure"
2020-05-05T15:49:07.3036810Z May 5 15:49:07.303: INFO: Pod "client-a-sph8m": Phase="Failed", Reason="", readiness=false. Elapsed: 3.061593ms

I guess the OVS flows could still not be present, but even so, the client pod should retry a bunch of times:

fmt.Sprintf("for i in $(seq 1 5); do nc -vz -w 8 %s.%s %d && exit 0 || sleep 1; done; exit 1", targetService.Name, targetService.Namespace, targetPort),

So I'm not sure why the pod exits so early...
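
When this reproduces, one way to separate "the flows are slow or wrong" from "the pod dies for some other reason" would be to run the same probe loop by hand from a pod that the egress policy selects, and watch how long it takes. A rough sketch; the image, labels, service and namespace names are all placeholders, so adjust the label to whatever the policy's podSelector actually matches and the rest to what the test created:

# Launch a throwaway pod carrying the same label the policy selects on, running
# the exact probe loop from the test above, then read its logs and exit code.
kubectl -n <test-ns> run client-debug --image=busybox --restart=Never \
  --labels=<podSelector-key>=<podSelector-value> --command -- \
  sh -c 'for i in $(seq 1 5); do nc -vz -w 8 <svc-ns-b-server-a>.<ns-b> 80 && exit 0 || sleep 1; done; exit 1'
kubectl -n <test-ns> logs client-debug
kubectl -n <test-ns> get pod client-debug -o jsonpath='{.status.containerStatuses[0].state.terminated.exitCode}'
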

danwinship commented 4 years ago

The test code is weird: it calls e2epod.WaitForPodNoLongerRunningInNamespace and then calls e2epod.WaitForPodSuccessInNamespace, but since the pod is no longer running after the first call, the second call is guaranteed to complete immediately. So the "Elapsed: 3.061593ms" is just telling you how long it took to query the pod status. From the timestamps of the log messages, it looks like the pod runs for about 10 seconds, suggesting each of the 5 attempts fails after 1 second (and is then followed by a sleep 1).

trozet commented 4 years ago

Thanks @danwinship. So if that is the case, and we know we configured OVN via ovn-nbctl in time, then either the flows in OVS are not updated within 10 seconds (hard to believe) or there is a bug with the flows getting installed. I guess we will need to reproduce the failure and get a debuggable setup so we can look at the OVS flows. I'll try a few times to recreate the policy and see if I can trigger the failure, but I haven't been able to reproduce it locally so far.
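
For reference, once we have a reproducer, the flows could be inspected directly on the kind node that hosts client-a (or in the OVS pod running on that node, depending on how the kind deployment packages OVS). Something along these lines; the node name is just the one from the logs above, and the grep patterns are only starting points:

# Dump the OpenFlow tables on br-int on the node hosting client-a, looking for
# flows that match TCP destination port 80 (the policy's L4 match).
docker exec ovn-control-plane2 ovs-ofctl -O OpenFlow13 dump-flows br-int | grep tp_dst=80

# OVN often compiles address-set / port-group matches into conjunction flows,
# so grepping for 'conjunction' can also help spot the policy flows.
docker exec ovn-control-plane2 ovs-ofctl -O OpenFlow13 dump-flows br-int | grep conjunction
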

alexanderConstantinescu commented 4 years ago

Another word on this:

I think @trozet is right, which is why I have a hard time understanding how ovn-kubernetes is responsible for this (if anything I am inclined to blame kind).

If you look at this run:

https://github.com/ovn-org/ovn-kubernetes/pull/1306/checks?check_run_id=642043097

This is the bit that is interesting from that run:

STEP: Creating client pod client-a that should successfully connect to svc-ns-b-server-a.
May  4 07:57:42.076: INFO: Waiting for client-a-qqz7l to complete.
May  4 07:57:50.085: INFO: Waiting for client-a-qqz7l to complete.
May  4 07:57:50.085: INFO: Waiting up to 5m0s for pod "client-a-qqz7l" in namespace "network-policy-9991" to be "success or failure"
May  4 07:57:50.089: INFO: Pod "client-a-qqz7l": Phase="Failed", Reason="", readiness=false. Elapsed: 4.418846ms
May  4 07:57:50.099: FAIL: Error getting container logs: the server rejected our request for an unknown reason (get pods client-a-qqz7l)

Full Stack Trace
k8s.io/kubernetes/test/e2e/network.checkConnectivity(0xc000d65540, 0xc0006b0dc0, 0xc00123ac00, 0xc00030ed80)
    /home/runner/go/src/k8s.io/kubernetes/_output/local/go/src/k8s.io/kubernetes/test/e2e/network/network_policy.go:1420 +0x415

May 4 07:57:42.076: INFO: Waiting for client-a-qqz7l to complete. is logged here: https://github.com/kubernetes/kubernetes/blob/v1.17.2/test/e2e/network/network_policy.go#L1410

May 4 07:57:50.085: INFO: Waiting for client-a-qqz7l to complete. is then logged here: https://github.com/kubernetes/kubernetes/blob/v1.17.2/test/e2e/network/network_policy.go#L1414

May 4 07:57:50.085: INFO: Waiting up to 5m0s for pod "client-a-qqz7l" in namespace "network-policy-9991" to be "success or failure" is then logged here: https://github.com/kubernetes/kubernetes/blob/v1.17.2/test/e2e/framework/pod/wait.go#L213

The pod then fails instantaneously (for some unknown reason, as we can see from May 4 07:57:50.089: INFO: Pod "client-a-qqz7l": Phase="Failed", Reason="", readiness=false. Elapsed: 4.418846ms) here: https://github.com/kubernetes/kubernetes/blob/v1.17.2/test/e2e/framework/pod/wait.go#L267 and thus an error is returned here: https://github.com/kubernetes/kubernetes/blob/v1.17.2/test/e2e/framework/pod/wait.go#L231

That error then propagates to here: https://github.com/kubernetes/kubernetes/blob/v1.17.2/test/e2e/network/network_policy.go#L1418 which then fails the test here, because its request to fetch the container logs is rejected: https://github.com/kubernetes/kubernetes/blob/v1.17.2/test/e2e/network/network_policy.go#L1420

I just don't see how this can be dependent on OVS / OVN flows. If anything, it's kind that has a bug somewhere for certain e2e tests.

The connectivity test depends on the pod not reaching the v1.PodFailed status, which it does, and so it stops retrying the connectivity. And as Tim pointed out, ovn-kube even logs the ovn-nbctl transaction before the client-a pod is up, which makes a stronger argument that ovn-kubernetes is not the cause of this.

danwinship commented 4 years ago

The pod then fails instantaneously

It's not failing instantaneously; this is what I was saying above about the test code being weird. It logged "Waiting for client-a-qqz7l to complete." at 07:57:42, and then waited for it to complete. The next message was at 07:57:50, 8 seconds later, after the pod completed. The code then waits for it to complete again, but since the pod has already completed, the second "wait" succeeds immediately.

(I'm guessing that the two different Wait functions it uses have slightly different semantics in some way that makes it useful to use the former to actually wait for the pod to complete, and then use the latter solely for the side effect of returning a useful error message if the pod failed. EDIT: or it's just a goofy cut+paste error that no one ever fixed.)

which then Fails the test here because its request to fetch the container logs is rejected

Yes, that's another bug in the test case, which dcbw already fixed upstream. (Maybe we should fix the kind environment to use a newer k8s build?) At any rate, we can mostly infer what the missing logs would say based on the other information we've already figured out.

The connectivity test is dependent upon the fact that the pod does not see the status v1.PodFailed, which it does and thus stops retrying the connectivity.

The pod is failing after 8 seconds, which is consistent with it making the 5 attempts, each 1 second apart, and then exiting with an error because it failed to connect.

So I think, either it's taking ovn-kubernetes longer than 8 seconds to set up the right flows, or else the flows it's setting up are not correct.

danwinship commented 4 years ago

So I think, either it's taking ovn-kubernetes longer than 8 seconds to set up the right flows, or else the flows it's setting up are not correct.

The test also requires that ovn-kube sets up correct flows both for the client-to-server HTTP connection and for client-to-CoreDNS, so maybe we're screwing up the latter?
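
A quick way to isolate the DNS leg would be something like the sketch below: run a resolver check from a pod that the egress policy applies to. If name resolution itself times out, the DNS allowance (rather than the port-80 allow) is what's broken. All names and labels here are placeholders for whatever the failing test run actually created:

# Resolve the target service name from a pod carrying the client-a labels,
# so the same egress policy (including its DNS allowance) applies to it.
kubectl -n <test-ns> run dns-check --image=busybox --restart=Never \
  --labels=<podSelector-key>=<podSelector-value> --command -- \
  nslookup <svc-ns-b-server-a>.<ns-b>
kubectl -n <test-ns> logs dns-check
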

trozet commented 4 years ago

@danwinship @alexanderConstantinescu is going to try modifying the test to return a special exit code (rather than just 1). That way, when he sees it fail, we will know whether the pod failed because the connectivity check itself failed (our problem) or whether containerd or something else caused it to exit.
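
Concretely, the change could be as small as swapping the final exit code in the probe loop quoted earlier, so an "all attempts failed" result is distinguishable from any other non-zero exit the runtime might report (42 below is arbitrary, purely an example value):

# Same loop the test uses today, but exiting with a distinctive code when all
# five connection attempts fail, instead of a generic 1.
for i in $(seq 1 5); do nc -vz -w 8 <service>.<namespace> 80 && exit 0 || sleep 1; done; exit 42
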

danwinship commented 4 years ago

if you're going to modify k8s, just get the fix for the fetching container logs bug (https://github.com/kubernetes/kubernetes/pull/89592). Then you should get the output of the client pod so you can see how far it got (and what specific curl error messages it got).

github-actions[bot] commented 21 hours ago

This issue is stale because it has been open 60 days with no activity. Remove stale label or comment or this will be closed in 5 days.