submariner-io / lighthouse

DNS service discovery across connected Kubernetes clusters.
https://submariner-io.github.io/architecture/service-discovery/
Apache License 2.0

Helm jobs broke on last commit #504

Closed: dfarrell07 closed this issue 3 years ago

dfarrell07 commented 3 years ago

What happened:

It seems like the most recently merged PR broke the Lighthouse+Helm jobs.

In the flake finder, the jobs from 5 days ago and earlier were all passing:

https://github.com/submariner-io/lighthouse/actions/workflows/flake_finder.yml

https://github.com/submariner-io/lighthouse/actions/runs/731556462

The jobs from 4 days ago onward are all failing in the same way:

https://github.com/submariner-io/lighthouse/actions/runs/734764600

In the PR-triggered E2E, the PR before the one in question passed:

https://github.com/submariner-io/lighthouse/pull/501

The PR-triggered E2E on the PR in question failed, but the PR was merged:

https://github.com/submariner-io/lighthouse/pull/502

Reverting the PR fixes the Helm jobs:

https://github.com/dfarrell07/lighthouse/pull/1

Compare that to PRs against the same base, run at about the same time, where the Helm jobs fail:

https://github.com/submariner-io/lighthouse/pull/503

There were no PRs merged to the Helm repo in the relevant timeframe.

From what I can see in the logs, the nginx connectivity tests pass and the first E2E test fails.

2021-04-09T10:07:13.5731683Z [e2e]$ go test -v -timeout 30m -args -ginkgo.v -ginkgo.randomizeAllSpecs -ginkgo.trace -submariner-namespace submariner-operator -dp-context cluster1 -dp-context cluster2 -dp-context cluster3 -ginkgo.reportPassed -test.timeout 15m -ginkgo.reportFile /go/src/github.com/submariner-io/lighthouse/output/e2e-junit.xml
2021-04-09T10:07:13.5746162Z [e2e]$ tee /go/src/github.com/submariner-io/lighthouse/output/e2e-tests.log
2021-04-09T10:07:13.5769569Z [e2e]$ generate_context_flags
2021-04-09T10:07:13.5783287Z [e2e]$ generate_context_flags
2021-04-09T10:07:13.5795957Z [e2e]$ [cluster1] printf  -dp-context cluster1
2021-04-09T10:07:13.5807342Z [e2e]$ [cluster2] printf  -dp-context cluster2
2021-04-09T10:07:13.5818055Z [e2e]$ [cluster3] printf  -dp-context cluster3
2021-04-09T10:08:31.3962901Z === RUN   TestE2E
2021-04-09T10:08:31.4031540Z Running Suite: Submariner E2E suite
2021-04-09T10:08:31.4037156Z ===================================
2021-04-09T10:08:31.4039199Z Random Seed: 1617962911 - Will randomize all specs
2021-04-09T10:08:31.4040041Z Will run 15 of 15 specs
2021-04-09T10:08:31.4040353Z 
2021-04-09T10:08:31.4061802Z STEP: Creating kubernetes clients
2021-04-09T10:08:31.4745593Z STEP: Creating lighthouse clients
2021-04-09T10:08:31.4938688Z [discovery] Test Service Discovery Across Clusters when a pod tries to resolve a service in a specific remote cluster by its cluster name 
2021-04-09T10:08:31.4940034Z   should resolve the service on the specified cluster
2021-04-09T10:08:31.4941170Z   /go/src/github.com/submariner-io/lighthouse/test/e2e/discovery/service_discovery.go:75
2021-04-09T10:08:31.4942264Z STEP: Creating namespace objects with basename "discovery"
2021-04-09T10:08:31.5035065Z STEP: Generated namespace "e2e-tests-discovery-splzk" in cluster "cluster1" to execute the tests in
2021-04-09T10:08:31.5036467Z STEP: Creating namespace "e2e-tests-discovery-splzk" in cluster "cluster2"
2021-04-09T10:08:31.5276311Z STEP: Creating namespace "e2e-tests-discovery-splzk" in cluster "cluster3"
2021-04-09T10:08:31.6137826Z STEP: Creating an Nginx Deployment on "cluster1"
2021-04-09T10:08:36.7363539Z STEP: Creating a Nginx Service on "cluster1"
2021-04-09T10:08:36.7701456Z STEP: Creating serviceExport nginx-demo.e2e-tests-discovery-splzk on "cluster1"
2021-04-09T10:08:36.8030588Z STEP: Creating an Nginx Deployment on "cluster2"
2021-04-09T10:08:41.8156114Z STEP: Creating a Nginx Service on "cluster2"
2021-04-09T10:08:41.8281696Z STEP: Creating serviceExport nginx-demo.e2e-tests-discovery-splzk on "cluster2"
2021-04-09T10:08:41.8841811Z STEP: Retrieving ServiceExport nginx-demo.e2e-tests-discovery-splzk on "cluster2"
2021-04-09T10:11:51.8995875Z STEP: Deleting namespace "e2e-tests-discovery-splzk" on cluster "cluster1"
2021-04-09T10:11:51.9242669Z STEP: Deleting namespace "e2e-tests-discovery-splzk" on cluster "cluster2"
2021-04-09T10:11:51.9307508Z STEP: Deleting namespace "e2e-tests-discovery-splzk" on cluster "cluster3"
2021-04-09T10:11:51.9530563Z STEP: Retrieving EndpointSlices for "" in ns "e2e-tests-discovery-splzk" on "cluster2"
2021-04-09T10:11:51.9589337Z STEP: Retrieving EndpointSlices for "" in ns "e2e-tests-discovery-splzk" on "cluster1"
2021-04-09T10:11:51.9733184Z 
2021-04-09T10:11:51.9769178Z • Failure [200.479 seconds]
2021-04-09T10:11:51.9769861Z [discovery] Test Service Discovery Across Clusters
2021-04-09T10:11:51.9771580Z /go/src/github.com/submariner-io/lighthouse/test/e2e/discovery/service_discovery.go:40
2021-04-09T10:11:51.9772710Z   when a pod tries to resolve a service in a specific remote cluster by its cluster name
2021-04-09T10:11:51.9773922Z   /go/src/github.com/submariner-io/lighthouse/test/e2e/discovery/service_discovery.go:74
2021-04-09T10:11:51.9775008Z     should resolve the service on the specified cluster [It]
2021-04-09T10:11:51.9776101Z     /go/src/github.com/submariner-io/lighthouse/test/e2e/discovery/service_discovery.go:75
2021-04-09T10:11:51.9776691Z 
2021-04-09T10:11:51.9777515Z     Failed to retrieve ServiceExport. No ServiceExportConditions
2021-04-09T10:11:51.9778253Z     Unexpected error:
2021-04-09T10:11:51.9778825Z         <*errors.errorString | 0xc00039c0f0>: {
2021-04-09T10:11:51.9779434Z             s: "timed out waiting for the condition",
2021-04-09T10:11:51.9779855Z         }
2021-04-09T10:11:51.9780281Z         timed out waiting for the condition
2021-04-09T10:11:51.9780889Z     occurred
2021-04-09T10:11:51.9781148Z 
2021-04-09T10:11:51.9783221Z     /go/src/github.com/submariner-io/lighthouse/vendor/github.com/submariner-io/shipyard/test/e2e/framework/framework.go:488
2021-04-09T10:11:51.9783986Z 
2021-04-09T10:11:51.9784612Z     Full Stack Trace
2021-04-09T10:11:51.9785966Z     github.com/submariner-io/shipyard/test/e2e/framework.AwaitUntil(0x1553d7c, 0x16, 0xc000521098, 0x15e3408, 0x0, 0xc00069e370)
2021-04-09T10:11:51.9788058Z        /go/src/github.com/submariner-io/lighthouse/vendor/github.com/submariner-io/shipyard/test/e2e/framework/framework.go:488 +0x1c6
2021-04-09T10:11:51.9789970Z     github.com/submariner-io/lighthouse/test/e2e/framework.(*Framework).AwaitServiceExportedStatusCondition(0xc00011edc8, 0x1, 0xc0006a0740, 0xa, 0xc000695800, 0x19)
2021-04-09T10:11:51.9791806Z        /go/src/github.com/submariner-io/lighthouse/test/e2e/framework/framework.go:128 +0x25e
2021-04-09T10:11:51.9793638Z     github.com/submariner-io/lighthouse/test/e2e/discovery.RunServiceDiscoveryClusterNameTest(0xc00011edc8)
2021-04-09T10:11:51.9795501Z        /go/src/github.com/submariner-io/lighthouse/test/e2e/discovery/service_discovery.go:371 +0x490
2021-04-09T10:11:51.9796763Z     github.com/submariner-io/lighthouse/test/e2e/discovery.glob..func2.6.1()
2021-04-09T10:11:51.9798019Z        /go/src/github.com/submariner-io/lighthouse/test/e2e/discovery/service_discovery.go:76 +0x2a
2021-04-09T10:11:51.9799223Z     github.com/submariner-io/shipyard/test/e2e.RunE2ETests(0xc000347980, 0xc8d328797b)
2021-04-09T10:11:51.9800493Z        /go/src/github.com/submariner-io/lighthouse/vendor/github.com/submariner-io/shipyard/test/e2e/e2e.go:92 +0x125
2021-04-09T10:11:51.9801884Z     github.com/submariner-io/lighthouse/test/e2e.TestE2E(0xc000347980)
2021-04-09T10:11:51.9802976Z        /go/src/github.com/submariner-io/lighthouse/test/e2e/e2e_test.go:26 +0x2b
2021-04-09T10:11:51.9803709Z     testing.tRunner(0xc000347980, 0x15e33e0)
2021-04-09T10:11:51.9804295Z        /usr/lib/golang/src/testing/testing.go:1123 +0xef
2021-04-09T10:11:51.9804827Z     created by testing.(*T).Run
2021-04-09T10:11:51.9805369Z        /usr/lib/golang/src/testing/testing.go:1168 +0x2b3

https://pastebin.com/5Tbf5Xb9
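
For context, the "timed out waiting for the condition" error above is the test framework giving up after polling the exported Service's status for roughly three minutes without ever seeing any ServiceExport conditions. A rough manual equivalent of what the spec is waiting on (the resource name, namespace, and context are taken from the log above; the jsonpath is illustrative and assumes the MCS ServiceExport CRD is installed):

$ kubectl --context cluster2 -n e2e-tests-discovery-splzk \
    get serviceexport nginx-demo -o jsonpath='{.status.conditions}'
# The spec reports "No ServiceExportConditions" because this stays empty for
# the whole wait, so the framework's AwaitUntil eventually times out.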

Environment:

Lighthouse CI

dfarrell07 commented 3 years ago

It seems this is actually exposing a deeper issue: without this PR, the Lighthouse jobs run with subctl, not Helm.

tpantelis commented 3 years ago

The problem is that the Helm jobs aren't deploying the LH components. Looking at the helm install command executed by the jobs:

[lighthouse]$ [cluster2] helm --kube-context cluster2 install submariner-operator submariner-latest/submariner-operator \
    --create-namespace --namespace submariner-operator ... \
    --set broker.globalnet=false \
    --set submariner.serviceDiscovery=false \
    --set submariner.cableDriver=libreswan \
    --set submariner.clusterId=cluster2 \
    --set submariner.clusterCidr=10.2.0.0/16 \
    --set submariner.serviceCidr=100.2.0.0/16 \
    --set submariner.globalCidr= \
    --set serviceAccounts.globalnet.create=false \
    --set serviceAccounts.lighthouseAgent.create=false \
    --set serviceAccounts.lighthouseCoreDns.create=false ... \
    --set submariner.serviceDiscovery=true,lighthouse.image.repository=localhost:5000/lighthouse-agent,lighthouse.image.tag=local,lighthouseCoredns.image.repository=localhost:5000/lighthouse-coredns,lighthouseCoredns.image.tag=local,serviceAccounts.lighthouse.create=true

we see that submariner.serviceDiscovery is first set to false and then to true (with repeated --set flags, Helm uses the last value, so service discovery itself does end up enabled). However, the LH service account create flags are left at false (serviceAccounts.lighthouse.create is set to true, but that key is invalid). The root cause is that the deploy_helm lib in shipyard uses ${service_discovery}, parsed from the command line, to set these params, but the LH Makefile doesn't pass it. Instead it sets submariner.serviceDiscovery=true via --deploytool_submariner_args, which doesn't set the correct serviceAccounts.* flags. The Makefile should pass --service_discovery to the shipyard script.
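
In other words, the deploy step should opt in to service discovery at the shipyard level rather than only appending Helm args. A rough sketch of the difference (the script path and --deploytool flag here are assumptions for illustration, not the actual Makefile/shipyard contents; --service_discovery and --deploytool_submariner_args are the flags actually being discussed):

# Current behaviour (sketch): service discovery is only enabled via extra Helm
# args, so deploy_helm never sees ${service_discovery} and leaves the
# lighthouse serviceAccounts.*.create flags at false.
./scripts/deploy.sh --deploytool helm \
    --deploytool_submariner_args '--set submariner.serviceDiscovery=true,...'

# Suggested fix (sketch): pass --service_discovery so deploy_helm sets both
# submariner.serviceDiscovery=true and the lighthouse serviceAccounts.*.create
# flags itself.
./scripts/deploy.sh --deploytool helm --service_discovery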