Closed dfarrell07 closed 3 years ago
It seems this is exposing a deeper issue: without this PR, the Lighthouse jobs actually run with subctl, not Helm.
The problem is that the Helm jobs aren't deploying the LH components. Looking at the helm install command executed by the jobs:
[lighthouse]$ [cluster2] helm --kube-context cluster2 install submariner-operator submariner-latest/submariner-operator --create-namespace --namespace submariner-operator ... -set broker.globalnet=false --set submariner.serviceDiscovery=false --set submariner.cableDriver=libreswan --set submariner.clusterId=cluster2 --set submariner.clusterCidr=10.2.0.0/16 --set submariner.serviceCidr=100.2.0.0/16 --set submariner.globalCidr= --set serviceAccounts.globalnet.create=false --set serviceAccounts.lighthouseAgent.create=false --set serviceAccounts.lighthouseCoreDns.create=false ... --set submariner.serviceDiscovery=true,lighthouse.image.repository=localhost:5000/lighthouse-agent,lighthouse.image.tag=local,lighthouseCoredns.image.repository=localhost:5000/lighthouse-coredns,lighthouseCoredns.image.tag=local,serviceAccounts.lighthouse.create=true
we see that submariner.serviceDiscovery is first set to false and then to true. Also, the LH service account create flags are set to false (serviceAccounts.lighthouse.create is set to true, but that key is invalid). The problem is that the deploy_helm lib in shipyard uses ${service_discovery}, parsed from the command line, to set these params, but the LH Makefile doesn't pass it. Instead, the Makefile sets submariner.serviceDiscovery=true via --deploytool_submariner_args, but it doesn't set the correct serviceAccounts.* flags. The Makefile should pass --service_discovery to the shipyard script.
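To make the conflict in the command above easier to spot, here is a minimal Python sketch that collects every `--set` assignment from a helm command line and reports keys that receive more than one distinct value. It is not part of shipyard or Lighthouse; the function names and the abbreviated `SAMPLE_COMMAND` are hypothetical and only reuse flags quoted in this issue. It assumes simple `key=value` pairs separated by commas and ignores Helm's escaped commas and `{a,b}` list syntax. Helm gives priority to the right-most `--set`, which `effective_values` mirrors.

```python
import shlex
from collections import defaultdict

# Abbreviated from the helm command quoted in this issue (illustration only).
SAMPLE_COMMAND = (
    "helm --kube-context cluster2 install submariner-operator "
    "submariner-latest/submariner-operator "
    "--set submariner.serviceDiscovery=false "
    "--set serviceAccounts.lighthouseAgent.create=false "
    "--set serviceAccounts.lighthouseCoreDns.create=false "
    "--set submariner.serviceDiscovery=true,serviceAccounts.lighthouse.create=true"
)


def parse_set_flags(command):
    """Return {key: [values in command-line order]} for every --set flag."""
    tokens = shlex.split(command)
    assignments = defaultdict(list)
    for i, token in enumerate(tokens):
        if token == "--set" and i + 1 < len(tokens):
            # One --set argument may carry several comma-separated pairs.
            # Simplification: escaped commas and {a,b} lists are not handled.
            for pair in tokens[i + 1].split(","):
                key, sep, value = pair.partition("=")
                if sep:
                    assignments[key].append(value)
    return dict(assignments)


def conflicting_keys(assignments):
    """Keys that were assigned more than one distinct value."""
    return {k: v for k, v in assignments.items() if len(set(v)) > 1}


def effective_values(assignments):
    """The value each key ends up with: the last assignment wins."""
    return {k: v[-1] for k, v in assignments.items()}


if __name__ == "__main__":
    parsed = parse_set_flags(SAMPLE_COMMAND)
    print("conflicts:", conflicting_keys(parsed))
    print("effective:", effective_values(parsed))
```

On the sample, service discovery ends up enabled while the lighthouseAgent and lighthouseCoreDns service account flags stay false, which matches the failure described above.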
What happened:
It seems like the most recently merged PR broke the Lighthouse+Helm jobs.
In the flake finder, the jobs from 5 days ago and earlier all passed:
https://github.com/submariner-io/lighthouse/actions/workflows/flake_finder.yml
https://github.com/submariner-io/lighthouse/actions/runs/731556462
The jobs from 4 days ago onward all fail in the same way:
https://github.com/submariner-io/lighthouse/actions/runs/734764600
In the PR-triggered E2E, the PR before the one in question passed:
https://github.com/submariner-io/lighthouse/pull/501
The PR-triggered E2E on the PR in question failed, but the PR was merged:
https://github.com/submariner-io/lighthouse/pull/502
Reverting the PR fixes the Helm jobs:
https://github.com/dfarrell07/lighthouse/pull/1
Compare this with PRs on the same base, run at about the same time, where the Helm jobs fail:
https://github.com/submariner-io/lighthouse/pull/503
There were no PRs merged to the Helm repo in the relevant timeframe.
From what I can see of the logs, the nginx connectivity tests pass and the first E2E test fails.
https://pastebin.com/5Tbf5Xb9
Environment:
Lighthouse CI