Here is the difference between two apps, one deployed properly (via Flux) and another which is not deployed properly (via opsctl):
```yaml
apiVersion: application.giantswarm.io/v1alpha1
kind: App
metadata:
  annotations:
    app-operator.giantswarm.io/latest-configmap-version: "79866114"
    app-operator.giantswarm.io/latest-secret-version: "79866118"
    chart-operator.giantswarm.io/force-helm-upgrade: "true"
    config.giantswarm.io/version: main
  creationTimestamp: "2022-09-15T14:43:11Z"
  finalizers:
  - operatorkit.giantswarm.io/app-operator-app
  generation: 22
  labels:
    app-operator.giantswarm.io/version: 0.0.0
    app.kubernetes.io/name: aws-network-topology-operator
    giantswarm.io/managed-by: konfigure
    kustomize.toolkit.fluxcd.io/name: collection
    kustomize.toolkit.fluxcd.io/namespace: flux-giantswarm
  name: aws-network-topology-operator
  namespace: giantswarm
  resourceVersion: "79866441"
  uid: 2fe4f51e-f86f-44ca-ba00-defe943bcd30
spec:
  catalog: control-plane-catalog
  config:
    configMap:
      name: aws-network-topology-operator-konfigure
      namespace: giantswarm
    secret:
      name: aws-network-topology-operator-konfigure
      namespace: giantswarm
  install: {}
  kubeConfig:
    context:
      name: ""
    inCluster: true
    secret:
      name: ""
      namespace: ""
  name: aws-network-topology-operator
  namespace: giantswarm
  namespaceConfig: {}
  userConfig:
    configMap:
      name: ""
      namespace: ""
    secret:
      name: ""
      namespace: ""
  version: 1.2.0
status:
  appVersion: 1.2.0
  release:
    lastDeployed: "2022-10-24T14:08:07Z"
    status: deployed
    version: 1.2.0
```
```yaml
apiVersion: application.giantswarm.io/v1alpha1
kind: App
metadata:
  annotations:
    chart-operator.giantswarm.io/force-helm-upgrade: "true"
  creationTimestamp: "2022-10-26T13:55:59Z"
  finalizers:
  - operatorkit.giantswarm.io/app-operator-app
  generation: 3
  labels:
    app-operator.giantswarm.io/version: 0.0.0
    app.kubernetes.io/name: aws-vpc-operator
  name: aws-vpc-operator
  namespace: giantswarm
  resourceVersion: "82992461"
  uid: 79cbeaa2-1881-4aae-8238-16f5835100e2
spec:
  catalog: control-plane-test-catalog
  config:
    configMap:
      name: ""
      namespace: ""
    secret:
      name: ""
      namespace: ""
  install: {}
  kubeConfig:
    context:
      name: ""
    inCluster: true
    secret:
      name: ""
      namespace: ""
  name: aws-vpc-operator
  namespace: giantswarm
  namespaceConfig: {}
  userConfig:
    configMap:
      name: ""
      namespace: ""
    secret:
      name: ""
      namespace: ""
  version: 0.0.0-1830810adf47b07ffba20d4594e30f745b3fac38
status:
  appVersion: 0.0.0-1830810adf47b07ffba20d4594e30f745b3fac38
  release:
    lastDeployed: "2022-10-27T07:37:24Z"
    status: deployed
    version: 0.0.0-1830810adf47b07ffba20d4594e30f745b3fac38
```
Added this to SIG product board, as it is not clear which team should own and tackle this problem. cc @giantswarm/sig-product
SIG Product notes reflect that this has moved to honeybadger for now
@piontec as agreed, please have a look at this in prep for tomorrow's planning, so we can get an idea of the effort/impact of this
Impact is pretty high as it makes it harder for people to test stuff on CAPI MCs. As for the effort - let's check with the team in planning
@uvegla
@nprokopic We are ~95% sure this is not an issue with the deploy command; the problem rather originates from the fact that the login / kubeconfig mechanism is different for CAPI clusters (the error message goes down to the kubeconfig part of opsctl, saying: Check whether kubeconfig is already set up for the installation), so I am not sure it is a Honey Badger issue. I will take a look though and collect proof / try to fix it, but I might end up pulling in someone from the KAAS area about the login / kubeconfig issue.
@nprokopic The problem is indeed in the K8s client. Got a trace for the root cause here: https://github.com/giantswarm/opsctl/pull/1563#issuecomment-1308900411 Does it ring any bells to you? I will keep looking for a solution, but help is appreciated because I do not know how the client setup towards CAPx should work, and in the trace there is a specific case that does this logic only for "PureCAPIProviders" (see that case, and the trace above it that enforces the SSO boot in that case).
Update: I tried to simply set the API endpoint at the very beginning of the boot-with-SSO logic. That progresses things a bit, but then it bumps into issues with the ca.pem needing to exist locally, logic that the boot-with-key-pairs path already has. But simply copying that part over does not help. That is where I am ATM.
Doesn't the `deploy` command usually create a client certificate for the MC we want to deploy to? In this case, it will fail because client certificate creation as we do it via Vault is not available on CAPI (at least as far as I know).
If we want to enable the `deploy` command via SSO, we would indeed need the CA. The `login` command does this by querying the `athena` service running on the MC. It exposes the `ca.pem` as well as the OIDC issuer URL to enable SSO.
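For illustration, that kind of lookup could look roughly like the sketch below; the `athena` endpoint path and the GraphQL query/field names here are assumptions for the sketch, not the actual schema:

```sh
# Sketch only: assumes athena is reachable under the installation's base domain
# and serves a GraphQL endpoint; the query fields are illustrative assumptions.
curl -s "https://athena.golem.gaws.gigantic.io/graphql" \
  -H "Content-Type: application/json" \
  -d '{"query":"{ kubernetes { apiUrl caCert } identity { provider } }"}'
```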
@anvddriesch @nprokopic Based on Antonia's comment, and since our hunch seems to be correct that this is primarily an authentication-towards-CAPx issue, does it make sense that KAAS / someone with more knowledge of how this works on CAPx takes over? I would also be happy to pair on it to gather some knowledge in this area.
The CAPI PKI issue is still ongoing and it appears there is no clear solution yet: https://github.com/giantswarm/giantswarm/issues/15981 If we decide to use SSO here, Rainbow is available for assistance.
I think this change by @erkanerol might be related to / fix the issue, no?
My PR might be related to this issue. I didn't continue it since I don't know all the use cases of opsctl, and I didn't want to break something else while fixing an issue.
Based on the above investigation and https://github.com/giantswarm/giantswarm/issues/15981, we think this is not an issue Honey Badger can solve / unblock at the moment. Authentication towards different providers via SSO - which is hard-coded to be the default for CAPx providers - needs knowledge from / to be solved by the respective KAAS teams. Also, with `--use-kubeconfig` - the other kind of authentication, which skips SSO - all the other logic of the command seems to work. Therefore we decided to move this to watching status on the Honey Badger board and assign it to the KAAS teams to solve SSO authentication towards CAPx clusters.
Short summary update: this was discussed in SIG Product; we agreed @giantswarm/team-rainbow will unblock @giantswarm/team-honeybadger regarding authentication. cc @uvegla @anvddriesch
@nprokopic I cannot seem to reproduce the issue of the `config` section not being set correctly.
This is what I started with on `golem`:
```yaml
apiVersion: application.giantswarm.io/v1alpha1
kind: App
metadata:
  annotations:
    app-operator.giantswarm.io/latest-configmap-version: "21063776"
    app-operator.giantswarm.io/latest-secret-version: "21063779"
    chart-operator.giantswarm.io/force-helm-upgrade: "true"
    config.giantswarm.io/version: main
    dummy: Add dummy change to create a new version
  creationTimestamp: "2022-10-31T07:57:58Z"
  finalizers:
  - operatorkit.giantswarm.io/app-operator-app
  generation: 22
  labels:
    app-operator.giantswarm.io/version: 0.0.0
    app.kubernetes.io/name: aws-vpc-operator
    giantswarm.io/managed-by: konfigure
    kustomize.toolkit.fluxcd.io/name: collection
    kustomize.toolkit.fluxcd.io/namespace: flux-giantswarm
  name: aws-vpc-operator
  namespace: giantswarm
  resourceVersion: "22598942"
  uid: 0bbb4f4d-9179-449f-a429-d6913c8fc798
spec:
  catalog: control-plane-catalog
  config:
    configMap:
      name: aws-vpc-operator-konfigure
      namespace: giantswarm
    secret:
      name: aws-vpc-operator-konfigure
      namespace: giantswarm
  install: {}
  kubeConfig:
    context:
      name: ""
    inCluster: true
    secret:
      name: ""
      namespace: ""
  name: aws-vpc-operator
  namespace: giantswarm
  namespaceConfig: {}
  rollback: {}
  uninstall: {}
  upgrade: {}
  userConfig:
    configMap:
      name: ""
      namespace: ""
    secret:
      name: ""
      namespace: ""
  version: 0.1.0-alpha.18
status:
  appVersion: 0.1.0-alpha.18
  release:
    lastDeployed: "2022-11-16T14:35:57Z"
    status: deployed
    version: 0.1.0-alpha.18
```
Then ran:

```
opsctl deploy --use-kubeconfig -i golem aws-vpc-operator@opsctl-test --level=debug
```

and ended up with:
```yaml
apiVersion: application.giantswarm.io/v1alpha1
kind: App
metadata:
  annotations:
    app-operator.giantswarm.io/latest-configmap-version: "21063776"
    app-operator.giantswarm.io/latest-secret-version: "21063779"
    chart-operator.giantswarm.io/force-helm-upgrade: "true"
    config.giantswarm.io/version: main
    opsctl.x-giantswarm.io/creator: laszlouveges
    opsctl.x-giantswarm.io/previous-app-version: 0.1.0-alpha.18
  creationTimestamp: "2022-10-31T07:57:58Z"
  finalizers:
  - operatorkit.giantswarm.io/app-operator-app
  generation: 25
  labels:
    app-operator.giantswarm.io/version: 0.0.0
    app.kubernetes.io/name: aws-vpc-operator
    giantswarm.io/managed-by: konfigure
    kustomize.toolkit.fluxcd.io/name: collection
    kustomize.toolkit.fluxcd.io/namespace: flux-giantswarm
  name: aws-vpc-operator
  namespace: giantswarm
  resourceVersion: "22669009"
  uid: 0bbb4f4d-9179-449f-a429-d6913c8fc798
spec:
  catalog: control-plane-test-catalog
  config:
    configMap:
      name: aws-vpc-operator-konfigure
      namespace: giantswarm
    secret:
      name: aws-vpc-operator-konfigure
      namespace: giantswarm
  install: {}
  kubeConfig:
    context:
      name: ""
    inCluster: true
    secret:
      name: ""
      namespace: ""
  name: aws-vpc-operator
  namespace: giantswarm
  namespaceConfig: {}
  rollback: {}
  uninstall: {}
  upgrade: {}
  userConfig:
    configMap:
      name: ""
      namespace: ""
    secret:
      name: ""
      namespace: ""
  version: 0.1.0-alpha.18-7e92fdea215e15465b78516b4846c32e815b6056
status:
  appVersion: 0.1.0-alpha.18-7e92fdea215e15465b78516b4846c32e815b6056
  release:
    lastDeployed: "2022-11-16T15:52:18Z"
    status: deployed
    version: 0.1.0-alpha.18-7e92fdea215e15465b78516b4846c32e815b6056
```
Debugging into it: if the App already exists, it runs this patch: https://github.com/giantswarm/opsctl/blob/v2.24.7/pkg/cmd/deploy/appdeploy/deployer.go#L178-L186 - from a copy of the original CR it just changes the version. (Disclaimer: I have a hunch that on this line it should return `modifiedAppCR` as the 1st argument instead, but I think the CR patch has already taken place correctly at that point.)
What I am thinking is that the config is empty when `opsctl deploy` runs for an app that does not exist yet. It is initialized here as this and created here.
> I cannot seem to reproduce the issue of the config section not being set correctly.
The app has been pushed to capa-app-collection, so Flux has deployed it and wired up all the required config. The issue should be reproducible with an app that is not already deployed by Flux (some newly developed operator, for example, and we have those often lately). At least that's my understanding based on recent Slack chats.
> What I am thinking is that the config is empty when `opsctl deploy` runs for an app that does not exist yet. It is initialized here as this and created here.
@nprokopic In that case it proves my hunch. What kind of configmaps do you expect to be added automatically by opsctl when there is no previous app to inherit from? I assume it would be what konfigure generates based on the config repo, but I want to double-check. In the meantime I will look into it a bit later, because there might be such a feature around in opsctl already, but I need to check.
UPDATE: So it seems the app deploy does not use konfigure for a reason: https://github.com/giantswarm/opsctl/blob/v2.24.7/command/deploy/command.go#L244-L259, and that deployment method is always used when `--use-kubeconfig` is passed, which we want to make the default for CAPI for now (see below). Sadly I have no context on why these decisions were made / why it is the way it is, so let me do a bit more digging and see if we can just use the Flux method anyway.
UPDATE: FTR the Flux deployment does not work ATM because the `config.giantswarm.io/version` annotation has to be set in the `Chart.yaml`, here: https://github.com/giantswarm/aws-vpc-operator/blob/v0.1.0-alpha.19/helm/aws-vpc-operator/Chart.yaml#L15. I am moving on with adding it in my experiment branch for aws-vpc-operator.
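For reference, a minimal sketch of what that annotation looks like in a chart's `Chart.yaml`; the chart metadata and the version value below are placeholders, not the actual aws-vpc-operator chart:

```yaml
apiVersion: v2
name: aws-vpc-operator   # placeholder chart metadata
version: 0.1.0
annotations:
  # tells the config tooling (konfigure) which config schema version to render for this app
  config.giantswarm.io/version: 1.x.x
```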
UPDATE: Adding the annotation helped to progress, but eventually it really does fail with Vault when using `--use-kubeconfig`, here: https://github.com/giantswarm/opsctl/blob/v2.24.7/pkg/cmd/deploy/fluxdeploy/konfigure.go#L109
UPDATE: Another breadcrumb: Vault fails because it cannot get the `vault/ca.pem` file from the `installations` repo here: https://github.com/giantswarm/opsctl/blob/v2.24.7/service/vaultfactory/installation_client.go#L87, because when `--use-kubeconfig` is set, this happens: https://github.com/giantswarm/opsctl/blob/v2.24.7/command/deploy/command.go#L281-L283, so the installation name is passed around everywhere as an empty string. This is terrible but also semi makes sense, because when a kubeconfig is used you could say it does not matter, since the current context is used - but life finds a way, I guess. Even if it were fixed somehow, `golem` for example does not have a `vault/ca.pem` file, so honestly the Flux method would fail even if we got the authentication towards CAPI sorted. See: https://github.com/giantswarm/installations/tree/master/golem
I have been looking into making flux deployment via SSO work.
Unfortunately, this is a lot more complicated than assumed. As @uvegla found earlier, we are already expecting client certificate data to be present when configuring clients. Changing this to accept authprovider configuration would be a bigger restructure and it would be good if rainbow could invest some time there and do it well.
However, that alone would not make things work, either. When creating resources via `konfigure` we also expect `vault` to be present and if it's not (as in CAPI), that part will fail.
My proposal is that we refine and prioritize refactoring of kubeconfig management for opsctl in rainbow and make it a proper story. In the meantime, we can unblock `deploy` by defaulting to `--use-kubeconfig` in CAPI installations.
Does that make sense to you?
> My proposal is that we refine and prioritize refactoring of kubeconfig management for opsctl in rainbow and make it a proper story.
If you have the time to do it, that would be lovely. We were just talking about how messy / copy-pasted those parts of `opsctl` are, and extracting them into a single module would totally make sense. I think @kubasobon already attempted to do it once and might have some input for you.
> We can unblock `deploy` by defaulting to `--use-kubeconfig` in CAPI installations.
Works from my perspective. @nprokopic?
@nprokopic sadly I found this (added updates to this comment) but the main part is:
> `golem` for example does not have a `vault/ca.pem` file, so honestly the Flux method would fail even if we got the authentication towards CAPI sorted. See: https://github.com/giantswarm/installations/tree/master/golem
UPDATE: There is no Vault in `golem`. All CAPx clusters will eventually be vaultless.
To my knowledge, the last state in https://github.com/giantswarm/giantswarm/issues/15981 was that we will need Vault going forward for certificate mgmt, even if that's a Vault within the MC.
Still, for our own purposes I would prefer if we could go full OIDC SSO and have bot accounts for automations.
From the above I see you are looking for Vault or at least the CA. I'm not sure what we need Vault or the CA for, and why we cannot re-use the `opsctl login` functionality here:

```
opsctl login golem
Logging in to the management cluster of installation golem
Executing the command
kubectl gs login https://api.golem.gaws.gigantic.io --cluster-admin
Your browser should now be opening this URL:
https://dex.golem.gaws.gigantic.io/auth?access_type=offline&client_id=zQiFLUnrTFQwrybYzeY53hWWfhOKWRAU&redirect_uri=http%3A%2F%2Flocalhost%3A63564%2Foauth%2Fcallback&response_type=code&scope=openid+profile+email+groups+offline_access+audience%3Aserver%3Aclient_id%3Adex-k8s-authenticator&state=QWHhuZZPv%2B7Sojoy07agBnNLcUH1GuqQtPOE7K6GKSM%3D&connector_id=giantswarm
Logged in successfully as 'puja108' on cluster 'golem'.
A new kubectl context 'gs-golem' has been created and selected. To switch back to this context later, use either of these commands:
kubectl gs login golem
kubectl config use-context gs-golem
```
With the above I get full access to golem, right? Sure, it's using `kgs`, but why can `opsctl deploy` not re-use that? Worst case: make opsctl commands optionally accept the current (or a specific local) kubectl context; we even have standardized context names, so choosing the installation should work.
Maybe I'm mistaken and there's some other complexity?
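As a rough sketch of what that re-use could look like, combining the commands already shown in this thread (picking up the current context is the suggestion above, not existing behaviour):

```sh
# log in via SSO; kubectl-gs creates and selects the standardized context
opsctl login golem
kubectl config use-context gs-golem   # standardized context name per installation

# then deploy against whatever the current kubeconfig context points at
opsctl deploy --use-kubeconfig -i golem aws-vpc-operator@opsctl-test --level=debug
```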
Sidenote: remember that `opsctl deploy` (like `ensure` and some other opsctl commands) is something we should not keep mid-term; it's more or less a workaround for testing branches, which should move to pipelines, GitOps, automated dev setups, ...
See: https://github.com/giantswarm/opsctl/pull/1578
Added SOPS support to the `konfigure` module. It is based on a dummy check of whether `INSTALLATION/vault/ca.pem` exists in the installations repository, as we do not have a better one right now.
Also added a new argument that can enforce which deployer to use.
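In other words, the decision boils down to roughly this check against a local clone of the installations repo (a sketch of the described heuristic, not the actual opsctl code):

```sh
# If the installation has no Vault CA in the installations repo,
# treat it as vaultless (CAPx) and use SOPS instead of Vault.
INSTALLATION=golem
if [ -f "installations/${INSTALLATION}/vault/ca.pem" ]; then
  echo "vault/ca.pem found: use the Vault-based konfigure path"
else
  echo "no vault/ca.pem: assume vaultless installation, use SOPS"
fi
```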
Moved to watching, to monitor it for a while after the fix.
The latest state, where we can use the combination of `--use-kubeconfig` for auth and `-m flux` to enforce the usage of config, is great, thanks!
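As a concrete example of that combination (same app and installation as in the reproduction above; `-m flux` being the deployer-selection argument added in the PR linked earlier):

```sh
# authenticate via the current kubeconfig context and enforce the Flux/konfigure deployer
opsctl deploy --use-kubeconfig -m flux -i golem aws-vpc-operator@opsctl-test --level=debug
```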
`opsctl deploy` is not working on CAPI MCs.

Default way of deploying

This happens when trying to deploy an app:

Deploying with `--use-kubeconfig` flag

Deploying an app with `--use-kubeconfig` seems to be working, but it actually doesn't, as the app is not deployed properly (see below): the App CR is not correct, as `spec.config` is empty.

For example, in an operator deployed properly by Flux, `spec.config` looks like this:

The `aws-vpc-operator` app that I was trying to deploy here is new and it's still not a part of `capa-app-collection`, so not sure if that makes any difference.