Support for non-EKS IRSA

thefirstofthe300 commented 2 years ago

/kind feature

IRSA appears to have been complete for EKS clusters in #2054. My company currently uses non-EKS clusters in combination with IRSA. I'd like the ability to install IRSA using the provider. The main piece that is tricky to coordinate is the certificate authority used to sign and validate the JWTs. If these pieces could be automated as part of the ignition config (we also use Flatcar), our burden of installation/maintenance would be greatly decreased.

Environment:

Cluster-api-provider-aws version: v1.4
Kubernetes version: (use kubectl version): v1.24
OS (e.g. from /etc/os-release): Flatcar Container Linux by Kinvolk 3139.2.3 (Oklo)

thefirstofthe300 commented 2 years ago

I managed to POC using the built-in well-known OID configuration endpoint in the API server (https://github.com/kubernetes/kubernetes/pull/98553); however, it currently requires bringing my own infra and knowing the CA cert's fingerprint. Here's my configuration:

BYO load balancer listening on port 443 and forwarding traffic to the API server
BYO load balancer security group with the proper configuration to allow traffic to the ELB on port 443
Add a cluster-role allowing unauthenticated users to fetch the /.well-known/openid-configuration endpoint kubectl create clusterrolebinding oidc-reviewer --clusterrole=system:service-account-issuer-discovery --group=system:unauthenticate
Create the OIDC provider with the CA cert's fingerprint

Unfortunately, I don't really have time to take a stab at contributing this feature to CAPA right now or I'd try to add this to CAPA myself.

k8s-triage-robot commented 1 year ago

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

After 90d of inactivity, lifecycle/stale is applied
After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

Mark this issue or PR as fresh with /remove-lifecycle stale
Mark this issue or PR as rotten with /lifecycle rotten
Close this issue or PR with /close
Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

thefirstofthe300 commented 1 year ago

/remove-lifecycle stale

luthermonson commented 1 year ago

I just had to spec this out and will code it into CAPA shortly. I'll add my approach and I would like some more eyes on it before I spend too much time coding. First, I'm basing more of my knowledge on how to even get this working on an example written by @smalltown located here: https://github.com/smalltown/aws-irsa-example and accompanying blog article: https://medium.com/codex/irsa-implementation-in-kops-managed-kubernetes-cluster-18cef84960b6

Approach 1) Users will have to create an iaas cluster with bucket management turned on, this is a feature used in capa to manage an s3 bucket for ignition userdata and we will reuse this existing functionality to create/delete a bucket per cluster. 2) Make associateOidcProvider: true work for iaas clusters, it will use cert manager to gen certs and add keys.json and .well-know/openid-configuration and make them public read and then create the oidc provider in IAM and generate a trust policy and store it in a config map which you can use later in your roles which is idtentical to how managed clusters work. 3) Install https://github.com/aws/amazon-eks-pod-identity-webhook by taking the manifests and moving them into ./config, modify it to use a cert from cert-manager and have capa deploy it to the cluster. 4) As per expectations in EKS' implementation, capa will not build the irsa roles for you. The user is expected to use the trust policy located in a configmap and do this themselves. The user will create the role with the trust policies they need for their service and add the contents from this configmap to make a working role. The user is also expected to take the arn from that role and use it as an annotation eks.amazonaws.com/role-arn when they make a service account and assign it to a pod.

@thefirstofthe300 please read the above and see if you agree with the approach, while I like your idea of using the ServiceAccountIssuerDiscovery but I feel having capa spin up any extra infra is a bit too much to ask of users just to get IRSA support when the s3 buckets are already being used/managed by capa and have free public URLs. The only pieces capa would need to manage are some certs, iam oidc provider, webhook and the contents of the files in s3.

thefirstofthe300 commented 1 year ago

Either way will work. I think the main question comes down to how certs should work.

In essence, the way I went about creating things is identical to the way CAPA creates it's infrastructure; however, instead of generating everything using port 6443 for the API server ELB serving port, it uses port 443. I'd think providing the ability to set the serving port to 443 over 6443 would be relatively simple in the CAPA code and remove the need for the BYO infra.

The two things I've had to shove into my configs are adding the root ca cert to the API serving cert chain using Ignition and adding the cluster binding to allow unauthenticated users to view the openid config.

Doing this removes the need to rely on S3 to serve the configuration.

The main piece that I have a question on is whether people want to serve the root cert with their serving chain. Ultimately, I don't think it opens any security holes as the certificates themselves are basically a public key and no change is made to the TLS private key.

I'd be interested in hearing some input from someone more in tune with the security side.

dlipovetsky commented 1 year ago

How much of this can we solve with documenation vs. code and cluster/infrastructure component template changes?

/triage needs-information

luthermonson commented 1 year ago

@dlipovetsky fair question but all the pieces together make this a complex feature as it mixes aws resources (s3, oidc provider) and kube resources (cert, deployment) and the one thing which knows about all those things is capa and can only be built at cluster creation time. When I get this written we will have feature parity between iaas and eks for IRSA and i feel it's only fair that capa does the work to make feature parity between the two cluster types

Skarlso commented 1 year ago

@dlipovetsky I think I agree with Luther. I don't believe a document change can handle this. The IRSA logic requires a fair amount of work between the pods, assigning things, creating connections, adding the webhook, generating certificates etc.

Honestly, this is quite a large piece of work. @luthermonson and @thefirstofthe300, do you think we could break this down into several more minor issues and use this issue as an epic to track each piece?

Such as creating the configuration, supporting 443, adding the webhook installation process, etc.

k8s-triage-robot commented 1 year ago

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

After 90d of inactivity, lifecycle/stale is applied
After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

Mark this issue as fresh with /remove-lifecycle stale
Close this issue with /close
Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

Skarlso commented 1 year ago

/remove-lifecycle stale

Skarlso commented 1 year ago

/triage accepted

k8s-triage-robot commented 6 months ago

This issue has not been updated in over 1 year, and should be re-triaged.

You can:

Confirm that this issue is still relevant with /triage accepted (org members only)
Close this issue with /close

For more details on the triage process, see https://www.kubernetes.dev/docs/guide/issue-triage/

/remove-triage accepted

k8s-ci-robot commented 6 months ago

This issue is currently awaiting triage.

If CAPA/CAPI contributors determines this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes-sigs/prow](https://github.com/kubernetes-sigs/prow/issues/new?title=Prow%20issue:) repository.

k8s-triage-robot commented 3 months ago

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

After 90d of inactivity, lifecycle/stale is applied
After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

Mark this issue as fresh with /remove-lifecycle stale
Close this issue with /close
Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot commented 2 months ago

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

After 90d of inactivity, lifecycle/stale is applied
After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

Mark this issue as fresh with /remove-lifecycle rotten
Close this issue with /close
Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

k8s-triage-robot commented 1 month ago

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

After 90d of inactivity, lifecycle/stale is applied
After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

Reopen this issue with /reopen
Mark this issue as fresh with /remove-lifecycle rotten
Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

k8s-ci-robot commented 1 month ago

@k8s-triage-robot: Closing this issue, marking it as "Not Planned".

In response to [this](https://github.com/kubernetes-sigs/cluster-api-provider-aws/issues/3560#issuecomment-2389714676): >The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs. > >This bot triages issues according to the following rules: >- After 90d of inactivity, `lifecycle/stale` is applied >- After 30d of inactivity since `lifecycle/stale` was applied, `lifecycle/rotten` is applied >- After 30d of inactivity since `lifecycle/rotten` was applied, the issue is closed > >You can: >- Reopen this issue with `/reopen` >- Mark this issue as fresh with `/remove-lifecycle rotten` >- Offer to help out with [Issue Triage][1] > >Please send feedback to sig-contributor-experience at [kubernetes/community](https://github.com/kubernetes/community). > >/close not-planned > >[1]: https://www.kubernetes.dev/docs/guide/issue-triage/ Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes-sigs/prow](https://github.com/kubernetes-sigs/prow/issues/new?title=Prow%20issue:) repository.

kubernetes-sigs / cluster-api-provider-aws

Support for non-EKS IRSA #3560