rancher / rancher

Complete container management platform
http://rancher.com
Apache License 2.0
23.3k stars · 2.96k forks

RFE: Airgapped RKE2 etcd to AWS S3 via proxy #44256

Open HoustonDad opened 8 months ago

HoustonDad commented 8 months ago

Request description:

We are trying to back up our etcd cluster to AWS S3 from within our network. Doing so requires that traffic to go through a proxy.

While the Rancher-Backup chart has proxy/noproxy configuration for its own deployment, there does not appear to be a way to do the same for the etcd backups alone, as they happen at the RKE2 level and are not a specific backup deployment/job/container.

Adding the proxy/noproxy settings at the RKE2 level is not an option, as the kubelet would then be able to talk externally, which is restricted in our clusters.

Actual behavior: The kubelet must be granted proxy access in order to get S3 backups placed offsite.

Expected behavior: Only the etcd backup process is granted proxy access to push backups offsite.

Workaround: None

HoustonDad commented 8 months ago

This maps back to SURE-6226

brandond commented 8 months ago

It sounds like you are asking for explicit proxy support for the S3 client? Separate from setting the HTTP proxy env vars for the whole server process?

Are you not able to set proxy ACLs for the cluster nodes, such that they are only allowed to access S3? This is a pretty common pattern when using a proxy to restrict outbound access. You'll need to do this anyway even if we do add this feature, as I don't think you'd want to simply trust that the nodes are only using the proxy for S3 and not other things?
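As a sketch of the proxy-ACL pattern described above, a Squid configuration restricting cluster nodes to the S3 endpoint might look like this (the CIDR and hostname are illustrative, not from this thread):

```
# Allow cluster nodes to CONNECT only to the S3 endpoint; deny everything else.
acl cluster_nodes src 10.0.0.0/16
acl s3_endpoint dstdomain .s3.us-east-1.amazonaws.com
acl ssl_port port 443

http_access allow cluster_nodes s3_endpoint ssl_port
http_access deny all
```

This enforces at the proxy that the nodes cannot reach anything else outbound, regardless of what the host is configured to do.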

zlmitchell commented 8 months ago

> It sounds like you are asking for explicit proxy support for the S3 client? Separate from setting the HTTP proxy env vars for the whole server process?

This is accurate. Allowing the whole OS/RKE2 stack, and everything therein, proxy access is not permitted.

> Are you not able to set proxy ACLs for the cluster nodes, such that they are only allowed to access S3? This is a pretty common pattern when using a proxy to restrict outbound access. You'll need to do this anyway even if we do add this feature, as I don't think you'd want to simply trust that the nodes are only using the proxy for S3 and not other things?

I can see why you would toss this issue up the OSI layers, but the Rancher Backup containers allow this, while etcd backups are done at the OS level, which is the problem. The more specific we can be, the better, when it comes to proxy/external activity: not the host, but the container/application.

The etcd backup should probably be done in a similar way to Rancher's Backup container. This would allow a single container to be passed through a defined proxy, with the benefit that it is definable using GitOps/Vault auth instead of plain text on the OS filesystem. Also, don't get me started on the etcd backup auth sitting in plain text in the config.yaml... and editing it on Rancher-provisioned systems.
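For context, the settings being complained about live in `/etc/rancher/rke2/config.yaml` on each server node. A sketch with illustrative values (flag names follow the RKE2/K3s etcd snapshot options):

```yaml
etcd-snapshot-schedule-cron: "0 */6 * * *"
etcd-snapshot-retention: 10
etcd-s3: true
etcd-s3-endpoint: "s3.amazonaws.com"
etcd-s3-region: "us-east-1"
etcd-s3-bucket: "my-cluster-snapshots"   # illustrative bucket name
etcd-s3-access-key: "AKIA..."            # stored in plain text on the host
etcd-s3-secret-key: "..."                # likewise plain text
```

Unless the node has ambient IAM credentials, the access and secret keys sit in this file on every server node's filesystem.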

brandond commented 8 months ago

> The ETCD backup should probably be done in a similar way to Rancher's Backup Container.

You're aware that etcd backups are of the actual cluster datastore, at a lower level than the apiserver, right? etcd backup and restore need to happen outside the context of a running cluster, and cannot be done from within a pod or other construct that relies on the cluster being operational. We have to do them on the host itself.

The rancher backup operator backs up cluster resources, and needs a functioning Kubernetes apiserver to operate against. They operate at very different levels.

> Also don't get me started on the ETCD backup auth in plain text in the config.yaml

If you don't have ambient credentials (ie IAM role from the instance metadata service), credentials need to be passed to RKE2 in order to access the S3 endpoint. Creds need to be stored somewhere; where would you suggest that we keep them?

At the end of the day your steps are going to be the same:

  1. Allow your RKE2 node access through the proxy, ensuring that the proxy ACLs allow access only to the S3 endpoint
  2. Configure HTTP proxy for RKE2

If you configure HTTP_PROXY and NO_PROXY correctly within Rancher's "agent args" settings for the cluster, both rancher-system-agent and RKE2 should be able to save snapshots through the proxy, without using it for anything else, and without waiting on us to add any one-off args that enable special behavior for the snapshot process.
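As a sketch of that approach: RKE2 picks up proxy settings from `/etc/default/rke2-server` (or `/etc/sysconfig/rke2-server` on some distros), and a tight `NO_PROXY` keeps in-cluster and node-local traffic off the proxy (the proxy address and CIDRs here are illustrative):

```
HTTP_PROXY=http://proxy.example.com:3128
HTTPS_PROXY=http://proxy.example.com:3128
NO_PROXY=127.0.0.0/8,10.0.0.0/8,172.16.0.0/12,192.168.0.0/16,.svc,.cluster.local
```

Combined with proxy ACLs that only permit the S3 endpoint, this routes snapshot uploads through the proxy without opening general outbound access.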

zlmitchell commented 8 months ago

@brandond You okay? What you wrote comes across with some unfriendly vibes, or an unwillingness to move a new idea forward.

Primarily our issue is with Rancher-provisioned downstream clusters. As it stands right now, RKE2 performs etcd backups at the OS level; this is a requirement. You configure the backup interval via the config.yaml. When a backup happens, RKE2 creates a configmap manifest and drops it into the /var/lib/rancher/rke2/server/manifests directory, recording that backup with Rancher itself.

At this point the export to S3 isn’t required to be completed at the OS level. Rancher has the information on the etcd backup and where it exists on the host.

Now the question comes into play: how do we move those backups to S3 and let Rancher know they exist? Should the rancher-system-agent run a command independently, with its own proxy settings, to ship the file to S3 or pull one down, or should Rancher run a process on the node to do the same in a container? Could Rancher also command the rancher-system-agent to pull backups down from S3 (or another location) to the host for restore, using the same procedure? Since the rancher-system-agent runs outside the cluster, it could handle this; alternatively, the push to/pull from S3 could be handled as a container next to Rancher MCM, or on the particular cluster, with HTTP_PROXY settings.

We should not have to do any configuration at the OS level; Rancher does that for us. This eliminates the need to roll the control planes when configuration updates are required. The secret for S3 can then be kept in Rancher MCM, Vault, or another secret management system. This would also allow backing up to and restoring from different locations in S3 (regionally separated backups).

brandond commented 8 months ago

It sounds like you have a lot of thoughts on how Rancher should be managing downstream cluster nodes, and in particular how the agent could better manage distribution of credentials for, and access to, the configured S3 endpoint.

I'm going to move this to the rancher/rancher repo. If it is picked up there, and they determine that there is additional work necessary on the RKE2 side, we can figure out what that might look like.

zlmitchell commented 8 months ago

@brandond Sorry if my example of how the functionality could be addressed, specifically for Rancher-provisioned systems, gave you the feeling that you could pass this feature request to another team, but the issue still exists for native RKE2 servers with no Rancher management.

This should be tackled in RKE2 first, or entirely, to ensure compatibility with Rancher afterward. The fact that the S3 process is defined directly in the RKE2 binary is an issue. It could instead be handled by a post-process in RKE2 when it generates the rke2-etcd-snapshot configmap: at that point, an operator or similar component in RKE2 could monitor the snapshot and provide the S3 upload/download capabilities. This would still leave the original functionality available for DR purposes.

The current solution, where the RKE2 binary is allowed to restore S3 snapshots and force-push them to S3, is fine, but the secret management and flexibility are still not what our systems require.

github-actions[bot] commented 6 months ago

This repository uses an automated workflow to automatically label issues which have not had any activity (commit/comment/label) for 60 days. This helps us manage the community issues better. If the issue is still relevant, please add a comment to the issue so the workflow can remove the label and we know it is still valid. If it is no longer relevant (or possibly fixed in the latest release), the workflow will automatically close the issue in 14 days. Thank you for your contributions.

HoustonDad commented 6 months ago

Not stale. Please re-open

brandond commented 6 months ago

It's not closed... but I can un-stale it.

brandond commented 6 months ago

The K3s ADR supporting this (https://github.com/k3s-io/k3s/pull/9364) has been open for a while; we need someone from Hostbusters to take a look and sign off on it.


brandond commented 4 months ago

Will create k3s/rke2 issues to track the distro side of this. Rancher can add support once it's implemented in the distros.

github-actions[bot] commented 2 months ago

This issue has again been labeled stale after 60 days without activity; without a new comment, it will be closed in 14 days.

zlmitchell commented 2 months ago

Not stale, awaiting implementation.

brandond commented 2 months ago

See above linked issues for distro implementation. I'll leave this open pending integration on the rancher side.

HoustonDad commented 1 month ago

Do we want to use this GH issue for the Rancher-side implementation, or do we want to break it out into its own issue, tracked back to here?