Supporting managed clusters without direct network connectivity

randomvariable commented 2 years ago

User Story

As a designer of a Platform as a Service for multiple customers, I want to be able to provision managed clusters (e.g. EKS, AKS) without L2/L3 connectivity into the customer network.

Detailed Description

There are cases where someone wants to run a common cluster as a service offering, but as a service provider supporting multiple customers. Compliance and SecOps requirements mean that direct L2 or L3 connectivity between the provisioning network containing the management cluster and the workload cluster is not available.

However, the management cluster does have access to the cloud provider APIs like AWS and Azure that allow the provisioning of the managed cluster offering.

Cluster API currently requires direct connectivity to the Kubernetes API server endpoint, because it is directly involved in the lifecycle management of the individual VMs forming the cluster. For a managed offering, Cluster API manages the lifecycle only indirectly. However, having a common API in the form of Cluster API to represent both managed and unmanaged clusters is still helpful.

It could be possible then to "take it upon trust" that the managed cluster provider is doing the right thing, is able to do LCM properly and is able to report cluster health correctly.

This can be summarised in the following requirements:

Cluster API relaxes the requirement to connect directly to the API server endpoint IFF the managed cloud provider offers suitable status information and LCM operations, including that of node fleet management.
Cluster API is still able to fetch a kubeconfig from the cloud provider.
Cluster API COULD have some mechanism to deploy some sort of workload on behalf of the service provider much like CRS does today - this would be implemented in a specialised way at the cloud provider level but may have a generic interface.
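
As a rough illustration of the first requirement, the "IFF" condition can be written as a simple capability check. This is only a sketch under stated assumptions: `ProviderCapabilities`, `can_skip_api_server_connection` and `reconcile_mode` are hypothetical names invented for this example and are not part of Cluster API or of any cloud provider SDK.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ProviderCapabilities:
    """Hypothetical summary of what a managed offering exposes via its cloud API."""

    reports_cluster_status: bool         # suitable status/health information
    supports_lifecycle_operations: bool  # LCM operations such as upgrades
    manages_node_fleet: bool             # node fleet management
    exposes_kubeconfig: bool             # a kubeconfig can still be fetched


def can_skip_api_server_connection(caps: ProviderCapabilities) -> bool:
    """Return True only if every capability named in the first requirement is present."""
    return (
        caps.reports_cluster_status
        and caps.supports_lifecycle_operations
        and caps.manages_node_fleet
    )


def reconcile_mode(caps: ProviderCapabilities) -> str:
    """Pick a (hypothetical) reconciliation mode for a managed cluster."""
    if can_skip_api_server_connection(caps):
        return "provider-api-only"
    return "direct-api-server"


if __name__ == "__main__":
    full = ProviderCapabilities(True, True, True, True)
    partial = ProviderCapabilities(True, True, False, True)
    print(reconcile_mode(full))
    print(reconcile_mode(partial))
```

The kubeconfig capability is tracked separately on purpose: the second requirement asks that a kubeconfig stay retrievable, but it is not part of the condition for relaxing connectivity.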

Anything else you would like to add:

/kind feature

cc @yastij @berndtj

fabriziopandini commented 2 years ago

/milestone v1.2

Thanks for the issue, the use case is super interesting, we should start to unfold the different requirements one by one.

vincepri commented 2 years ago

A few additional thoughts:

We need to understand how we'd communicate to users that certain features cannot be linked to clusters without connectivity to the API server.
This proposal goes in a different direction than the MachinePoolMachine one.

enxebre commented 2 years ago

@randomvariable thanks for putting this together!

> There are cases where someone wants to run a common cluster as a service offering, but as a service provider supporting multiple customers. Compliance and SecOps requirements mean that direct L2 or L3 connectivity between the provisioning network containing the management cluster and the workload cluster is not available. However, the management cluster does have access to the cloud provider APIs like AWS and Azure that allow the provisioning of the managed cluster offering.

In this use case, are you assuming the control plane consists of individual machines running within the workload cluster infrastructure? I have a similar use case for a "common cluster as a service offering" where we decouple (at the infra and networking level) the control plane (hosted as pods running in the management cluster) from the data plane (compute nodes running within the cluster consumer's infrastructure). Controllers for each hosted cluster (including CAPI) live in that hosted cluster's own namespace. Network policies restrict traffic targeted at each hosted cluster's API server to its namespace. Control plane to cluster communication is handled by konnectivity.

Just more food for thought as we work towards a proposal.

k8s-triage-robot commented 2 years ago

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

After 90d of inactivity, lifecycle/stale is applied
After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

Mark this issue or PR as fresh with /remove-lifecycle stale
Mark this issue or PR as rotten with /lifecycle rotten
Close this issue or PR with /close
Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

fabriziopandini commented 2 years ago

@jackfrancis and @richardcase, in case we want to consider this in the managed Kubernetes discussion.

richardcase commented 2 years ago

We have the scenario where we need to be able to support a workload cluster initiated connection as opposed to the current management cluster initiated connection. And this isn't specifically for managed k8s clusters, but a general principle for all clusters created. We see this requirement a lot with customers.
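
To make the direction of the connection concrete, here is a minimal in-process sketch of a workload-initiated pattern: the management side only queues requests, and a workload-side agent pulls them and reports results. Everything here (`ManagementEndpoint`, `WorkloadAgent`, the operation names) is hypothetical and only models the idea; it is not how Cluster API works today, and a real design would use an actual transport such as an outbound tunnel or a message broker.

```python
from collections import deque


class ManagementEndpoint:
    """Hypothetical management-side inbox; it never opens a connection itself."""

    def __init__(self):
        self._pending = deque()
        self.results = {}

    def enqueue(self, request_id, operation):
        """Queue an operation for the workload cluster to pick up later."""
        self._pending.append((request_id, operation))

    def fetch_pending(self):
        """Called by the workload-side agent over a connection the agent opened."""
        items = list(self._pending)
        self._pending.clear()
        return items

    def report(self, request_id, result):
        """Called by the agent to hand a result back to the management side."""
        self.results[request_id] = result


class WorkloadAgent:
    """Hypothetical agent running inside the workload cluster's network."""

    def __init__(self, endpoint, handlers):
        self._endpoint = endpoint
        self._handlers = handlers

    def poll_once(self):
        """Pull queued operations, run them locally, and report the results."""
        handled = 0
        for request_id, operation in self._endpoint.fetch_pending():
            handler = self._handlers.get(operation)
            result = handler() if handler else f"unsupported operation: {operation}"
            self._endpoint.report(request_id, result)
            handled += 1
        return handled


if __name__ == "__main__":
    endpoint = ManagementEndpoint()
    endpoint.enqueue("req-1", "get-node-count")
    agent = WorkloadAgent(endpoint, {"get-node-count": lambda: 3})
    agent.poll_once()
    print(endpoint.results)
```

The point of the sketch is that `ManagementEndpoint` holds no reference to the agent: all calls flow from the workload side outwards, which is the property needed when inbound connectivity to the workload network is not allowed.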

This is something that we need, so I will start looking into this:

/assign

I'll start gathering ideas into a proposal doc.

richardcase commented 2 years ago

@randomvariable started a doc to add notes/thoughts to: https://docs.google.com/document/d/1j4sCPGO_0e1G-IyiI_8s98R3RVrYsgY9n0VFcde3ELo/edit#heading=h.lcjlkg7scook

fabriziopandini commented 2 years ago

/triage accepted

cc @fgutmann

fgutmann commented 1 year ago

As discussed in the office hours today, we want to form a feature group for a communication pattern between workload (WL) clusters and CAPI, so that the workload cluster and CAPI can be in separate networks.

From what I see there are multiple different use-cases brought up in this issue and the discussions around it:

1. A solution for managed clusters only, which does not require API-server connectivity from CAPI at all.
2. The "common cluster as a service offering" scenario where only the workload clusters' worker nodes are in a different network.
3. The scenario where a management cluster manages workload clusters (including control plane nodes), which are in a different network than the management cluster.

Are there any other use-cases that we want to consider?

Which use-cases should be part of the scope of the new feature group? Is number 1 above already covered in the managed Kubernetes feature group?

richardcase commented 1 year ago

I am interested in use case 3.

fgutmann commented 1 year ago

Also primarily interested in use case 3.

k8s-triage-robot commented 1 year ago

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

After 90d of inactivity, lifecycle/stale is applied
After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

Mark this issue or PR as fresh with /remove-lifecycle rotten
Close this issue or PR with /close
Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

richardcase commented 1 year ago

/remove-lifecycle rotten

richardcase commented 1 year ago

@fgutmann & anyone else interested, I have started the process of creating a feature group around this, see #7902.

jelmersnoeck commented 1 year ago

I am also interested, primarily use case 3.

fgutmann commented 1 year ago

I created a document to work on a proposal / present research findings. Every member of the kubernetes-sig-cluster-lifecycle Google group should be able to edit it.

CAPI - Managing Clusters in Disjoint Networks - Proposal (Google Docs)

From the conversation in this issue and the meeting of the alternative communication patterns feature group earlier today, it seems that most people are interested in the scenario of a management cluster that manages workload clusters which are in a different network. For that reason I am thinking of keeping the document specific to this case. We can tackle the other use-cases outlined in the comment above separately.

My plan is to fill in some user-stories and high level ideas / research results over the course of the next one or two weeks. Feel free to contribute to the doc at any time (e.g. add your own use-cases, etc.).

k8s-triage-robot commented 10 months ago

This issue has not been updated in over 1 year, and should be re-triaged.

You can:

Confirm that this issue is still relevant with /triage accepted (org members only)
Close this issue with /close

For more details on the triage process, see https://www.kubernetes.dev/docs/guide/issue-triage/

/remove-triage accepted

devantler commented 9 months ago

Any updates on this topic? I am also very interested - especially in the approach that uses a message broker.

fabriziopandini commented 7 months ago

/priority backlog

fabriziopandini commented 7 months ago

AFAIK the feature group is not active anymore.

@richardcase @fgutmann any update from this side?

fabriziopandini commented 6 months ago

The Cluster API project currently lacks enough active contributors to adequately respond to all issues and PRs. After talking with the folks driving the feature group, it seems that we have to table the discussion for now. We can always resurrect it in the future if someone has bandwidth and there is more traction around this idea.

/close

k8s-ci-robot commented 6 months ago

@fabriziopandini: Closing this issue.

In response to [this](https://github.com/kubernetes-sigs/cluster-api/issues/6520#issuecomment-2124896291):

>The Cluster API project currently lacks enough active contributors to adequately respond to all issues and PRs.
>After talking with the folks driving the feature group it seems that we have to table the discussion for now, we can always resurrect in the future if someone has bandwidth + there is more traction around this idea.
>
>/close

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes-sigs/prow](https://github.com/kubernetes-sigs/prow/issues/new?title=Prow%20issue:) repository.
