kubernetes / kops

Kubernetes Operations (kOps) - Production Grade k8s Installation, Upgrades and Management
https://kops.sigs.k8s.io/
Apache License 2.0

Add support for clusters without DNS or Gossip #14859

Open hakman opened 1 year ago

hakman commented 1 year ago

/kind feature

1. Describe IN DETAIL the feature/behavior/change you would like to see.

kOps requires DNS aliases for various services, such as kube-apiserver, kops-controller, etc. DNS is generally unreliable during cluster creation, and the prerequisites are not always easy to set up. Gossip is a nice alternative, but it requires cloud-provider permissions to list peers, and limiting those permissions to something reasonable is not always possible, as discussed in #14315.

Most supported cloud providers allow use of a load balancer for the Kubernetes API, which could also be used for clusters that don't use any form of DNS.

2. Feel free to provide a design supporting your feature request.

Load balancers are created at the same time as the cluster and are not recreated unless deleted. They usually provide a stable address, which can be used in /etc/hosts aliases for the kube-apiserver, kops-controller, etc. services.

To enable the feature, one could use the --dns=none flag on cluster creation. There would be no restrictions on the cluster name, as there are for Gossip.

Unfortunately each cloud provider has its own load balancer implementation and features, so how clusters without DNS work will be dependent on cloud provider features.
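To make the idea concrete, the aliases on each node could look roughly like the fragment below. The load balancer address and cluster name are made up for illustration, and the exact record names would depend on the provider integration:

```
# /etc/hosts entries maintained on each node, pointing the well-known
# service names at the cluster's API load balancer (illustrative values)
203.0.113.10  api.internal.my-cluster.example.com
203.0.113.10  kops-controller.internal.my-cluster.example.com
```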

johngmyers commented 1 year ago

@hakman close this out and open new issue for remaining clouds or retarget this to the 1.27 milestone?

hakman commented 1 year ago

I will close it before the release. Some of the tasks still need addressing before the 1.26 release.

ghost commented 1 year ago

Getting stable seeds pretty much requires knowing something in advance. A load balancer will not always route reliably to the desired server.

I saw etcd-manager using this public discovery service maintained by CoreOS (maybe Red Hat now?): https://etcd.io/docs/v3.5/dev-internal/discovery_protocol/#public-discovery-service

The first request returns a URL containing a random token, which is then used as an endpoint to PUT and GET seed info.

```shell
# Step 1: request a new discovery URL (contains a random token).
DISCO_SVC=$(curl -Ls https://discovery.etcd.io/new)
# Step 2: fetch the URL itself to list what is registered under it.
curl -Ls "${DISCO_SVC}"
# Step 3: PUT seed info under a key...
curl -Ls -X PUT "${DISCO_SVC}/key-name-here" -d value=value-goes-here
# Step 4: ...and GET it back.
curl -Ls "${DISCO_SVC}/key-name-here"
```
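The same four-step flow can be sketched in Python. The in-memory `DiscoveryStub` below is a made-up stand-in for the public service, so the example runs without any network access; only the protocol shape (mint a token URL, then PUT/GET keys under it) is taken from the discovery docs:

```python
# Sketch of the etcd public-discovery flow. DiscoveryStub is hypothetical:
# an in-memory dict standing in for https://discovery.etcd.io.
import uuid


class DiscoveryStub:
    """Mimics the discovery service: /new mints a token URL,
    then seed info is PUT/GET under keys below that URL."""

    def __init__(self):
        self.store = {}

    def new(self):
        # Step 1: mint a fresh token and return the discovery URL for it.
        token = uuid.uuid4().hex
        self.store[token] = {}
        return f"https://discovery.example/{token}"

    def put(self, url, key, value):
        # Step 3: register seed info under a key.
        self.store[url.rsplit("/", 1)[1]][key] = value

    def get(self, url, key):
        # Step 4: read the seed info back.
        return self.store[url.rsplit("/", 1)[1]][key]


svc = DiscoveryStub()
disco = svc.new()
svc.put(disco, "seed-0", "10.0.0.11:2380")
print(svc.get(disco, "seed-0"))  # → 10.0.0.11:2380
```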

This can be taken a step further by generating random public/private keys capable of public key encryption. Here are some notes I have on using openssl to perform a key exchange with Curve25519: https://gist.github.com/protosam/74c680988d60d53a959e962a705a2bd7

Between public key cryptography, root certificate signing, and a public kv store like that, I'm pretty sure a secure protocol could be derived.
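For illustration, a minimal sketch of such a key exchange using X25519 with openssl (file names are hypothetical; the gist linked above covers this in more detail):

```shell
# Hypothetical two-party X25519 key exchange, assuming OpenSSL >= 1.1.0.
openssl genpkey -algorithm X25519 -out alice.pem   # Alice's private key
openssl pkey -in alice.pem -pubout -out alice.pub  # Alice's public key
openssl genpkey -algorithm X25519 -out bob.pem     # Bob's private key
openssl pkey -in bob.pem -pubout -out bob.pub      # Bob's public key
# Each side combines its own private key with the peer's public key;
# both derivations yield the same 32-byte shared secret.
openssl pkeyutl -derive -inkey alice.pem -peerkey bob.pub | base64
openssl pkeyutl -derive -inkey bob.pem -peerkey alice.pub | base64
```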

Would pursuing this be of any interest?

k8s-triage-robot commented 1 year ago

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

- Mark this issue as fresh with /remove-lifecycle stale
- Close this issue with /close
- Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot commented 1 year ago

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

- Mark this issue as fresh with /remove-lifecycle rotten
- Close this issue with /close
- Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

ghost commented 1 year ago

/remove-lifecycle rotten

Am I allowed to issue / commands here? Bout to find out. 🫥

Looks like this is partially done. I didn't find anything else referencing the incomplete items, so it seems reasonable for this issue to serve as the tracking issue.

k8s-triage-robot commented 7 months ago

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

You can:

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

hakman commented 7 months ago

/remove-lifecycle rotten

hakman commented 7 months ago

/remove-lifecycle stale

k8s-triage-robot commented 4 months ago

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

You can:

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

hakman commented 4 months ago

/remove-lifecycle stale

k8s-triage-robot commented 1 month ago

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

You can:

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

Artiax commented 3 weeks ago

All of this combined is resulting in the following on cilium-agents:

```
time="2024-08-01T13:54:43Z" level=error msg="Unable to initialize local node. Retrying..." error="timeout while retrieving initial list of objects from kvstore" subsys=nodediscovery
```