Closed: sockyone closed this issue 3 months ago.
It does not happen with the same cluster.yaml + the RKE CLI.
Hey @sockyone, thanks for reporting the issue here and sorry you had trouble with this.
Have you double-checked the IPs you are specifying to the program? The logs list a localhost IP, which might not be what you want.
Alternatively, have you tried increasing the timeout on these actions to see if that makes a difference?
Also, looking at the docs for the RKE Cluster resource, it looks like it expects a number for the retention parameter under the etcd service: https://www.pulumi.com/registry/packages/rke/api-docs/cluster/#retention_nodejs. You have retention: "" - does changing this make a difference?
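For example (a sketch only; the linked docs are authoritative on whether this field takes a number or a duration string):

import * as rke from "@pulumi/rke"

// Sketch: either set retention explicitly, or drop the field entirely
// and let RKE apply its default.
const cluster = new rke.Cluster("cluster", {
    nodes: [], // your nodes here
    services: {
        etcd: {
            retention: "72h", // illustrative value; delete this line to use the RKE default
        },
    },
})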
If none of that works and you are still having problems, can I ask you for a full repro program of the issue? Ideally you'd include a minimal kube config which reproduces the issue, as well as any other required config for the RKE Cluster, so that we can run the program locally and see the same issue you are seeing.
@VenelinMartinov Thanks for your reply.
The retention parameter is empty, but the actual value in the cluster.yaml file is 72h (I think RKE has a default value for this field).
I'm on a VPN, so every local IP is reachable.
I'm still facing the problem. I'm using real machines, so I don't have a Minikube config; I'm using 2-3 Ubuntu 22.04 VMs.
About the code, there is nothing special: I only have the index.ts file I mentioned above and another file, renderNodes.ts:
import { types } from "@pulumi/rke"
import { NodeConfig } from "./interfaces"
import { Config } from "@pulumi/pulumi"

const sshUser = new Config("k8srke").require("sshUser")

const checkIfEmptyOrNull = (obj: null | undefined | any[]): boolean => {
    if (!obj) return true
    if (obj.length == 0) return true
    return false
}

export function renderNodes(nodeConfig: NodeConfig): types.input.ClusterNode[] {
    const nodes: types.input.ClusterNode[] = []
    // render etcd nodes
    if (nodeConfig.etcds) {
        for (let node of nodeConfig.etcds) {
            let _node: types.input.ClusterNode = {
                address: node.sshIp ? node.sshIp : node.internalIp,
                roles: ["etcd"],
                user: sshUser,
                port: node.sshPort ? node.sshPort : "22",
                // hostnameOverride: node.name,
                internalAddress: node.internalIp,
                labels: {
                    "node.rke.io/role": "etcd"
                }
            }
            nodes.push(_node)
        }
    }
    // render master nodes; masters double as etcd when no dedicated etcd nodes exist
    for (let node of nodeConfig.masters) {
        let _node: types.input.ClusterNode = {
            address: node.sshIp ? node.sshIp : node.internalIp,
            roles: checkIfEmptyOrNull(nodeConfig.etcds) ? ["controlplane", "etcd"] : ["controlplane"],
            user: sshUser,
            port: node.sshPort ? node.sshPort : "22",
            // hostnameOverride: node.name,
            internalAddress: node.internalIp,
            labels: {
                "node.rke.io/role": "master"
            }
        }
        nodes.push(_node)
    }
    // render worker nodes
    for (let workerGroup of nodeConfig.workerGroups) {
        for (let node of workerGroup.nodes) {
            let _node: types.input.ClusterNode = {
                address: node.sshIp ? node.sshIp : node.internalIp,
                roles: ["worker"],
                user: sshUser,
                port: node.sshPort ? node.sshPort : "22",
                // hostnameOverride: node.name,
                internalAddress: node.internalIp,
                labels: workerGroup.labels,
                taints: workerGroup.taints
            }
            nodes.push(_node)
        }
    }
    return nodes
}
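The referenced interfaces.ts wasn't included in the thread; a plausible shape, inferred from the stack config below, might be:

// Hypothetical interfaces.ts: a best guess at the shapes, inferred from
// the k8srke:nodes structure in the stack file; the real file wasn't shared.
import { types } from "@pulumi/rke"

export interface Node {
    name: string
    internalIp: string
    sshIp?: string
    sshPort?: string
}

export interface WorkerGroup {
    name: string
    labels?: { [key: string]: string }
    taints?: types.input.ClusterNodeTaint[]
    nodes: Node[]
}

export interface NodeConfig {
    etcds?: Node[]
    masters: Node[]
    workerGroups: WorkerGroup[]
}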
Stack YAML file:
config:
  rke:debug: true
  k8srke:cluster-name: staging
  k8srke:kubernetes-version: v1.26.14-rancher1-1
  k8srke:sshUser: infra
  k8srke:sshPrivateKeyPath: /tmp/infra
  k8srke:nodes:
    masters:
      - name: k8s-master-01
        internalIp: 192.168.2.21
    workerGroups:
      - name: app-group
        labels:
          kubernetes.io/node-group: app-group
        nodes:
          - name: k8s-worker-app-01
            internalIp: 192.168.2.23
  k8srke:services:
    kubeController:
      clusterCidr: 10.42.0.0/16
      serviceClusterIpRange: 10.43.0.0/16
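For anyone reproducing this, the structured config above would be consumed in index.ts roughly as follows (a sketch, not the reporter's exact code):

import { Config } from "@pulumi/pulumi"
import { NodeConfig } from "./interfaces"
import { renderNodes } from "./renderNodes"

// Sketch: read the structured k8srke:nodes value from the stack file
// above and turn it into rke.Cluster node inputs.
const rkeConfig = new Config("k8srke")
const nodeConfig = rkeConfig.requireObject<NodeConfig>("nodes")
const nodes = renderNodes(nodeConfig)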
Hi @sockyone, looking at your original Pulumi program:
services: {
    etcd: {
        retention: ""
    },
Have you tried removing the retention parameter here?
If that does not work, can you please include a minimal, full Pulumi program which exhibits the issue so we can reproduce it? Looking at the code you pasted, it still looks like it references other code.
You can create a new pulumi stack and try to reproduce the issue there, once you do please add the code here so we can run it ourselves and see what is going wrong.
Hi @VenelinMartinov, there is nothing special in my program; it looks like a starter program. You can try inserting your SSH key path into this code:
import * as pulumi from "@pulumi/pulumi"
import * as rke from "@pulumi/rke"

const rkeConfig = new pulumi.Config("k8srke")
const sshPrivateKeyPath = rkeConfig.require("sshPrivateKeyPath")

const cluster = new rke.Cluster("cluster", {
    nodes: [
        {
            address: '192.168.2.28',
            roles: [ 'controlplane', 'etcd' ],
            user: 'pyinfra',
            port: '22',
            internalAddress: '192.168.2.28',
            labels: { 'node.rke.io/role': 'master' }
        },
        {
            address: '192.168.2.23',
            roles: [ 'worker' ],
            user: 'pyinfra',
            port: '22',
            internalAddress: '192.168.2.23',
            labels: { 'kubernetes.io/node-group': 'app-group' },
            taints: undefined
        }
    ],
    kubernetesVersion: "v1.28.7-rancher1-1",
    // v1
    clusterName: "staging",
    sshKeyPath: sshPrivateKeyPath,
    ignoreDockerVersion: true,
    enableCriDockerd: true,
    services: {
        kubeController: {
            clusterCidr: "10.42.0.0/16",
        },
    },
    ingress: {
        provider: "none"
    },
    network: {
        plugin: "calico"
    }
})

export const kubeConfigYaml = cluster.kubeConfigYaml
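To verify the cluster comes up, the exported kubeconfig can be fed straight into a Kubernetes provider (a sketch, assuming @pulumi/kubernetes is installed):

import * as k8s from "@pulumi/kubernetes"

// Sketch: point a Kubernetes provider at the freshly created cluster and
// create a namespace as a smoke test.
const provider = new k8s.Provider("rke", {
    kubeconfig: cluster.kubeConfigYaml,
})

const smokeTest = new k8s.core.v1.Namespace("smoke-test", {}, { provider })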
Hi @sockyone - unfortunately I am still unable to create a cluster with the program as provided. I'm not a Rancher expert - is there some setup that is missing?
With a dedicated SSH key that has no passphrase, the provided code does not execute for me because an SSH tunnel cannot be established (timeout). Am I missing an initial cluster setup?
Failed running cluster err:Cluster must have at least one etcd plane host: please specify one or more etcd in cluster config
@guineveresaenger The timeout error is probably because you don't have access to the server, or you can't reach it. Do you actually have VMs with the IPs "192.168.2.23" and "192.168.2.28"? Can you replace them with your own server IPs? As for "please specify one or more etcd in cluster config": this error just means it can't reach the master host, so it skips it and then finds there is no master node in the cluster.
@sockyone - aha, as I suspected there are missing servers in my setup. As stated before, I'm not particularly familiar with Rancher. How are you setting up your servers? I had assumed your program was self-contained, i.e. would provision the servers in question.
@guineveresaenger About the servers: they are Ubuntu 22.04 + Docker. You can try setting up these VMs as containers on your local machine.
@sockyone - I'm really sorry but we do need a complete, easily runnable repro of your setup here, so we can determine if this is even a bug on the pulumi side. We don't have the operational resources to emulate your environment otherwise.
@guineveresaenger Are there any other ways? This is a provisioning tool, so I can't send you the whole infrastructure for debugging. I can give you more information about my setup if you need. I tested it with the same set of VMs and the cluster.yaml generated from Pulumi, and it works, so I don't think the problem is with my remote machines.
Apologies again - let me be clearer. Without a self-contained Pulumi program that reproduces the bug, we are unable to prioritize this issue. For example, the program could deploy some VMs and then provision a Rancher Kubernetes cluster on top of them. Without that, we don't have the expertise to quickly make progress on this issue.
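Such a repro could look roughly like the following (a sketch only: the AMI, instance type, key pair, and key path are hypothetical placeholders, and the image would need Docker preinstalled or installed via user data):

import * as aws from "@pulumi/aws"
import * as rke from "@pulumi/rke"

// Sketch of a self-contained repro: provision two Ubuntu VMs, then stand
// up an RKE cluster on top of them. All concrete values are placeholders.
const ami = "ami-0123456789abcdef0" // hypothetical Ubuntu 22.04 + Docker image
const keyName = "repro-key"         // hypothetical existing EC2 key pair

const master = new aws.ec2.Instance("master", {
    ami: ami,
    instanceType: "t3.medium",
    keyName: keyName,
})

const worker = new aws.ec2.Instance("worker", {
    ami: ami,
    instanceType: "t3.medium",
    keyName: keyName,
})

const cluster = new rke.Cluster("repro-cluster", {
    nodes: [
        {
            address: master.publicIp,
            internalAddress: master.privateIp,
            roles: ["controlplane", "etcd"],
            user: "ubuntu",
        },
        {
            address: worker.publicIp,
            internalAddress: worker.privateIp,
            roles: ["worker"],
            user: "ubuntu",
        },
    ],
    sshKeyPath: "/path/to/repro-key.pem", // hypothetical
    ignoreDockerVersion: true,
    enableCriDockerd: true,
})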
What happened?
After creating a new cluster successfully, any update gets the error: "etcd tls bad certificate".
Example
index.ts
Run this code to create a new cluster, then try to change anything in the configuration and rerun => it fails.
Output of pulumi about
Additional context
No response