pulumi / pulumi-rke

An RKE pulumi resource package, providing multi language access to RKE
Apache License 2.0

ETCD tls bad certificate #355

Closed sockyone closed 3 months ago

sockyone commented 5 months ago

What happened?

After the cluster is created successfully, any subsequent update fails with the error: "etcd tls bad certificate".

time="2024-05-09T21:51:19+07:00" level=info msg="Image [rancher/rke-tools:v0.1.96] exists on host [192.168.2.23]"
    time="2024-05-09T21:51:20+07:00" level=info msg="Starting container [rke-log-linker] on host [192.168.2.23], try #1"
    time="2024-05-09T21:51:21+07:00" level=info msg="[etcd] Successfully started [rke-log-linker] container on host [192.168.2.23]"
    time="2024-05-09T21:51:21+07:00" level=debug msg="[remove/rke-log-linker] Checking if container is running on host [192.168.2.23]"
    time="2024-05-09T21:51:21+07:00" level=debug msg="[remove/rke-log-linker] Removing container on host [192.168.2.23]"
    time="2024-05-09T21:51:21+07:00" level=info msg="Removing container [rke-log-linker] on host [192.168.2.23], try #1"
    time="2024-05-09T21:51:21+07:00" level=info msg="[remove/rke-log-linker] Successfully removed container on host [192.168.2.23]"
    time="2024-05-09T21:51:21+07:00" level=debug msg="[etcd] Successfully created log link for Container [etcd] on host [192.168.2.23]"
    time="2024-05-09T21:51:21+07:00" level=info msg="[etcd] Successfully started etcd plane.. Checking etcd cluster health"
    time="2024-05-09T21:51:21+07:00" level=debug msg="[etcd] check etcd cluster health on host [192.168.2.21]"
    time="2024-05-09T21:51:25+07:00" level=debug msg="[etcd] failed to check health for etcd host [192.168.2.21]: failed to get /health for host [192.168.2.21]: Get \"https://192.168.2.21:2379/health\": remote error: tls: bad certificate"
    time="2024-05-09T21:51:31+07:00" level=debug msg="[etcd] failed to check health for etcd host [192.168.2.21]: failed to get /health for host [192.168.2.21]: Get \"https://192.168.2.21:2379/health\": remote error: tls: bad certificate"
    time="2024-05-09T21:51:37+07:00" level=debug msg="[etcd] failed to check health for etcd host [192.168.2.21]: failed to get /health for host [192.168.2.21]: Get \"https://192.168.2.21:2379/health\": remote error: tls: bad certificate"
    time="2024-05-09T21:51:43+07:00" level=debug msg="[etcd] failed to check health for etcd host [192.168.2.21]: failed to get /health for host [192.168.2.21]: Get \"https://192.168.2.21:2379/health\": remote error: tls: bad certificate"
    time="2024-05-09T21:51:50+07:00" level=debug msg="[etcd] failed to check health for etcd host [192.168.2.21]: failed to get /health for host [192.168.2.21]: Get \"https://192.168.2.21:2379/health\": remote error: tls: bad certificate"
    time="2024-05-09T21:51:56+07:00" level=debug msg="[etcd] failed to check health for etcd host [192.168.2.21]: failed to get /health for host [192.168.2.21]: Get \"https://192.168.2.21:2379/health\": remote error: tls: bad certificate"
    time="2024-05-09T21:52:02+07:00" level=debug msg="[etcd] failed to check health for etcd host [192.168.2.21]: failed to get /health for host [192.168.2.21]: Get \"https://192.168.2.21:2379/health\": remote error: tls: bad certificate"
    time="2024-05-09T21:52:08+07:00" level=debug msg="[etcd] failed to check health for etcd host [192.168.2.21]: failed to get /health for host [192.168.2.21]: Get \"https://192.168.2.21:2379/health\": remote error: tls: bad certificate"
    time="2024-05-09T21:52:14+07:00" level=debug msg="[etcd] failed to check health for etcd host [192.168.2.21]: failed to get /health for host [192.168.2.21]: Get \"https://192.168.2.21:2379/health\": remote error: tls: bad certificate"
    time="2024-05-09T21:52:20+07:00" level=debug msg="[etcd] failed to check health for etcd host [192.168.2.21]: failed to get /health for host [192.168.2.21]: Get \"https://192.168.2.21:2379/health\": remote error: tls: bad certificate"
    time="2024-05-09T21:52:27+07:00" level=debug msg="[etcd] failed to check health for etcd host [192.168.2.21]: failed to get /health for host [192.168.2.21]: Get \"https://192.168.2.21:2379/health\": remote error: tls: bad certificate"
    time="2024-05-09T21:52:34+07:00" level=debug msg="[etcd] failed to check health for etcd host [192.168.2.21]: failed to get /health for host [192.168.2.21]: Get \"https://192.168.2.21:2379/health\": remote error: tls: bad certificate"
    time="2024-05-09T21:52:40+07:00" level=debug msg="[etcd] failed to check health for etcd host [192.168.2.21]: failed to get /health for host [192.168.2.21]: Get \"https://192.168.2.21:2379/health\": remote error: tls: bad certificate"
    time="2024-05-09T21:52:46+07:00" level=debug msg="[etcd] failed to check health for etcd host [192.168.2.21]: failed to get /health for host [192.168.2.21]: Get \"https://192.168.2.21:2379/health\": remote error: tls: bad certificate"
    time="2024-05-09T21:52:52+07:00" level=debug msg="[etcd] failed to check health for etcd host [192.168.2.21]: failed to get /health for host [192.168.2.21]: Get \"https://192.168.2.21:2379/health\": remote error: tls: bad certificate"
    time="2024-05-09T21:52:59+07:00" level=debug msg="[etcd] failed to check health for etcd host [192.168.2.21]: failed to get /health for host [192.168.2.21]: Get \"https://192.168.2.21:2379/health\": remote error: tls: bad certificate"
    time="2024-05-09T21:53:05+07:00" level=debug msg="[etcd] failed to check health for etcd host [192.168.2.21]: failed to get /health for host [192.168.2.21]: Get \"https://192.168.2.21:2379/health\": remote error: tls: bad certificate"
    time="2024-05-09T21:53:11+07:00" level=debug msg="[etcd] failed to check health for etcd host [192.168.2.21]: failed to get /health for host [192.168.2.21]: Get \"https://192.168.2.21:2379/health\": remote error: tls: bad certificate"
    time="2024-05-09T21:53:16+07:00" level=warning msg="[etcd] host [192.168.2.21] failed to check etcd health: failed to get /health for host [192.168.2.21]: Get \"https://192.168.2.21:2379/health\": remote error: tls: bad certificate"
    time="2024-05-09T21:53:16+07:00" level=debug msg="[etcd] check etcd cluster health on host [192.168.2.23]"
    time="2024-05-09T21:53:19+07:00" level=debug msg="[etcd] failed to check health for etcd host [192.168.2.23]: failed to get /health for host [192.168.2.23]: Get \"https://192.168.2.23:2379/health\": remote error: tls: bad certificate"
    time="2024-05-09T21:53:26+07:00" level=debug msg="[etcd] failed to check health for etcd host [192.168.2.23]: failed to get /health for host [192.168.2.23]: Get \"https://192.168.2.23:2379/health\": remote error: tls: bad certificate"
    time="2024-05-09T21:53:32+07:00" level=debug msg="[etcd] failed to check health for etcd host [192.168.2.23]: failed to get /health for host [192.168.2.23]: Get \"https://192.168.2.23:2379/health\": remote error: tls: bad certificate"
    time="2024-05-09T21:53:38+07:00" level=debug msg="[etcd] failed to check health for etcd host [192.168.2.23]: failed to get /health for host [192.168.2.23]: Get \"https://192.168.2.23:2379/health\": remote error: tls: bad certificate"
    time="2024-05-09T21:53:44+07:00" level=debug msg="[etcd] failed to check health for etcd host [192.168.2.23]: failed to get /health for host [192.168.2.23]: Get \"https://192.168.2.23:2379/health\": remote error: tls: bad certificate"
    time="2024-05-09T21:53:50+07:00" level=debug msg="[etcd] failed to check health for etcd host [192.168.2.23]: failed to get /health for host [192.168.2.23]: Get \"https://192.168.2.23:2379/health\": remote error: tls: bad certificate"
    time="2024-05-09T21:53:57+07:00" level=debug msg="[etcd] failed to check health for etcd host [192.168.2.23]: failed to get /health for host [192.168.2.23]: Get \"https://192.168.2.23:2379/health\": remote error: tls: bad certificate"
    time="2024-05-09T21:54:03+07:00" level=debug msg="[etcd] failed to check health for etcd host [192.168.2.23]: failed to get /health for host [192.168.2.23]: Get \"https://192.168.2.23:2379/health\": remote error: tls: bad certificate"
    time="2024-05-09T21:54:09+07:00" level=debug msg="[etcd] failed to check health for etcd host [192.168.2.23]: failed to get /health for host [192.168.2.23]: Get \"https://192.168.2.23:2379/health\": remote error: tls: bad certificate"
    time="2024-05-09T21:54:15+07:00" level=debug msg="[etcd] failed to check health for etcd host [192.168.2.23]: failed to get /health for host [192.168.2.23]: Get \"https://192.168.2.23:2379/health\": remote error: tls: bad certificate"
    time="2024-05-09T21:54:22+07:00" level=debug msg="[etcd] failed to check health for etcd host [192.168.2.23]: failed to get /health for host [192.168.2.23]: Get \"https://192.168.2.23:2379/health\": remote error: tls: bad certificate"
    time="2024-05-09T21:54:28+07:00" level=debug msg="[etcd] failed to check health for etcd host [192.168.2.23]: failed to get /health for host [192.168.2.23]: Get \"https://192.168.2.23:2379/health\": remote error: tls: bad certificate"
    time="2024-05-09T21:54:35+07:00" level=debug msg="[etcd] failed to check health for etcd host [192.168.2.23]: failed to get /health for host [192.168.2.23]: Get \"https://192.168.2.23:2379/health\": remote error: tls: bad certificate"
    time="2024-05-09T21:54:41+07:00" level=debug msg="[etcd] failed to check health for etcd host [192.168.2.23]: failed to get /health for host [192.168.2.23]: Get \"https://192.168.2.23:2379/health\": remote error: tls: bad certificate"
    time="2024-05-09T21:54:47+07:00" level=debug msg="[etcd] failed to check health for etcd host [192.168.2.23]: failed to get /health for host [192.168.2.23]: Get \"https://192.168.2.23:2379/health\": remote error: tls: bad certificate"
    time="2024-05-09T21:54:53+07:00" level=debug msg="[etcd] failed to check health for etcd host [192.168.2.23]: failed to get /health for host [192.168.2.23]: Get \"https://192.168.2.23:2379/health\": remote error: tls: bad certificate"
    time="2024-05-09T21:54:59+07:00" level=debug msg="[etcd] failed to check health for etcd host [192.168.2.23]: failed to get /health for host [192.168.2.23]: Get \"https://192.168.2.23:2379/health\": remote error: tls: bad certificate"
    time="2024-05-09T21:55:05+07:00" level=debug msg="[etcd] failed to check health for etcd host [192.168.2.23]: failed to get /health for host [192.168.2.23]: Get \"https://192.168.2.23:2379/health\": remote error: tls: bad certificate"
    time="2024-05-09T21:55:10+07:00" level=warning msg="[etcd] host [192.168.2.23] failed to check etcd health: failed to get /health for host [192.168.2.23]: Get \"https://192.168.2.23:2379/health\": remote error: tls: bad certificate"

Example

index.ts

Run this code to create a new cluster. Then change anything in the configuration and rerun: it fails.

import * as pulumi from "@pulumi/pulumi"
import * as rke from "@pulumi/rke"
import {NodeConfig} from "./interfaces"
import {renderNodes} from "./renderNodes"

const rkeConfig = new pulumi.Config("k8srke")
const clusterName = rkeConfig.require("cluster-name")
const kubernetesVersion = rkeConfig.require("kubernetes-version")
const sshPrivateKeyPath = rkeConfig.require("sshPrivateKeyPath")
const nodes = rkeConfig.requireObject<NodeConfig>("nodes")

const clusterNodes = renderNodes(nodes)

const cluster = new rke.Cluster("cluster", {
    nodes: clusterNodes,
    kubernetesVersion: kubernetesVersion,
    clusterName: clusterName,
    sshKeyPath: sshPrivateKeyPath,
    ignoreDockerVersion: true,
    enableCriDockerd: true,
    services: {
      etcd: {
        retention: ""
      },
    //   kubeApi: {
    //   },
      kubeController: {
        clusterCidr: "10.42.0.0/16",
      },
    },
    ingress: {
        provider: "none"
    },
    network: {
        plugin: "calico"
    }
})

export const kubeConfigYaml = cluster.kubeConfigYaml

Output of pulumi about

CLI          
Version      3.115.1
Go Version   go1.22.2
Go Compiler  gc

Plugins
KIND      NAME    VERSION
language  nodejs  unknown
resource  rke     3.4.0

Host     
OS       darwin
Version  14.4.1
Arch     arm64

This project is written in nodejs: executable='/opt/homebrew/bin/node' version='v20.12.2'

Current Stack: organization/rke/staging

TYPE                       URN
pulumi:pulumi:Stack        urn:pulumi:staging::rke::pulumi:pulumi:Stack::rke-staging
pulumi:providers:rke       urn:pulumi:staging::rke::pulumi:providers:rke::default_3_4_0
rke:index/cluster:Cluster  urn:pulumi:staging::rke::rke:index/cluster:Cluster::cluster

Found no pending operations associated with staging

Backend        
Name           Nams-MacBook-Pro.local
URL            s3://upbase-sre
User           namphan
Organizations  
Token type     personal

Dependencies:
NAME            VERSION
typescript      5.4.5
@pulumi/pulumi  3.113.3
@pulumi/rke     3.4.0
@types/node     18.19.31

Pulumi locates its logs in /var/folders/pg/tw39ydcd4zl6qm6_1pk8br1r0000gn/T/ by default

Additional context

No response

Contributing

Vote on this issue by adding a 👍 reaction. To contribute a fix for this issue, leave a comment (and link to your pull request, if you've opened one already).

sockyone commented 5 months ago

This does not happen with the same cluster.yaml and the RKE CLI.

VenelinMartinov commented 5 months ago

Hey @sockyone, thanks for reporting the issue here and sorry you had trouble with this.

Have you double-checked any IPs you are specifying to the program? The logs list a localhost IP which might not be what you want.

Alternatively have you tried increasing the timeout on these actions and seeing if that makes a difference?

Also looking at the docs for the RKE Cluster resource, it looks like it's expecting a number for the retention parameter under the etcd service: https://www.pulumi.com/registry/packages/rke/api-docs/cluster/#retention_nodejs. You have retention: "" - does changing this make a difference?
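For instance, a sketch of the services block with the empty string replaced, either by omitting retention entirely or by setting an explicit duration (the "72h" value here is an assumption borrowed from the RKE default mentioned later in the thread, not a verified fix):

```typescript
// Sketch only: the same services block with the empty retention value
// replaced. "72h" is an assumed value mirroring RKE's default etcd
// snapshot retention; omitting the field entirely is the other option.
services: {
    etcd: {
        retention: "72h",
    },
    kubeController: {
        clusterCidr: "10.42.0.0/16",
    },
},
```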

If none of that works and you are still having problems here, can I ask you for a full repro program of the issue? Ideally you'd include some minimal kube config which reproes the issue as long as any other required config to the RKE Cluster so that we can run the program locally and see the same issue you are seeing.

sockyone commented 5 months ago

@VenelinMartinov Thanks for your reply. The retention parameter is empty in the program, but the actual value in the generated cluster.yaml is 72h (I think RKE has a default for this field). I'm on a VPN, so every local IP is reachable, and I'm still facing the problem. These are real machines, so I don't have a Minikube config; I'm using two to three Ubuntu 22.04 VMs. There is nothing special about the code: I have only the index.ts file mentioned above and one other file, renderNodes.ts:

import {types} from "@pulumi/rke"
import { NodeConfig } from "./interfaces"
import { Input, Config } from "@pulumi/pulumi"

const sshUser = new Config("k8srke").require("sshUser")

const checkIfEmptyOrNull = (obj : null|any[]) : boolean => {
    if (!obj) return true
    if (obj.length == 0) return true

    return false
}

export function renderNodes(nodeConfig: NodeConfig) : types.input.ClusterNode[] {
    const nodes = []

    // render etcd nodes
    if (nodeConfig.etcds) {
        for (let node of nodeConfig.etcds) {
            let _node : types.input.ClusterNode = {
                address: node.sshIp ? node.sshIp : node.internalIp,
                roles: ["etcd"],
                user: sshUser,
                port: node.sshPort ? node.sshPort : "22",
                // hostnameOverride: node.name,
                internalAddress: node.internalIp,
                labels: {
                    "node.rke.io/role": "etcd"
                }
            }
            nodes.push(_node)
        }
    }

    // render master nodes
    for (let node of nodeConfig.masters) {
        let _node : types.input.ClusterNode = {
            address: node.sshIp ? node.sshIp : node.internalIp,
            roles: checkIfEmptyOrNull(nodeConfig.etcds) ? ["controlplane", "etcd"] : ["controlplane"],
            user: sshUser,
            port: node.sshPort ? node.sshPort : "22",
            // hostnameOverride: node.name,
            internalAddress: node.internalIp,
            labels: {
                "node.rke.io/role": "master"
            }
        }
        nodes.push(_node)
    }

    //render worker nodes
    for (let workerGroup of nodeConfig.workerGroups) {
        for (let node of workerGroup.nodes) {
            let _node : types.input.ClusterNode = {
                address: node.sshIp ? node.sshIp : node.internalIp,
                roles: ["worker"],
                user: sshUser,
                port: node.sshPort ? node.sshPort : "22",
                // hostnameOverride: node.name,
                internalAddress: node.internalIp,
                labels: workerGroup.labels,
                taints: workerGroup.taints
            }
            nodes.push(_node)
        }
    }

    return nodes
}
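The defaulting rules above (SSH address falling back to the internal IP, port falling back to "22", and masters absorbing the etcd role when no dedicated etcd nodes exist) can be checked in isolation with a minimal, Pulumi-free sketch; the Node shape here is a hypothetical stand-in for the NodeConfig interfaces, which are not shown in the thread:

```typescript
// Minimal sketch of the defaulting logic used in renderNodes.
// The Node interface is an assumption for illustration only.
interface Node { internalIp: string; sshIp?: string; sshPort?: string }

// SSH address falls back to the internal IP when no sshIp is given.
const sshAddress = (n: Node): string => n.sshIp ?? n.internalIp;

// SSH port falls back to "22" when no sshPort is given.
const sshPort = (n: Node): string => n.sshPort ?? "22";

// Masters take the etcd role only when there are no dedicated etcd nodes.
const masterRoles = (etcds?: Node[]): string[] =>
    etcds && etcds.length > 0 ? ["controlplane"] : ["controlplane", "etcd"];

console.log(sshAddress({ internalIp: "192.168.2.21" })); // 192.168.2.21
console.log(sshPort({ internalIp: "192.168.2.21" }));    // 22
console.log(masterRoles());                              // [ 'controlplane', 'etcd' ]
console.log(masterRoles([{ internalIp: "192.168.2.22" }])); // [ 'controlplane' ]
```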

Stack yaml file:

config:
  rke:debug: true
  k8srke:cluster-name: staging
  k8srke:kubernetes-version: v1.26.14-rancher1-1
  k8srke:sshUser: infra
  k8srke:sshPrivateKeyPath: /tmp/infra
  k8srke:nodes:
    masters:
    - name: k8s-master-01
      internalIp: 192.168.2.21
    workerGroups:
    - name: app-group
      labels:
        kubernetes.io/node-group: app-group
      nodes:
      - name: k8s-worker-app-01
        internalIp: 192.168.2.23
  k8srke:services:
    kubeController:
      clusterCidr: 10.42.0.0/16
      serviceClusterIpRange: 10.43.0.0/16

VenelinMartinov commented 5 months ago

Hi @sockyone, looking at your original pulumi-program:

services: {
      etcd: {
        retention: ""
      },

Have you tried removing the retention parameter here?

If that does not work can you please include a minimal full pulumi program which exhibits the issue so we can reproduce it? Looking at the code you pasted it still looks like it references other code.

You can create a new pulumi stack and try to reproduce the issue there, once you do please add the code here so we can run it ourselves and see what is going wrong.

sockyone commented 5 months ago

hi @VenelinMartinov, there is nothing special in my program; it looks like a starter program. You can insert your own SSH key path into this code:

import * as pulumi from "@pulumi/pulumi"
import * as rke from "@pulumi/rke"

const rkeConfig = new pulumi.Config("k8srke")
const sshPrivateKeyPath = rkeConfig.require("sshPrivateKeyPath")

const cluster = new rke.Cluster("cluster", {
    nodes: [
        {
          address: '192.168.2.28',
          roles: [ 'controlplane', 'etcd' ],
          user: 'pyinfra',
          port: '22',
          internalAddress: '192.168.2.28',
          labels: { 'node.rke.io/role': 'master' }
        },
        {
          address: '192.168.2.23',
          roles: [ 'worker' ],
          user: 'pyinfra',
          port: '22',
          internalAddress: '192.168.2.23',
          labels: { 'kubernetes.io/node-group': 'app-group' },
          taints: undefined
        }
    ],
    kubernetesVersion: "v1.28.7-rancher1-1",
    // v1
    clusterName: "staging",
    sshKeyPath: sshPrivateKeyPath,
    ignoreDockerVersion: true,
    enableCriDockerd: true,
    services: {
      kubeController: {
        clusterCidr: "10.42.0.0/16",
      },
    },
    ingress: {
        provider: "none"
    },
    network: {
        plugin: "calico"
    }
})

export const kubeConfigYaml = cluster.kubeConfigYaml

guineveresaenger commented 5 months ago

Hi @sockyone - unfortunately I am still unable to generate a cluster with the program as-provided. I'm not a Rancher expert - is there some setup that is missing?

With a dedicated, un-passphrased ssh key, the code provided does not execute for me because an SSH tunnel cannot be established (timeout). Am I missing an initial cluster setup?

Failed running cluster err:Cluster must have at least one etcd plane host: please specify one or more etcd in cluster config

sockyone commented 5 months ago

@guineveresaenger The timeout error may be because you don't have access to the server, or can't reach it. Do you have VMs with the IPs "192.168.2.23" and "192.168.2.28"? You can replace them with your own server IPs. As for "please specify one or more etcd in cluster config": that error just means the master host is unreachable, so RKE skips it and then finds no master node in the cluster.

guineveresaenger commented 5 months ago

@sockyone - aha, as I suspected, there are missing servers in my setup. As stated before, I'm not particularly familiar with Rancher. How are you setting up your servers? I had assumed your program was self-contained, i.e. that it would provision the servers in question.

sockyone commented 5 months ago

@guineveresaenger About the servers: they are Ubuntu 22.04 with Docker installed. You could try setting up these VMs as containers on your local machine.

guineveresaenger commented 5 months ago

@sockyone - I'm really sorry but we do need a complete, easily runnable repro of your setup here, so we can determine if this is even a bug on the pulumi side. We don't have the operational resources to emulate your environment otherwise.

sockyone commented 5 months ago

@guineveresaenger Are there any other ways? This is a provisioning tool, so I can't send you the whole infrastructure for debugging. I can give you more information about my setup if you need it. I tested with the same set of VMs and the cluster.yaml generated by Pulumi, and it works, so I don't think the problem is with my remote machines.

guineveresaenger commented 5 months ago

Apologies again - let me be more clear. Without a self-contained Pulumi program that reproduces the bug, we are unable to prioritize this issue. For example, the program could deploy some vms and then provision a rancher k8s cluster on top of them. Without that, we don’t have the expertise to quickly make progress on this issue.