siderolabs / terraform-provider-talos

talos_cluster_health also checks destroyed VM's IP #166

Closed: hakobian4 closed this issue 4 months ago

hakobian4 commented 4 months ago

Bug Report

Description

I created a Kubernetes cluster with 3 nodes using Terraform: 1 control plane and 2 worker nodes. The terraform apply completed successfully. After that I decided to remove one of the worker nodes and decreased the worker node count. Then I saw the error below. The problem is that talos_cluster_health still remembers the removed node's IP address (192.168.2.246). I have checked the state file and the address isn't there. Has anyone come across this issue?
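
A minimal sketch of the kind of change that triggered this (the variable and resource arguments below are illustrative assumptions, not the reporter's actual code): the worker VMs are created from a count, and that count was lowered from 2 to 1.

variable "worker_count" {
  type    = number
  default = 1 # was 2 before removing the second worker
}

resource "proxmox_vm_qemu" "worker" {
  count       = var.worker_count
  name        = "talos-worker-${count.index}"
  target_node = "pve" # placeholder Proxmox node name
  # ... remaining Proxmox VM settings elided ...
}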

Logs

data.talos_cluster_health.cluster_health["armen"]: Still reading... [4m30s elapsed]
data.talos_cluster_health.cluster_health["armen"]: Still reading... [4m40s elapsed]
data.talos_cluster_health.cluster_health["armen"]: Still reading... [4m50s elapsed]
data.talos_cluster_health.cluster_health["armen"]: Still reading... [5m0s elapsed]
╷
│ Error: health check messages:
│ discovered nodes: ["192.168.2.245" "192.168.2.244"]
│ waiting for etcd to be healthy: ...
│ waiting for etcd to be healthy: OK
│ waiting for etcd members to be consistent across nodes: ...
│ waiting for etcd members to be consistent across nodes: OK
│ waiting for etcd members to be control plane nodes: ...
│ waiting for etcd members to be control plane nodes: OK
│ waiting for apid to be ready: ...
│ waiting for apid to be ready: OK
│ waiting for all nodes memory sizes: ...
│ waiting for all nodes memory sizes: OK
│ waiting for all nodes disk sizes: ...
│ waiting for all nodes disk sizes: OK
│ waiting for kubelet to be healthy: ...
│ waiting for kubelet to be healthy: OK
│ waiting for all nodes to finish boot sequence: ...
│ waiting for all nodes to finish boot sequence: OK
│ waiting for all k8s nodes to report: ...
│ waiting for all k8s nodes to report: unexpected nodes with IPs ["192.168.2.246"]
│ 
│ 
│   with data.talos_cluster_health.cluster_health["armen"],
│   on talos.tf line 94, in data "talos_cluster_health" "cluster_health":
│   94: data "talos_cluster_health" "cluster_health" {
│ 
│ rpc error: code = DeadlineExceeded desc = context deadline exceeded

Environment

smira commented 4 months ago

When you remove a Talos machine without resetting it (talosctl reset), it stays in the Discovery Service for a maximum TTL of 30 minutes. This issue should probably go to the TF provider: instead of relying on Talos self-discovery, the provider could pass the exact cluster topology (the expected number of nodes and their addresses), which would fix this issue.

There's nothing to fix on the Talos side here, so please open an issue for the TF provider.

frezbo commented 4 months ago

The TF provider already supports passing in the worker and control plane node IPs.
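
For illustration, a minimal sketch of what that can look like (the addresses and the control plane/worker split below are placeholders, and the secrets resource name follows the snippet posted later in this thread; none of this is the reporter's verified configuration):

data "talos_cluster_health" "example" {
  client_configuration = talos_machine_secrets.talos_secrets.client_configuration

  # List every expected node explicitly so the health check validates the
  # intended topology rather than whatever discovery currently reports.
  endpoints           = ["192.168.2.245"]
  control_plane_nodes = ["192.168.2.245"]
  worker_nodes        = ["192.168.2.244"]
}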

edikmkoyan commented 4 months ago

@smira 1. As @frezbo said, the provider passes the node IPs. 2. The issue persists after 30 minutes; thus, there must be an issue with the self-detection mechanism.

smira commented 4 months ago

> @smira 1. As @frezbo said, the provider passes the node IPs. 2. The issue persists after 30 minutes; thus, there must be an issue with the self-detection mechanism.

If you pass node IPs, there's no self-detection, so the issue is on your side if you pass them incorrectly.

If you don't pass node IPs and there is an issue with discovery, then it's something in your cluster, and it depends on how you configured it. But if you did a proper talosctl reset before removing the node, there would be no issue. I don't know how your cluster is configured; e.g. you could be using Kubernetes discovery and never removed the node.

There's nothing here so far to fix on Talos side.

Run talosctl get members: if it shows stale nodes, start digging from there.

hakobian4 commented 4 months ago

@smira I have passed the correct IP addresses, but it keeps checking the IP address of the VM that was removed.

frezbo commented 4 months ago

Have you passed in both control plane and worker node IPs?

hakobian4 commented 4 months ago

@frezbo Yes, I have.

frezbo commented 4 months ago

Could you post the snippet of code used for talos_cluster_health?

hakobian4 commented 4 months ago

data "talos_cluster_health" "cluster_health" {
  for_each = local.clusters

  client_configuration = talos_machine_secrets.talos_secrets.client_configuration
  control_plane_nodes  = [proxmox_vm_qemu.controlplane[each.key].default_ipv4_address]
  worker_nodes         = [for worker in proxmox_vm_qemu.worker : worker.default_ipv4_address]
  endpoints            = [proxmox_vm_qemu.controlplane[each.key].default_ipv4_address]
  timeouts = {
    read = "1h"
  }
}

edikmkoyan commented 3 months ago

@frezbo can you please reflect on this? https://github.com/siderolabs/terraform-provider-talos/issues/166#issuecomment-2160392283

frezbo commented 3 months ago

Could you try this version: https://github.com/siderolabs/terraform-provider-talos/releases/tag/v0.6.0-alpha.0
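
For reference, a sketch of pinning that pre-release in the Terraform configuration (siderolabs/talos is the provider's Terraform Registry source; a pre-release version has to be pinned exactly):

terraform {
  required_providers {
    talos = {
      source  = "siderolabs/talos"
      version = "0.6.0-alpha.0"
    }
  }
}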