Closed hakobian4 closed 5 months ago
When you remove a Talos machine without resetting it (talosctl reset), it stays in the Discovery Service for a maximum TTL of 30 minutes. This issue probably belongs in the TF provider: instead of relying on Talos self-detection, the provider could pass the exact cluster topology (the expected number of nodes and their addresses), which would fix this.
There's nothing to fix on Talos side here, so please open an issue for the TF provider.
The TF provider already supports passing in the worker and control plane node IPs.
@smira 1. As @frezbo said, the provider passes the node IPs. 2. The issue persists after 30 minutes; thus, there must be an issue with the self-detection mechanism.
If you pass node IPs, there's no self-detection. So the issue is on your side if you pass it wrong.
If you don't pass node IPs and there's an issue with discovery, it's something in your cluster and depends on how you configured it. But if you had done a proper talosctl reset before removing the node, there would be no issue. I don't know how your cluster is configured; e.g., you could be using Kubernetes discovery and didn't remove the node.
There's nothing here so far to fix on Talos side.
Run talosctl get members: if it shows stale nodes, start digging from there.
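The two commands mentioned above (talosctl get members and talosctl reset) can be sketched as a small script. This is only an illustration, not an official procedure: the control plane IP and node name are hypothetical placeholders, the worker IP is the one from the bug report, and the script defaults to a dry run that prints each command instead of executing it. Note that reset only helps before the VM is destroyed; for a VM that is already gone, the members list is the place to start.

```shell
#!/bin/sh
# Sketch only: the values below are placeholders, adjust them for your cluster.
CONTROL_PLANE_IP="${CONTROL_PLANE_IP:-192.0.2.10}"   # hypothetical control plane address
WORKER_IP="${WORKER_IP:-192.168.2.246}"              # worker IP taken from the bug report
NODE_NAME="${NODE_NAME:-worker-2}"                   # hypothetical Kubernetes node name

# Print the command when DRY_RUN=1 (the default); execute it otherwise.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# List cluster members as seen by a control plane node; look for stale entries.
list_members() {
  run talosctl --nodes "$CONTROL_PLANE_IP" get members
}

# Reset the worker BEFORE its VM is removed, so it leaves the cluster cleanly.
reset_node() {
  run talosctl --nodes "$WORKER_IP" reset
}

# Remove the Kubernetes Node object if it still lingers after the reset.
delete_k8s_node() {
  run kubectl delete node "$NODE_NAME"
}
```

A possible order is reset_node, then delete_k8s_node, then list_members to confirm the removed IP no longer appears, and only then lowering the worker count in Terraform. Set DRY_RUN=0 to actually execute the commands.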
@smira I have passed the correct IP addresses, but it keeps checking the IP address of the VM that was removed.
Have you passed in both control plane and worker node ips?
@frezbo Yes, I have.
Could you post the snippet of code used for talos_cluster_health?
data "talos_cluster_health" "cluster_health" {
  for_each             = local.clusters
  client_configuration = talos_machine_secrets.talos_secrets.client_configuration
  control_plane_nodes  = [proxmox_vm_qemu.controlplane[each.key].default_ipv4_address]
  worker_nodes         = [for worker in proxmox_vm_qemu.worker : worker.default_ipv4_address]
  endpoints            = [proxmox_vm_qemu.controlplane[each.key].default_ipv4_address]
  timeouts = {
    read = "1h"
  }
}
@frezbo can you please reflect on this? https://github.com/siderolabs/terraform-provider-talos/issues/166#issuecomment-2160392283
could you try this version https://github.com/siderolabs/terraform-provider-talos/releases/tag/v0.6.0-alpha.0
Bug Report
Description
I created a Kubernetes cluster with 3 nodes using Terraform: 1 control plane and 2 worker nodes. The terraform apply completed successfully. After that, I decided to remove one of the worker nodes and decreased the worker node count. Then I saw this error. The problem is that talos_cluster_health still remembers the removed node's IP address (192.168.2.246). I have checked the state file, and it isn't there. Has anyone come across this issue?
Logs
Environment
v1.7.1
v1.30.0