rsmitty opened 6 months ago
Looking a little further, this seems to come from this call: https://github.com/siderolabs/talos-cloud-controller-manager/blob/main/pkg/talos/instances.go#L64
This in turn calls https://github.com/siderolabs/talos-cloud-controller-manager/blob/main/pkg/talos/client.go#L67, so I'm wondering if this is actually something in COSI. Also, notice that the COSI version in go.mod is quite old.
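If a client/connection is created per call in that path and never closed, every call would leak a file descriptor. A minimal, self-contained sketch of that general pattern (plain TCP on Linux stands in for the Talos/COSI client here; this is an assumption about the failure mode, not the actual CCM code):

```go
package main

import (
	"fmt"
	"net"
	"os"
)

// openFDs counts this process's open file descriptors
// by listing /proc/self/fd (Linux only).
func openFDs() int {
	entries, err := os.ReadDir("/proc/self/fd")
	if err != nil {
		return -1
	}
	return len(entries)
}

func main() {
	// A local listener stands in for the Talos API endpoint.
	ln, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		panic(err)
	}
	defer ln.Close()

	before := openFDs()

	// Leaky pattern: a new connection per "call", never closed.
	var leaked []net.Conn
	for i := 0; i < 10; i++ {
		c, err := net.Dial("tcp", ln.Addr().String())
		if err != nil {
			panic(err)
		}
		leaked = append(leaked, c) // held open, never closed
	}
	// Grows by one fd per un-closed call.
	fmt.Println("fds above baseline after leaky calls:", openFDs()-before)

	// Fixed pattern: close (or defer Close) after each call.
	for i := 0; i < 10; i++ {
		c, err := net.Dial("tcp", ln.Addr().String())
		if err != nil {
			panic(err)
		}
		c.Close()
	}
	// Unchanged: the fixed loop adds nothing.
	fmt.Println("fds above baseline after fixed calls:", openFDs()-before)
}
```

On a cluster that churns several hundred nodes, a per-node-sync leak like this would climb steadily until a limit is hit, which matches the reported symptom.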
Thank you for the bug report.
I've checked all my clusters and did not find any file descriptor leaks, probably because my clusters do not scale up/down very often.
Let's update the dependencies first, and I will collect file descriptor statistics.
Could you add more details, please?
Which Talos version are you using, which Talos CCM commit hash, and is the CCM deployed as a DaemonSet or a Deployment?
Thanks
I see you already bumped the dependencies, but just to make sure you've got the info: for this customer, CCM is a deployment, Talos version is 1.6.7, and CCM version is latest release (1.4.0).
Oh, release 1.4.0... please try the edge version.
This issue is stale because it has been open 180 days with no activity. Remove stale label or comment or this will be closed in 14 days.
@rsmitty was this fixed?
Unsure if this is quite a bug yet, but with a customer using the CCM, we're seeing the following in a cluster that scales up and down by several hundred nodes fairly often:
The file-max value is very large (13 million+), so I'm doubtful this is a sysctl setting problem. In googling around, we did see that
/proc/sys/fs/inotify/max_user_instances
was 8192 and could be related to an error like this. Either way, it feels like there may be somewhere we're not closing connections in the CCM that causes us to hit some limit?
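As a quick sanity check on which ceiling is in play, a small sketch (Linux procfs assumed) that dumps file-max alongside the inotify limits:

```go
package main

import (
	"fmt"
	"os"
	"strconv"
	"strings"
)

// readSysctl reads a numeric sysctl value from procfs (Linux only).
func readSysctl(path string) (int64, error) {
	b, err := os.ReadFile(path)
	if err != nil {
		return 0, err
	}
	return strconv.ParseInt(strings.TrimSpace(string(b)), 10, 64)
}

func main() {
	for _, p := range []string{
		"/proc/sys/fs/file-max",
		"/proc/sys/fs/inotify/max_user_instances",
		"/proc/sys/fs/inotify/max_user_watches",
	} {
		v, err := readSysctl(p)
		if err != nil {
			fmt.Println(p, "unreadable:", err)
			continue
		}
		fmt.Println(p, "=", v)
	}
}
```

If max_user_instances (8192 here) is the limit being hit, raising it would only mask the symptom; a connection leak in the CCM would still reproduce eventually.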