siderolabs / talos-cloud-controller-manager

Generic cloud controller manager for hybrid deployments using Talos OS
MIT License

CCM using too many open files? #141

Open rsmitty opened 6 months ago

rsmitty commented 6 months ago

Unsure if this is a bug quite yet, but with a customer using the CCM, we're seeing the following in a cluster that frequently scales up and down by several hundred nodes:

E0410 21:54:21.057323       1 node_controller.go:277] Error getting instance metadata for node addresses: error getting metadata from the node ip-10-2-55-138.ec2.internal: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing: dial tcp 10.2.55.138:50000: socket: too many open files"
E0410 21:54:21.057629       1 node_controller.go:277] Error getting instance metadata for node addresses: error getting metadata from the node ip-10-2-85-77.ec2.internal: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing: dial tcp 10.2.85.77:50000: socket: too many open files"

The file-max value is very large (13 million+), so I doubt this is a sysctl setting problem. While googling around, we did see that /proc/sys/fs/inotify/max_user_instances was 8192, which could be related to an error like this.

But either way, it feels like maybe there's somewhere we're not closing connections in the CCM, which could cause us to hit some limit?
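A note on which limit is likely involved: the text "too many open files" is the message for EMFILE, the per-process descriptor limit (RLIMIT_NOFILE). Exhausting the system-wide fs.file-max would instead surface as ENFILE, "too many open files in system". A minimal Go sketch for printing the per-process limit is below. It assumes Linux, and the helper name `nofileLimit` is made up for illustration; it is not part of the CCM.

```go
package main

import (
	"fmt"
	"syscall"
)

// nofileLimit is a hypothetical helper that returns the soft and hard
// RLIMIT_NOFILE values of the current process.
func nofileLimit() (soft, hard uint64, err error) {
	var lim syscall.Rlimit
	if err := syscall.Getrlimit(syscall.RLIMIT_NOFILE, &lim); err != nil {
		return 0, 0, err
	}
	// The soft limit is the one that triggers EMFILE when it is reached.
	return uint64(lim.Cur), uint64(lim.Max), nil
}

func main() {
	soft, hard, err := nofileLimit()
	if err != nil {
		fmt.Println("getrlimit failed:", err)
		return
	}
	fmt.Printf("RLIMIT_NOFILE soft=%d hard=%d\n", soft, hard)
}
```

If the soft limit inside the CCM container turns out to be far below file-max, a slow descriptor leak would hit it long before the system-wide value matters.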

rsmitty commented 6 months ago

Looking a little further, this seems to come from this call: https://github.com/siderolabs/talos-cloud-controller-manager/blob/main/pkg/talos/instances.go#L64

This in turn calls https://github.com/siderolabs/talos-cloud-controller-manager/blob/main/pkg/talos/client.go#L67. So I'm wondering if this is actually something in COSI. Also note that the COSI version in go.mod is quite old.

sergelogvinov commented 6 months ago

Thank you for the bug report.

I've checked all my clusters and did not find any file descriptor leaks, probably because my clusters do not scale up/down very often.

Let's update the dependencies first, and I will collect file descriptor statistics.

sergelogvinov commented 6 months ago

Can you add more details, please?

Which Talos version are you using, which Talos CCM commit hash, and how is the CCM deployed (DaemonSet or Deployment)?

Thanks

rsmitty commented 6 months ago

I see you already bumped the dependencies, but just to make sure you've got the info: for this customer, CCM is a deployment, Talos version is 1.6.7, and CCM version is latest release (1.4.0).

sergelogvinov commented 6 months ago

> I see you already bumped the dependencies, but just to make sure you've got the info: for this customer, CCM is a deployment, Talos version is 1.6.7, and CCM version is latest release (1.4.0).

Oh, release (1.4.0)... please try the edge version.

github-actions[bot] commented 4 days ago

This issue is stale because it has been open 180 days with no activity. Remove stale label or comment or this will be closed in 14 days.

DmitriyMV commented 4 days ago

@rsmitty was this fixed?