kubernetes / cloud-provider-openstack

[occm] all rules erroneously removed from all lb-sg security groups for unknown reason, breaking access through all load balancers #2699

Closed judge-red closed 1 week ago

judge-red commented 1 month ago

Is this a BUG REPORT or FEATURE REQUEST?:

Uncomment only one, leave it on its own line:

/kind bug /kind feature

What happened:

The OCCM removed all rules from all lb-sg security groups for no apparent reason. This happened 5 times in total last week on 2 clusters, but never before and never since. We are not sure what could have triggered this behavior, but it caused complete downtime for everything exposed by a load balancer. Luckily the resolution is simple and quick: restarting the OCCM leads to immediate re-creation of the correct rules.

What you expected to happen:

The OCCM should not delete security group rules that are (still) required.

How to reproduce it:

Unfortunately, the trigger isn't known. Maybe an action by the cluster-autoscaler was involved: we see in the logs (including the OCCM's log) that a node was added just before this happened, but adding nodes alone doesn't seem to trigger the issue.

Anything else we need to know?:

I'll attach the OCCM log that covers one occurrence. Verbosity was set to 4. Unfortunately we only have the log of the last occurrence, even though it happened 3 times on one cluster and 2 times on another. Weirdly, it happened at almost the same time (within an hour or so) on both clusters, hence the idea that an overloaded OpenStack API or network might have exposed this issue. The third time, we had the cluster-autoscaler disabled on one cluster and it only happened on the other cluster, hence the idea that the autoscaler's actions, i.e. a node addition (or maybe removal), might also be a contributing factor.

openstack-cloud-controller-manager-kmswx.log

Environment:

Edit: the logs of the comments below can be found in this folder: https://drive.switch.ch/index.php/s/IMoXWTaJWrfdB5Q

judge-red commented 3 weeks ago

This just happened again and suspiciously it again happened just after the cluster-autoscaler downscaled a nodegroup and had a worker node deleted.

cluster-autoscaler started the scale down at I1030 10:32:24.538958. Most relevant log line seems to be:

I1030 10:32:25.140710       1 actuator.go:215] Scale-down: removing node sck-production-scc-zhw-pool-ubuntu-jammy-1-7b7ffdfdd8-2422k, utilization: {0.1125 0.045306618034575155 0 cpu 0.1125}, pods to reschedule: apiserver-84ccf6f5bc-4tztl

Looking at the machine-controller, we can furthermore see that this node was short-lived. It was created at 2024-10-30T10:07:25.023Z and gone by 2024-10-30T10:33:22.458Z, maybe this has something to do with this issue?

{"level":"info","time":"2024-10-30T10:07:25.023Z","logger":"machine-controller","caller":"machine/controller.go:945","msg":"Created machine at cloud provider","machine":"kube-system/sck-production-scc-zhw-pool-ubuntu-jammy-1-7b7ffdfdd8-2422k","provider":"openstack"}
[...]
{"level":"info","time":"2024-10-30T10:33:06.776Z","logger":"machine-controller","caller":"eviction/eviction.go:66","msg":"Starting to evict node","machine":"kube-system/sck-production-scc-zhw-pool-ubuntu-jammy-1-7b7ffdfdd8-2422k","provider":"openstack","node":"sck-production-scc-zhw-pool-ubuntu-jammy-1-7b7ffdfdd8-2422k"}
{"level":"info","time":"2024-10-30T10:33:20.579Z","logger":"machine-controller","caller":"eviction/eviction.go:66","msg":"Starting to evict node","machine":"kube-system/sck-production-scc-zhw-pool-ubuntu-jammy-1-7b7ffdfdd8-2422k","provider":"openstack","node":"sck-production-scc-zhw-pool-ubuntu-jammy-1-7b7ffdfdd8-2422k"}
{"level":"error","time":"2024-10-30T10:33:22.458Z","logger":"machine-controller","caller":"machine/controller.go:410","msg":"Reconciling failed","machine":"kube-system/sck-production-scc-zhw-pool-ubuntu-jammy-1-7b7ffdfdd8-2422k","error":"failed to get machine: Machine.cluster.k8s.io \"sck-production-scc-zhw-pool-ubuntu-jammy-1-7b7ffdfdd8-2422k\" not found"}
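As a quick check on "short-lived", the node's lifetime can be computed from the two machine-controller timestamps quoted above. This is only a small Python sketch; the `lifetime` helper and `FMT` constant are illustrative names, not part of any of the tools involved.

```python
from datetime import datetime, timedelta

# Timestamp layout used by the machine-controller JSON log lines above.
FMT = "%Y-%m-%dT%H:%M:%S.%fZ"


def lifetime(created: str, gone: str) -> timedelta:
    """Return the time elapsed between two machine-controller timestamps."""
    return datetime.strptime(gone, FMT) - datetime.strptime(created, FMT)


# Values copied from the "Created machine" and "Reconciling failed" lines.
delta = lifetime("2024-10-30T10:07:25.023Z", "2024-10-30T10:33:22.458Z")
print(delta)
```

The node therefore existed for just under 26 minutes.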

Now the OCCM last reported on this node at I1030 10:30:49.574364:

I1030 10:30:49.574364      12 instances.go:719] Node 'sck-production-scc-zhw-pool-ubuntu-jammy-1-7b7ffdfdd8-2422k' returns addresses '[{InternalIP 192.168.2.81}]'

It still counted 10 worker nodes at I1030 10:33:22.428119, but was already updating backends with only 9 nodes at I1030 10:33:22.428217:

I1030 10:33:22.428081      12 controller.go:733] Syncing backends for all LB services.
I1030 10:33:22.428119      12 controller.go:812] Running updateLoadBalancerHosts(len(services)==12, workers==10)
I1030 10:33:22.428190      12 controller.go:770] nodeSyncService started for service cluster-hrs2xpk8sk/front-loadbalancer
I1030 10:33:22.428217      12 controller.go:842] Updating backends for load balancer cluster-hrs2xpk8sk/front-loadbalancer with 9 nodes: [sck-production-scc-zhw-pool-ubuntu-jammy-0-64b85f689c-79gpv sck-production-scc-zhw-pool-ubuntu-jammy-0-64b85f689c-qq9sd sck-production-scc-zhw-pool-ubuntu-jammy-0-64b85f689c-tb52d sck-production-scc-zhw-pool-ubuntu-jammy-0-64b85f689c-xmd8x sck-production-scc-zhw-pool-ubuntu-jammy-0-64b85f689c-zz9z8 sck-production-scc-zhw-pool-ubuntu-jammy-1-7b7ffdfdd8-bvnjz sck-production-scc-zhw-pool-ubuntu-jammy-1-7b7ffdfdd8-h7thw sck-production-scc-zhw-pool-ubuntu-jammy-1-7b7ffdfdd8-nc2zw sck-production-scc-zhw-pool-ubuntu-jammy-2-7c99d48579-xkjwx]

All looks good afaict, until it starts deleting security group rules here:

I1030 10:34:07.406719      12 loadbalancer_sg.go:316] Deleting rule 48c429f0-64db-42ef-b535-c11c54461d3b from security group 226cc832-0b14-4a1a-a281-db03b60285ba (lb-sg-aebe6c94-9925-4c60-a492-a97a01dba02c-cluster-imvcnfz7nf-front-loadbalancer)
I1030 10:34:07.485631      12 loadbalancer_sg.go:316] Deleting rule 7ecb6942-dd30-4b2a-ad1a-7f85a1099739 from security group 226cc832-0b14-4a1a-a281-db03b60285ba (lb-sg-aebe6c94-9925-4c60-a492-a97a01dba02c-cluster-imvcnfz7nf-front-loadbalancer)
I1030 10:34:07.565073      12 loadbalancer_sg.go:316] Deleting rule 835b6d3c-2f13-4244-bcc2-a006792fe858 from security group 226cc832-0b14-4a1a-a281-db03b60285ba (lb-sg-aebe6c94-9925-4c60-a492-a97a01dba02c-cluster-imvcnfz7nf-front-loadbalancer)

Unfortunately the log, even at -v=5, does not tell us why it deletes those rules, nor any details about the rules themselves :(

This is as far as I got; I don't understand why the OCCM reacts like this.

judge-red commented 3 weeks ago

The log files are too large for GH and nobody likes zipped files, I think. So I uploaded the full logs of the mentioned containers here: https://drive.switch.ch/index.php/s/IMoXWTaJWrfdB5Q

Edit: the file names for the comment above are openstack-cloud-controller-manager-4xkm2.logs, cluster-autoscaler-75cf6b67b7-jmtdv.logs and machine-controller-55cb86757c-5p2ds.logs.

judge-red commented 3 weeks ago

@zetaab was kind enough to provide me with an image that adds this debug output: https://github.com/zetaab/cloud-provider-openstack/commit/caf8dd89bb814985acf8c719a9e688c8094214c3

And I finally found the time to confirm that this happens every time the cluster-autoscaler removes a worker node. Alright, here are the findings.

After my last comment, all security group rules were missing. Of course changing the DaemonSet's image led to a restart of all pods. As previously observed, restarting the pods leads to the missing security group rules being re-created. Here's the log output of that:

I1030 13:37:39.140755      11 loadbalancer_sg.go:202] Wanted rules: [{Direction:ingress Description: EtherType:IPv4 SecGroupID:182e61bb-8b6e-483b-afd3-aee6b27f3f81 PortRangeMax:30769 PortRangeMin:30769 Protocol:tcp RemoteGroupID: RemoteIPPrefix:0.0.0.0/0 ProjectID:} {Direction:ingress Description: EtherType:IPv4 SecGroupID:182e61bb-8b6e-483b-afd3-aee6b27f3f81 PortRangeMax:30855 PortRangeMin:30855 Protocol:tcp RemoteGroupID: RemoteIPPrefix:0.0.0.0/0 ProjectID:}]
I1030 13:37:39.140812      11 loadbalancer_sg.go:203] Existing rules: []
I1030 13:37:39.140823      11 loadbalancer_sg.go:204] Rules to create: [{Direction:ingress Description: EtherType:IPv4 SecGroupID:182e61bb-8b6e-483b-afd3-aee6b27f3f81 PortRangeMax:30769 PortRangeMin:30769 Protocol:tcp RemoteGroupID: RemoteIPPrefix:0.0.0.0/0 ProjectID:} {Direction:ingress Description: EtherType:IPv4 SecGroupID:182e61bb-8b6e-483b-afd3-aee6b27f3f81 PortRangeMax:30855 PortRangeMin:30855 Protocol:tcp RemoteGroupID: RemoteIPPrefix:0.0.0.0/0 ProjectID:}]
I1030 13:37:39.140842      11 loadbalancer_sg.go:205] Rules to delete: []
I1030 13:37:39.434827      11 loadbalancer_sg.go:202] Wanted rules: [{Direction:ingress Description: EtherType:IPv4 SecGroupID:58e54338-18cf-4256-a2b0-3a5ca04a5645 PortRangeMax:31698 PortRangeMin:31698 Protocol:tcp RemoteGroupID: RemoteIPPrefix:0.0.0.0/0 ProjectID:} {Direction:ingress Description: EtherType:IPv4 SecGroupID:58e54338-18cf-4256-a2b0-3a5ca04a5645 PortRangeMax:31783 PortRangeMin:31783 Protocol:tcp RemoteGroupID: RemoteIPPrefix:0.0.0.0/0 ProjectID:} {Direction:ingress Description: EtherType:IPv4 SecGroupID:58e54338-18cf-4256-a2b0-3a5ca04a5645 PortRangeMax:32599 PortRangeMin:32599 Protocol:tcp RemoteGroupID: RemoteIPPrefix:0.0.0.0/0 ProjectID:}]
I1030 13:37:39.434889      11 loadbalancer_sg.go:203] Existing rules: []
I1030 13:37:39.434896      11 loadbalancer_sg.go:204] Rules to create: [{Direction:ingress Description: EtherType:IPv4 SecGroupID:58e54338-18cf-4256-a2b0-3a5ca04a5645 PortRangeMax:31698 PortRangeMin:31698 Protocol:tcp RemoteGroupID: RemoteIPPrefix:0.0.0.0/0 ProjectID:} {Direction:ingress Description: EtherType:IPv4 SecGroupID:58e54338-18cf-4256-a2b0-3a5ca04a5645 PortRangeMax:31783 PortRangeMin:31783 Protocol:tcp RemoteGroupID: RemoteIPPrefix:0.0.0.0/0 ProjectID:} {Direction:ingress Description: EtherType:IPv4 SecGroupID:58e54338-18cf-4256-a2b0-3a5ca04a5645 PortRangeMax:32599 PortRangeMin:32599 Protocol:tcp RemoteGroupID: RemoteIPPrefix:0.0.0.0/0 ProjectID:}]
I1030 13:37:39.434911      11 loadbalancer_sg.go:205] Rules to delete: []
I1030 13:37:39.486346      11 loadbalancer_sg.go:202] Wanted rules: [{Direction:ingress Description: EtherType:IPv4 SecGroupID:2f45b6ee-79ad-4241-a8cf-d98ff91868df PortRangeMax:31556 PortRangeMin:31556 Protocol:tcp RemoteGroupID: RemoteIPPrefix:0.0.0.0/0 ProjectID:} {Direction:ingress Description: EtherType:IPv4 SecGroupID:2f45b6ee-79ad-4241-a8cf-d98ff91868df PortRangeMax:32756 PortRangeMin:32756 Protocol:tcp RemoteGroupID: RemoteIPPrefix:0.0.0.0/0 ProjectID:} {Direction:ingress Description: EtherType:IPv4 SecGroupID:2f45b6ee-79ad-4241-a8cf-d98ff91868df PortRangeMax:31805 PortRangeMin:31805 Protocol:tcp RemoteGroupID: RemoteIPPrefix:0.0.0.0/0 ProjectID:}]
I1030 13:37:39.486399      11 loadbalancer_sg.go:203] Existing rules: []
I1030 13:37:39.486433      11 loadbalancer_sg.go:204] Rules to create: [{Direction:ingress Description: EtherType:IPv4 SecGroupID:2f45b6ee-79ad-4241-a8cf-d98ff91868df PortRangeMax:31556 PortRangeMin:31556 Protocol:tcp RemoteGroupID: RemoteIPPrefix:0.0.0.0/0 ProjectID:} {Direction:ingress Description: EtherType:IPv4 SecGroupID:2f45b6ee-79ad-4241-a8cf-d98ff91868df PortRangeMax:32756 PortRangeMin:32756 Protocol:tcp RemoteGroupID: RemoteIPPrefix:0.0.0.0/0 ProjectID:} {Direction:ingress Description: EtherType:IPv4 SecGroupID:2f45b6ee-79ad-4241-a8cf-d98ff91868df PortRangeMax:31805 PortRangeMin:31805 Protocol:tcp RemoteGroupID: RemoteIPPrefix:0.0.0.0/0 ProjectID:}]
I1030 13:37:39.486476      11 loadbalancer_sg.go:205] Rules to delete: []

Afterwards, the security groups and rules looked as follows:

$ for sg in $(os security group list -f csv --quote none | grep 'lb-sg' | awk -F, '{print $1}'); do echo; echo "Security Group: ${sg}"; os security group rule list "${sg}" -f value; done

Security Group: 182e61bb-8b6e-483b-afd3-aee6b27f3f81
1a6bf7d2-eeb2-4a67-84dd-4a04629c525c tcp IPv4 0.0.0.0/0 30855:30855 ingress None None
2b11287c-7190-4bec-b3a2-ee928fbd0fc5 tcp IPv4 0.0.0.0/0 30769:30769 ingress None None

Security Group: 2f45b6ee-79ad-4241-a8cf-d98ff91868df
73c36713-1f09-4d51-9650-41596a8d4933 tcp IPv4 0.0.0.0/0 31556:31556 ingress None None
8f3e3b96-f85f-444e-8cd8-f48f8c4ce75f tcp IPv4 0.0.0.0/0 31805:31805 ingress None None
a2582eda-39e9-4ab7-90da-d4cc475c99e1 tcp IPv4 0.0.0.0/0 32756:32756 ingress None None

Security Group: 58e54338-18cf-4256-a2b0-3a5ca04a5645
479625f5-d5b4-4b93-9f23-d4c29692d53b tcp IPv4 0.0.0.0/0 31783:31783 ingress None None
4f432173-7399-4555-996a-f86aaf82ffe1 tcp IPv4 0.0.0.0/0 31698:31698 ingress None None
6937899d-2d7d-423b-aa66-2df5eb1be9fa tcp IPv4 0.0.0.0/0 32599:32599 ingress None None

Perfect.

Now I caused the cluster-autoscaler to add a worker node (sck-staging-scc-zhw-pool-ubuntu-jammy-1-9644f5d57-b7kwj). To my surprise, the added node triggered the OCCM to first delete all security group rules and then to seemingly re-create them all.

I1030 13:47:55.960261      11 loadbalancer_sg.go:202] Wanted rules: []
I1030 13:47:55.960309      11 loadbalancer_sg.go:203] Existing rules: [{ID:1a6bf7d2-eeb2-4a67-84dd-4a04629c525c Direction:ingress Description: EtherType:IPv4 SecGroupID:182e61bb-8b6e-483b-afd3-aee6b27f3f81 PortRangeMin:30855 PortRangeMax:30855 Protocol:tcp RemoteGroupID: RemoteIPPrefix:0.0.0.0/0 TenantID:ea5499fcb7304c3c834603504a0a8760 ProjectID:ea5499fcb7304c3c834603504a0a8760} {ID:2b11287c-7190-4bec-b3a2-ee928fbd0fc5 Direction:ingress Description: EtherType:IPv4 SecGroupID:182e61bb-8b6e-483b-afd3-aee6b27f3f81 PortRangeMin:30769 PortRangeMax:30769 Protocol:tcp RemoteGroupID: RemoteIPPrefix:0.0.0.0/0 TenantID:ea5499fcb7304c3c834603504a0a8760 ProjectID:ea5499fcb7304c3c834603504a0a8760}]
I1030 13:47:55.960333      11 loadbalancer_sg.go:204] Rules to create: []
I1030 13:47:55.960342      11 loadbalancer_sg.go:205] Rules to delete: [{ID:1a6bf7d2-eeb2-4a67-84dd-4a04629c525c Direction:ingress Description: EtherType:IPv4 SecGroupID:182e61bb-8b6e-483b-afd3-aee6b27f3f81 PortRangeMin:30855 PortRangeMax:30855 Protocol:tcp RemoteGroupID: RemoteIPPrefix:0.0.0.0/0 TenantID:ea5499fcb7304c3c834603504a0a8760 ProjectID:ea5499fcb7304c3c834603504a0a8760} {ID:2b11287c-7190-4bec-b3a2-ee928fbd0fc5 Direction:ingress Description: EtherType:IPv4 SecGroupID:182e61bb-8b6e-483b-afd3-aee6b27f3f81 PortRangeMin:30769 PortRangeMax:30769 Protocol:tcp RemoteGroupID: RemoteIPPrefix:0.0.0.0/0 TenantID:ea5499fcb7304c3c834603504a0a8760 ProjectID:ea5499fcb7304c3c834603504a0a8760}]
I1030 13:48:00.484478      11 loadbalancer_sg.go:202] Wanted rules: []
I1030 13:48:00.484546      11 loadbalancer_sg.go:203] Existing rules: [{ID:479625f5-d5b4-4b93-9f23-d4c29692d53b Direction:ingress Description: EtherType:IPv4 SecGroupID:58e54338-18cf-4256-a2b0-3a5ca04a5645 PortRangeMin:31783 PortRangeMax:31783 Protocol:tcp RemoteGroupID: RemoteIPPrefix:0.0.0.0/0 TenantID:ea5499fcb7304c3c834603504a0a8760 ProjectID:ea5499fcb7304c3c834603504a0a8760} {ID:4f432173-7399-4555-996a-f86aaf82ffe1 Direction:ingress Description: EtherType:IPv4 SecGroupID:58e54338-18cf-4256-a2b0-3a5ca04a5645 PortRangeMin:31698 PortRangeMax:31698 Protocol:tcp RemoteGroupID: RemoteIPPrefix:0.0.0.0/0 TenantID:ea5499fcb7304c3c834603504a0a8760 ProjectID:ea5499fcb7304c3c834603504a0a8760} {ID:6937899d-2d7d-423b-aa66-2df5eb1be9fa Direction:ingress Description: EtherType:IPv4 SecGroupID:58e54338-18cf-4256-a2b0-3a5ca04a5645 PortRangeMin:32599 PortRangeMax:32599 Protocol:tcp RemoteGroupID: RemoteIPPrefix:0.0.0.0/0 TenantID:ea5499fcb7304c3c834603504a0a8760 ProjectID:ea5499fcb7304c3c834603504a0a8760}]
I1030 13:48:00.484577      11 loadbalancer_sg.go:204] Rules to create: []
I1030 13:48:00.484584      11 loadbalancer_sg.go:205] Rules to delete: [{ID:479625f5-d5b4-4b93-9f23-d4c29692d53b Direction:ingress Description: EtherType:IPv4 SecGroupID:58e54338-18cf-4256-a2b0-3a5ca04a5645 PortRangeMin:31783 PortRangeMax:31783 Protocol:tcp RemoteGroupID: RemoteIPPrefix:0.0.0.0/0 TenantID:ea5499fcb7304c3c834603504a0a8760 ProjectID:ea5499fcb7304c3c834603504a0a8760} {ID:4f432173-7399-4555-996a-f86aaf82ffe1 Direction:ingress Description: EtherType:IPv4 SecGroupID:58e54338-18cf-4256-a2b0-3a5ca04a5645 PortRangeMin:31698 PortRangeMax:31698 Protocol:tcp RemoteGroupID: RemoteIPPrefix:0.0.0.0/0 TenantID:ea5499fcb7304c3c834603504a0a8760 ProjectID:ea5499fcb7304c3c834603504a0a8760} {ID:6937899d-2d7d-423b-aa66-2df5eb1be9fa Direction:ingress Description: EtherType:IPv4 SecGroupID:58e54338-18cf-4256-a2b0-3a5ca04a5645 PortRangeMin:32599 PortRangeMax:32599 Protocol:tcp RemoteGroupID: RemoteIPPrefix:0.0.0.0/0 TenantID:ea5499fcb7304c3c834603504a0a8760 ProjectID:ea5499fcb7304c3c834603504a0a8760}]
I1030 13:48:01.231827      11 loadbalancer_sg.go:202] Wanted rules: []
I1030 13:48:01.231909      11 loadbalancer_sg.go:203] Existing rules: [{ID:73c36713-1f09-4d51-9650-41596a8d4933 Direction:ingress Description: EtherType:IPv4 SecGroupID:2f45b6ee-79ad-4241-a8cf-d98ff91868df PortRangeMin:31556 PortRangeMax:31556 Protocol:tcp RemoteGroupID: RemoteIPPrefix:0.0.0.0/0 TenantID:ea5499fcb7304c3c834603504a0a8760 ProjectID:ea5499fcb7304c3c834603504a0a8760} {ID:8f3e3b96-f85f-444e-8cd8-f48f8c4ce75f Direction:ingress Description: EtherType:IPv4 SecGroupID:2f45b6ee-79ad-4241-a8cf-d98ff91868df PortRangeMin:31805 PortRangeMax:31805 Protocol:tcp RemoteGroupID: RemoteIPPrefix:0.0.0.0/0 TenantID:ea5499fcb7304c3c834603504a0a8760 ProjectID:ea5499fcb7304c3c834603504a0a8760} {ID:a2582eda-39e9-4ab7-90da-d4cc475c99e1 Direction:ingress Description: EtherType:IPv4 SecGroupID:2f45b6ee-79ad-4241-a8cf-d98ff91868df PortRangeMin:32756 PortRangeMax:32756 Protocol:tcp RemoteGroupID: RemoteIPPrefix:0.0.0.0/0 TenantID:ea5499fcb7304c3c834603504a0a8760 ProjectID:ea5499fcb7304c3c834603504a0a8760}]
I1030 13:48:01.231959      11 loadbalancer_sg.go:204] Rules to create: []
I1030 13:48:01.232072      11 loadbalancer_sg.go:205] Rules to delete: [{ID:73c36713-1f09-4d51-9650-41596a8d4933 Direction:ingress Description: EtherType:IPv4 SecGroupID:2f45b6ee-79ad-4241-a8cf-d98ff91868df PortRangeMin:31556 PortRangeMax:31556 Protocol:tcp RemoteGroupID: RemoteIPPrefix:0.0.0.0/0 TenantID:ea5499fcb7304c3c834603504a0a8760 ProjectID:ea5499fcb7304c3c834603504a0a8760} {ID:8f3e3b96-f85f-444e-8cd8-f48f8c4ce75f Direction:ingress Description: EtherType:IPv4 SecGroupID:2f45b6ee-79ad-4241-a8cf-d98ff91868df PortRangeMin:31805 PortRangeMax:31805 Protocol:tcp RemoteGroupID: RemoteIPPrefix:0.0.0.0/0 TenantID:ea5499fcb7304c3c834603504a0a8760 ProjectID:ea5499fcb7304c3c834603504a0a8760} {ID:a2582eda-39e9-4ab7-90da-d4cc475c99e1 Direction:ingress Description: EtherType:IPv4 SecGroupID:2f45b6ee-79ad-4241-a8cf-d98ff91868df PortRangeMin:32756 PortRangeMax:32756 Protocol:tcp RemoteGroupID: RemoteIPPrefix:0.0.0.0/0 TenantID:ea5499fcb7304c3c834603504a0a8760 ProjectID:ea5499fcb7304c3c834603504a0a8760}]
I1030 13:48:03.260913      11 loadbalancer_sg.go:202] Wanted rules: [{Direction:ingress Description: EtherType:IPv4 SecGroupID:182e61bb-8b6e-483b-afd3-aee6b27f3f81 PortRangeMax:30769 PortRangeMin:30769 Protocol:tcp RemoteGroupID: RemoteIPPrefix:0.0.0.0/0 ProjectID:} {Direction:ingress Description: EtherType:IPv4 SecGroupID:182e61bb-8b6e-483b-afd3-aee6b27f3f81 PortRangeMax:30855 PortRangeMin:30855 Protocol:tcp RemoteGroupID: RemoteIPPrefix:0.0.0.0/0 ProjectID:}]
I1030 13:48:03.261024      11 loadbalancer_sg.go:203] Existing rules: []
I1030 13:48:03.261040      11 loadbalancer_sg.go:204] Rules to create: [{Direction:ingress Description: EtherType:IPv4 SecGroupID:182e61bb-8b6e-483b-afd3-aee6b27f3f81 PortRangeMax:30769 PortRangeMin:30769 Protocol:tcp RemoteGroupID: RemoteIPPrefix:0.0.0.0/0 ProjectID:} {Direction:ingress Description: EtherType:IPv4 SecGroupID:182e61bb-8b6e-483b-afd3-aee6b27f3f81 PortRangeMax:30855 PortRangeMin:30855 Protocol:tcp RemoteGroupID: RemoteIPPrefix:0.0.0.0/0 ProjectID:}]
I1030 13:48:03.261100      11 loadbalancer_sg.go:205] Rules to delete: []
I1030 13:48:03.449185      11 loadbalancer_sg.go:202] Wanted rules: []
I1030 13:48:03.449268      11 loadbalancer_sg.go:203] Existing rules: [{ID:dffb4db5-0071-4ef2-935e-51d12b7d6752 Direction:ingress Description: EtherType:IPv4 SecGroupID:182e61bb-8b6e-483b-afd3-aee6b27f3f81 PortRangeMin:30855 PortRangeMax:30855 Protocol:tcp RemoteGroupID: RemoteIPPrefix:0.0.0.0/0 TenantID:ea5499fcb7304c3c834603504a0a8760 ProjectID:ea5499fcb7304c3c834603504a0a8760} {ID:e20f100b-4c74-4d56-885a-4c5239864cb0 Direction:ingress Description: EtherType:IPv4 SecGroupID:182e61bb-8b6e-483b-afd3-aee6b27f3f81 PortRangeMin:30769 PortRangeMax:30769 Protocol:tcp RemoteGroupID: RemoteIPPrefix:0.0.0.0/0 TenantID:ea5499fcb7304c3c834603504a0a8760 ProjectID:ea5499fcb7304c3c834603504a0a8760}]
I1030 13:48:03.449316      11 loadbalancer_sg.go:204] Rules to create: []
I1030 13:48:03.449352      11 loadbalancer_sg.go:205] Rules to delete: [{ID:dffb4db5-0071-4ef2-935e-51d12b7d6752 Direction:ingress Description: EtherType:IPv4 SecGroupID:182e61bb-8b6e-483b-afd3-aee6b27f3f81 PortRangeMin:30855 PortRangeMax:30855 Protocol:tcp RemoteGroupID: RemoteIPPrefix:0.0.0.0/0 TenantID:ea5499fcb7304c3c834603504a0a8760 ProjectID:ea5499fcb7304c3c834603504a0a8760} {ID:e20f100b-4c74-4d56-885a-4c5239864cb0 Direction:ingress Description: EtherType:IPv4 SecGroupID:182e61bb-8b6e-483b-afd3-aee6b27f3f81 PortRangeMin:30769 PortRangeMax:30769 Protocol:tcp RemoteGroupID: RemoteIPPrefix:0.0.0.0/0 TenantID:ea5499fcb7304c3c834603504a0a8760 ProjectID:ea5499fcb7304c3c834603504a0a8760}]
I1030 13:48:03.645221      11 loadbalancer_sg.go:202] Wanted rules: []
I1030 13:48:03.645236      11 loadbalancer_sg.go:203] Existing rules: []
I1030 13:48:03.645241      11 loadbalancer_sg.go:204] Rules to create: []
I1030 13:48:03.645244      11 loadbalancer_sg.go:205] Rules to delete: []
I1030 13:48:03.787522      11 loadbalancer_sg.go:202] Wanted rules: [{Direction:ingress Description: EtherType:IPv4 SecGroupID:2f45b6ee-79ad-4241-a8cf-d98ff91868df PortRangeMax:31556 PortRangeMin:31556 Protocol:tcp RemoteGroupID: RemoteIPPrefix:0.0.0.0/0 ProjectID:} {Direction:ingress Description: EtherType:IPv4 SecGroupID:2f45b6ee-79ad-4241-a8cf-d98ff91868df PortRangeMax:32756 PortRangeMin:32756 Protocol:tcp RemoteGroupID: RemoteIPPrefix:0.0.0.0/0 ProjectID:} {Direction:ingress Description: EtherType:IPv4 SecGroupID:2f45b6ee-79ad-4241-a8cf-d98ff91868df PortRangeMax:31805 PortRangeMin:31805 Protocol:tcp RemoteGroupID: RemoteIPPrefix:0.0.0.0/0 ProjectID:}]
I1030 13:48:03.787579      11 loadbalancer_sg.go:203] Existing rules: []
I1030 13:48:03.787589      11 loadbalancer_sg.go:204] Rules to create: [{Direction:ingress Description: EtherType:IPv4 SecGroupID:2f45b6ee-79ad-4241-a8cf-d98ff91868df PortRangeMax:31556 PortRangeMin:31556 Protocol:tcp RemoteGroupID: RemoteIPPrefix:0.0.0.0/0 ProjectID:} {Direction:ingress Description: EtherType:IPv4 SecGroupID:2f45b6ee-79ad-4241-a8cf-d98ff91868df PortRangeMax:32756 PortRangeMin:32756 Protocol:tcp RemoteGroupID: RemoteIPPrefix:0.0.0.0/0 ProjectID:} {Direction:ingress Description: EtherType:IPv4 SecGroupID:2f45b6ee-79ad-4241-a8cf-d98ff91868df PortRangeMax:31805 PortRangeMin:31805 Protocol:tcp RemoteGroupID: RemoteIPPrefix:0.0.0.0/0 ProjectID:}]
I1030 13:48:03.787691      11 loadbalancer_sg.go:205] Rules to delete: []
I1030 13:48:03.906847      11 loadbalancer_sg.go:202] Wanted rules: []
I1030 13:48:03.906878      11 loadbalancer_sg.go:203] Existing rules: []
I1030 13:48:03.906884      11 loadbalancer_sg.go:204] Rules to create: []
I1030 13:48:03.906888      11 loadbalancer_sg.go:205] Rules to delete: []
I1030 13:48:04.149560      11 loadbalancer_sg.go:202] Wanted rules: [{Direction:ingress Description: EtherType:IPv4 SecGroupID:58e54338-18cf-4256-a2b0-3a5ca04a5645 PortRangeMax:31698 PortRangeMin:31698 Protocol:tcp RemoteGroupID: RemoteIPPrefix:0.0.0.0/0 ProjectID:} {Direction:ingress Description: EtherType:IPv4 SecGroupID:58e54338-18cf-4256-a2b0-3a5ca04a5645 PortRangeMax:31783 PortRangeMin:31783 Protocol:tcp RemoteGroupID: RemoteIPPrefix:0.0.0.0/0 ProjectID:} {Direction:ingress Description: EtherType:IPv4 SecGroupID:58e54338-18cf-4256-a2b0-3a5ca04a5645 PortRangeMax:32599 PortRangeMin:32599 Protocol:tcp RemoteGroupID: RemoteIPPrefix:0.0.0.0/0 ProjectID:}]
I1030 13:48:04.149655      11 loadbalancer_sg.go:203] Existing rules: []
I1030 13:48:04.149686      11 loadbalancer_sg.go:204] Rules to create: [{Direction:ingress Description: EtherType:IPv4 SecGroupID:58e54338-18cf-4256-a2b0-3a5ca04a5645 PortRangeMax:31698 PortRangeMin:31698 Protocol:tcp RemoteGroupID: RemoteIPPrefix:0.0.0.0/0 ProjectID:} {Direction:ingress Description: EtherType:IPv4 SecGroupID:58e54338-18cf-4256-a2b0-3a5ca04a5645 PortRangeMax:31783 PortRangeMin:31783 Protocol:tcp RemoteGroupID: RemoteIPPrefix:0.0.0.0/0 ProjectID:} {Direction:ingress Description: EtherType:IPv4 SecGroupID:58e54338-18cf-4256-a2b0-3a5ca04a5645 PortRangeMax:32599 PortRangeMin:32599 Protocol:tcp RemoteGroupID: RemoteIPPrefix:0.0.0.0/0 ProjectID:}]
I1030 13:48:04.149729      11 loadbalancer_sg.go:205] Rules to delete: []

This is a concern, as it means that the LBs were briefly inaccessible. This time it was just a few seconds, but there are only 3 LBs. We expect to have dozens of LBs per cluster in the future.

Checking the SGs and SG rules now reveals something even worse: apparently not every SG's rules were re-created after deletion:
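The "Rules to create" and "Rules to delete" lines above are consistent with a plain set difference between the wanted and the existing rules. The following Python sketch illustrates that idea only; it is not taken from the OCCM source, and the `Rule` type and `diff_rules` function are hypothetical names. The sample IDs and ports are the ones logged for security group 182e61bb-8b6e-483b-afd3-aee6b27f3f81.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    """Fields that identify a rule, mirroring those printed in the debug log."""
    direction: str
    ether_type: str
    protocol: str
    port_min: int
    port_max: int
    remote_ip_prefix: str


def diff_rules(wanted, existing):
    """Compare wanted rules with existing ones.

    wanted:   list of Rule
    existing: dict mapping rule ID -> Rule
    Returns (rules to create, IDs of rules to delete).
    """
    wanted_set = set(wanted)
    existing_set = set(existing.values())
    to_create = [r for r in wanted if r not in existing_set]
    to_delete = [rid for rid, r in existing.items() if r not in wanted_set]
    return to_create, to_delete


existing = {
    "1a6bf7d2-eeb2-4a67-84dd-4a04629c525c": Rule("ingress", "IPv4", "tcp", 30855, 30855, "0.0.0.0/0"),
    "2b11287c-7190-4bec-b3a2-ee928fbd0fc5": Rule("ingress", "IPv4", "tcp", 30769, 30769, "0.0.0.0/0"),
}

# An empty wanted list marks every existing rule for deletion.
to_create, to_delete = diff_rules([], existing)
print(to_create, to_delete)
```

Under this model an empty "Wanted rules" list is enough to remove every rule of a group, which matches the log. Why the wanted list ends up empty is what remains unexplained.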

$ for sg in $(os security group list -f csv --quote none | grep 'lb-sg' | awk -F, '{print $1}'); do echo; echo "Security Group: ${sg}"; os security group rule list "${sg}" -f value; done

Security Group: 182e61bb-8b6e-483b-afd3-aee6b27f3f81

Security Group: 2f45b6ee-79ad-4241-a8cf-d98ff91868df
3b974375-5461-4a08-a436-80b2f1ea2e13 tcp IPv4 0.0.0.0/0 31805:31805 ingress None None
596c3dc3-0a46-48db-a089-fbd8ce6a1d35 tcp IPv4 0.0.0.0/0 32756:32756 ingress None None
d00458bd-8b92-47aa-8501-a09e9ce1af06 tcp IPv4 0.0.0.0/0 31556:31556 ingress None None

Security Group: 58e54338-18cf-4256-a2b0-3a5ca04a5645
069a6bd1-76e5-4415-ac1a-b7ddbbbd76ef tcp IPv4 0.0.0.0/0 31783:31783 ingress None None
b2ad411f-a88a-437e-9f0d-1a6f3d7fa0e2 tcp IPv4 0.0.0.0/0 31698:31698 ingress None None
ccf19fe1-5969-4468-a51b-b232bd10994d tcp IPv4 0.0.0.0/0 32599:32599 ingress None None

I'm assuming this is the same bug that this issue is about, but both these findings are new and the issue title is not fully correct anymore as not all rules (of all SGs) were removed.

Anyway, I caused the cluster-autoscaler to remove the same node sck-staging-scc-zhw-pool-ubuntu-jammy-1-9644f5d57-b7kwj again.

I1030 14:17:27.352636      11 loadbalancer_sg.go:202] Wanted rules: []
I1030 14:17:27.352657      11 loadbalancer_sg.go:203] Existing rules: []
I1030 14:17:27.352664      11 loadbalancer_sg.go:204] Rules to create: []
I1030 14:17:27.352668      11 loadbalancer_sg.go:205] Rules to delete: []
I1030 14:17:35.110437      11 loadbalancer_sg.go:202] Wanted rules: []
I1030 14:17:35.110460      11 loadbalancer_sg.go:203] Existing rules: [{ID:069a6bd1-76e5-4415-ac1a-b7ddbbbd76ef Direction:ingress Description: EtherType:IPv4 SecGroupID:58e54338-18cf-4256-a2b0-3a5ca04a5645 PortRangeMin:31783 PortRangeMax:31783 Protocol:tcp RemoteGroupID: RemoteIPPrefix:0.0.0.0/0 TenantID:ea5499fcb7304c3c834603504a0a8760 ProjectID:ea5499fcb7304c3c834603504a0a8760} {ID:b2ad411f-a88a-437e-9f0d-1a6f3d7fa0e2 Direction:ingress Description: EtherType:IPv4 SecGroupID:58e54338-18cf-4256-a2b0-3a5ca04a5645 PortRangeMin:31698 PortRangeMax:31698 Protocol:tcp RemoteGroupID: RemoteIPPrefix:0.0.0.0/0 TenantID:ea5499fcb7304c3c834603504a0a8760 ProjectID:ea5499fcb7304c3c834603504a0a8760} {ID:ccf19fe1-5969-4468-a51b-b232bd10994d Direction:ingress Description: EtherType:IPv4 SecGroupID:58e54338-18cf-4256-a2b0-3a5ca04a5645 PortRangeMin:32599 PortRangeMax:32599 Protocol:tcp RemoteGroupID: RemoteIPPrefix:0.0.0.0/0 TenantID:ea5499fcb7304c3c834603504a0a8760 ProjectID:ea5499fcb7304c3c834603504a0a8760}]
I1030 14:17:35.110573      11 loadbalancer_sg.go:204] Rules to create: []
I1030 14:17:35.110860      11 loadbalancer_sg.go:205] Rules to delete: [{ID:069a6bd1-76e5-4415-ac1a-b7ddbbbd76ef Direction:ingress Description: EtherType:IPv4 SecGroupID:58e54338-18cf-4256-a2b0-3a5ca04a5645 PortRangeMin:31783 PortRangeMax:31783 Protocol:tcp RemoteGroupID: RemoteIPPrefix:0.0.0.0/0 TenantID:ea5499fcb7304c3c834603504a0a8760 ProjectID:ea5499fcb7304c3c834603504a0a8760} {ID:b2ad411f-a88a-437e-9f0d-1a6f3d7fa0e2 Direction:ingress Description: EtherType:IPv4 SecGroupID:58e54338-18cf-4256-a2b0-3a5ca04a5645 PortRangeMin:31698 PortRangeMax:31698 Protocol:tcp RemoteGroupID: RemoteIPPrefix:0.0.0.0/0 TenantID:ea5499fcb7304c3c834603504a0a8760 ProjectID:ea5499fcb7304c3c834603504a0a8760} {ID:ccf19fe1-5969-4468-a51b-b232bd10994d Direction:ingress Description: EtherType:IPv4 SecGroupID:58e54338-18cf-4256-a2b0-3a5ca04a5645 PortRangeMin:32599 PortRangeMax:32599 Protocol:tcp RemoteGroupID: RemoteIPPrefix:0.0.0.0/0 TenantID:ea5499fcb7304c3c834603504a0a8760 ProjectID:ea5499fcb7304c3c834603504a0a8760}]
I1030 14:17:39.110242      11 loadbalancer_sg.go:202] Wanted rules: []
I1030 14:17:39.110332      11 loadbalancer_sg.go:203] Existing rules: [{ID:3b974375-5461-4a08-a436-80b2f1ea2e13 Direction:ingress Description: EtherType:IPv4 SecGroupID:2f45b6ee-79ad-4241-a8cf-d98ff91868df PortRangeMin:31805 PortRangeMax:31805 Protocol:tcp RemoteGroupID: RemoteIPPrefix:0.0.0.0/0 TenantID:ea5499fcb7304c3c834603504a0a8760 ProjectID:ea5499fcb7304c3c834603504a0a8760} {ID:596c3dc3-0a46-48db-a089-fbd8ce6a1d35 Direction:ingress Description: EtherType:IPv4 SecGroupID:2f45b6ee-79ad-4241-a8cf-d98ff91868df PortRangeMin:32756 PortRangeMax:32756 Protocol:tcp RemoteGroupID: RemoteIPPrefix:0.0.0.0/0 TenantID:ea5499fcb7304c3c834603504a0a8760 ProjectID:ea5499fcb7304c3c834603504a0a8760} {ID:d00458bd-8b92-47aa-8501-a09e9ce1af06 Direction:ingress Description: EtherType:IPv4 SecGroupID:2f45b6ee-79ad-4241-a8cf-d98ff91868df PortRangeMin:31556 PortRangeMax:31556 Protocol:tcp RemoteGroupID: RemoteIPPrefix:0.0.0.0/0 TenantID:ea5499fcb7304c3c834603504a0a8760 ProjectID:ea5499fcb7304c3c834603504a0a8760}]
I1030 14:17:39.110383      11 loadbalancer_sg.go:204] Rules to create: []
I1030 14:17:39.110455      11 loadbalancer_sg.go:205] Rules to delete: [{ID:3b974375-5461-4a08-a436-80b2f1ea2e13 Direction:ingress Description: EtherType:IPv4 SecGroupID:2f45b6ee-79ad-4241-a8cf-d98ff91868df PortRangeMin:31805 PortRangeMax:31805 Protocol:tcp RemoteGroupID: RemoteIPPrefix:0.0.0.0/0 TenantID:ea5499fcb7304c3c834603504a0a8760 ProjectID:ea5499fcb7304c3c834603504a0a8760} {ID:596c3dc3-0a46-48db-a089-fbd8ce6a1d35 Direction:ingress Description: EtherType:IPv4 SecGroupID:2f45b6ee-79ad-4241-a8cf-d98ff91868df PortRangeMin:32756 PortRangeMax:32756 Protocol:tcp RemoteGroupID: RemoteIPPrefix:0.0.0.0/0 TenantID:ea5499fcb7304c3c834603504a0a8760 ProjectID:ea5499fcb7304c3c834603504a0a8760} {ID:d00458bd-8b92-47aa-8501-a09e9ce1af06 Direction:ingress Description: EtherType:IPv4 SecGroupID:2f45b6ee-79ad-4241-a8cf-d98ff91868df PortRangeMin:31556 PortRangeMax:31556 Protocol:tcp RemoteGroupID: RemoteIPPrefix:0.0.0.0/0 TenantID:ea5499fcb7304c3c834603504a0a8760 ProjectID:ea5499fcb7304c3c834603504a0a8760}]

I'm not sure what the first "group" of outputs was about, where wanted/existing/create/delete were all []. Afterwards all SG rules were deleted again, but this time they were not re-created. Unsurprisingly, our SGs and SG rules now look as follows:

$ for sg in $(os security group list -f csv --quote none | grep 'lb-sg' | awk -F, '{print $1}'); do echo; echo "Security Group: ${sg}"; os security group rule list "${sg}" -f value; done

Security Group: 182e61bb-8b6e-483b-afd3-aee6b27f3f81

Security Group: 2f45b6ee-79ad-4241-a8cf-d98ff91868df

Security Group: 58e54338-18cf-4256-a2b0-3a5ca04a5645

In other words: no SG has any rules anymore.

However, I'm none the wiser as to why they are deleted at all, or why they're not re-created. Restarting the DaemonSet will recreate them, I'm sure, as seen at the beginning of this comment.

Edit: the other debug log line, NodePort not found, was not logged once.

Edit 2: I've added the log where the above snippets come from to the previously linked folder. Name openstack-cloud-controller-manager-spdkb.logs.
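Since the debug lines are long, it may help to condense them into one summary per reconcile pass when reading the full log. Below is a minimal Python sketch that assumes the four-line format shown in this comment (Wanted rules, Existing rules, Rules to create, Rules to delete). `summarize` and `is_wipe` are hypothetical helper names, and the sample input is the all-empty group quoted above.

```python
import re

# Matches the four debug lines emitted by the patched loadbalancer_sg.go.
LINE_RE = re.compile(
    r"^I\d{4} (?P<time>\d{2}:\d{2}:\d{2}\.\d+)\s+\d+ loadbalancer_sg\.go:\d+\] "
    r"(?P<kind>Wanted rules|Existing rules|Rules to create|Rules to delete): "
    r"\[(?P<body>.*)\]$"
)
SG_RE = re.compile(r"SecGroupID:([0-9a-f-]+)")


def summarize(lines):
    """Group the debug lines per reconcile pass and count the rules in each."""
    passes = []
    current = None
    for line in lines:
        m = LINE_RE.match(line.strip())
        if not m:
            continue
        if m["kind"] == "Wanted rules":
            # A "Wanted rules" line starts a new pass.
            current = {"time": m["time"], "sg": None}
            passes.append(current)
        if current is None:
            continue
        # Each rule is printed as one {...} struct, so count opening braces.
        current[m["kind"]] = m["body"].count("{")
        sg = SG_RE.search(m["body"])
        if sg and current["sg"] is None:
            current["sg"] = sg.group(1)
    return passes


def is_wipe(p):
    """True when nothing is wanted but existing rules get deleted."""
    return p.get("Wanted rules") == 0 and p.get("Rules to delete", 0) > 0


sample = [
    "I1030 14:17:27.352636      11 loadbalancer_sg.go:202] Wanted rules: []",
    "I1030 14:17:27.352657      11 loadbalancer_sg.go:203] Existing rules: []",
    "I1030 14:17:27.352664      11 loadbalancer_sg.go:204] Rules to create: []",
    "I1030 14:17:27.352668      11 loadbalancer_sg.go:205] Rules to delete: []",
]
for p in summarize(sample):
    print(p, "WIPE" if is_wipe(p) else "")
```

Passes in which nothing is wanted, nothing exists and nothing is deleted carry no SecGroupID in these lines, so the sketch leaves `sg` unset for them.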

zetaab commented 3 weeks ago

I cannot reproduce this with Octavia + haproxy (amphora). However, I see another bug: OCCM assumes that all nodes run in a single subnet. In our case we have 3 subnets.