netscaler / netscaler-k8s-ingress-controller

NetScaler Ingress Controller for Kubernetes:
https://developer-docs.citrix.com/projects/citrix-k8s-ingress-controller/en/latest/

IPAM / Ingress Controller keep deleting each other's VIPs, causing service disruptions #409

Closed lhw closed 3 years ago

lhw commented 3 years ago

Describe the bug An address allocated for a LoadBalancer service is created and deleted dozens of times a second by the interaction between the IPAM controller and the ingress controller.

To Reproduce

  1. Add a Service of type LoadBalancer
  2. Wait for provisioning
  3. Chaos ensues
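For reference, a minimal Service of the kind that triggers this. The names are illustrative, and the annotation key is an assumption reconstructed from the `'com/class': 'intern'` fragment visible in the logs below; it should line up with the `serviceClass: [intern]` helm value:

```yaml
# Hypothetical minimal reproducer; names and annotation key are illustrative.
apiVersion: v1
kind: Service
metadata:
  name: intern-nginx-ingress
  namespace: kube-system
  annotations:
    # assumed to match the `serviceClass: [intern]` helm value of one CIC
    service.citrix.com/class: "intern"
spec:
  type: LoadBalancer
  selector:
    app: nginx-ingress
  ports:
  - name: http
    port: 80
    targetPort: 80
  - name: https
    port: 443
    targetPort: 443
```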

Expected behavior

And I almost consider the last point optional at this point

Logs kubectl logs: https://gist.github.com/lhw/76ef70823251bea2db202d51de951f07

Kubernetes service:

intern-nginx-ingress                                 LoadBalancer   10.233.15.65    172.31.203.225   80:31963/TCP,443:31732/TCP     41s
intern-nginx-ingress                                 LoadBalancer   10.233.15.65    <pending>        80:31963/TCP,443:31732/TCP     41s
intern-nginx-ingress                                 LoadBalancer   10.233.15.65    172.31.203.225   80:31963/TCP,443:31732/TCP     41s
intern-nginx-ingress                                 LoadBalancer   10.233.15.65    <pending>        80:31963/TCP,443:31732/TCP     41s
intern-nginx-ingress                                 LoadBalancer   10.233.15.65    172.31.203.225   80:31963/TCP,443:31732/TCP     43s
intern-nginx-ingress                                 LoadBalancer   10.233.15.65    <pending>        80:31963/TCP,443:31732/TCP     43s
intern-nginx-ingress                                 LoadBalancer   10.233.15.65    172.31.203.225   80:31963/TCP,443:31732/TCP     43s
intern-nginx-ingress                                 LoadBalancer   10.233.15.65    <pending>        80:31963/TCP,443:31732/TCP     43s
intern-nginx-ingress                                 LoadBalancer   10.233.15.65    172.31.203.225   80:31963/TCP,443:31732/TCP     44s
intern-nginx-ingress                                 LoadBalancer   10.233.15.65    <pending>        80:31963/TCP,443:31732/TCP     44s
intern-nginx-ingress                                 LoadBalancer   10.233.15.65    172.31.203.225   80:31963/TCP,443:31732/TCP     44s
intern-nginx-ingress                                 LoadBalancer   10.233.15.65    <pending>        80:31963/TCP,443:31732/TCP     44s
intern-nginx-ingress                                 LoadBalancer   10.233.15.65    172.31.203.225   80:31963/TCP,443:31732/TCP     46s
intern-nginx-ingress                                 LoadBalancer   10.233.15.65    <pending>        80:31963/TCP,443:31732/TCP     46s
intern-nginx-ingress                                 LoadBalancer   10.233.15.65    172.31.203.225   80:31963/TCP,443:31732/TCP     46s
intern-nginx-ingress                                 LoadBalancer   10.233.15.65    <pending>        80:31963/TCP,443:31732/TCP     46s
intern-nginx-ingress                                 LoadBalancer   10.233.15.65    172.31.203.225   80:31963/TCP,443:31732/TCP     47s
intern-nginx-ingress                                 LoadBalancer   10.233.15.65    <pending>        80:31963/TCP,443:31732/TCP     47s
intern-nginx-ingress                                 LoadBalancer   10.233.15.65    172.31.203.225   80:31963/TCP,443:31732/TCP     47s
intern-nginx-ingress                                 LoadBalancer   10.233.15.65    <pending>        80:31963/TCP,443:31732/TCP     47s
intern-nginx-ingress                                 LoadBalancer   10.233.15.65    172.31.203.225   80:31963/TCP,443:31732/TCP     49s
intern-nginx-ingress                                 LoadBalancer   10.233.15.65    <pending>        80:31963/TCP,443:31732/TCP     49s

Additional context The helm values for the CIC chart are below.

# helm values cic chart
adcCredentialSecret: netscaler-login
crds:
  install: true
  retainOnDelete: true
defaultSSLCertSecret: pki-default-certificate
entityPrefix: stg
exporter:
  required: false
image: quay.io/citrix/citrix-k8s-ingress-controller:1.13.15
ingressClass:
- intern
ipam: true
license:
  accept: "yes"
logLevel: info
nodeSelector:
  key: node-role.kubernetes.io/master
  value: ""
nodeWatch: false
nsHTTP2ServerSide: "ON"
nsIP: 172.31.102.23
nsProtocol: https
nsVIP: 172.31.203.22
podIPsforServiceGroupMembers: true
serviceClass:
- intern
tolerations:
- effect: NoSchedule
  key: node-role.kubernetes.io/master
  operator: Equal
updateIngressStatus: true
lhw commented 3 years ago

Deleting the LoadBalancer service does not stop the endless loop. Instead it pops up a few more errors and continues:

2021-05-11 14:28:43,135  - INFO - [nitrointerface.py:_delete_nsapp_cs_vserver:1782] (MainThread) csvserver stg-intern-nginx-ingress_80_kube-system_svc is deleted successfully
2021-05-11 14:28:43,135  - INFO - [nitrointerface.py:delete_nsapp:3763] (MainThread) Deleting application: stg-intern-nginx-ingress_443_kube-system_svc  LB Role: server
2021-05-11 14:28:43,203  - INFO - [nitrointerface.py:_unbind_default_cs_policy:3204] (MainThread) stg-intern-nginx-ingress_443_lbv_jzp5b6b4tfpli5qcnbgtl7v3zmqtzjzk lbvserver unbind from stg-intern-nginx-ingress_443_kube-system_svc csvserver is successful
2021-05-11 14:28:43,242  - INFO - [nitrointerface.py:_delete_nsapp_service_group:1488] (MainThread) servicegroup stg-intern-nginx-ingress_443_sgp_jzp5b6b4tfpli5qcnbgtl7v3zmqtzjzk is deleted successfully
2021-05-11 14:28:43,291  - INFO - [referencemanager.py:process_unmanaged_delete_event:1053] (MainThread) Deleting Unmanaged entity: kube-system.lbvserver.intern-nginx-ingress - stg-intern-nginx-ingress_443_lbv_jzp5b6b4tfpli5qcnbgtl7v3zmqtzjzk
2021-05-11 14:28:43,334  - INFO - [nitrointerface.py:_delete_nsapp_vserver:1258] (MainThread) LBvserver stg-intern-nginx-ingress_443_lbv_jzp5b6b4tfpli5qcnbgtl7v3zmqtzjzk is deleted successfully
2021-05-11 14:28:43,334  - INFO - [referencemanager.py:process_unmanaged_delete_event:1053] (MainThread) Deleting Unmanaged entity: kube-system.csvserver_lbsvc.intern-nginx-ingress - stg-intern-nginx-ingress_443_kube-system_svc
2021-05-11 14:28:43,425  - INFO - [nitrointerface.py:_delete_nsapp_cs_vserver:1782] (MainThread) csvserver stg-intern-nginx-ingress_443_kube-system_svc is deleted successfully
2021-05-11 14:28:43,436  - INFO - [clienthelper.py:get:49] (MainThread) Resource not found: /services/intern-nginx-ingress namespace kube-system
2021-05-11 14:28:43,436  - ERROR - [customresourcecontroller.py:event_handler:232] (MainThread) FAILURE: DELIVERING CRD event: Exception "local variable 'crd_name' referenced before assignment" while handling event for crd service-intern-nginx-ingress.kube-system of kind vip
2021-05-11 14:28:43,453  - INFO - [clienthelper.py:get:49] (MainThread) Resource not found: /endpoints/intern-nginx-ingress namespace kube-system
2021-05-11 14:28:43,453  - INFO - [kubernetes.py:get_endpoints_for_service:2541] (MainThread) Failed to get endpoints list for the app intern-nginx-ingress
2021-05-11 14:28:43,454  - INFO - [kubernetes.py:update_cpx_for_apps:4410] (MainThread) Handling Type LoadBalancer Service Modification intern-nginx-ingress.kube-system
2021-05-11 14:28:43,454  - INFO - [kubernetes.py:kubernetes_service_to_nsapps:2758] (MainThread) Handling Service creation/Modification intern-nginx-ingress.kube-system
2021-05-11 14:28:43,454  - INFO - [kubernetes.py:kubernetes_service_to_nsapps:2991] (MainThread) Configuring Type LoadBalancer Service intern-nginx-ingress:kube-system port params:{'name': 'http', 'protocol': 'tcp', 'port': 80, 'targetPort': 80, 'nodePort': 31219, 'vip': '172.31.203.231', 'com/class': 'intern', 'stylebook': None, 'sslcert': {}, 'range-name': None, 'stylebook_params': {}, 'stylebook_service_params': {}}
2021-05-11 14:28:43,454  - INFO - [kubernetes.py:kubernetes_service_to_nsapps:2997] (MainThread) Updating the LoadBalancer service kube-system:intern-nginx-ingress status with IP:172.31.203.231
2021-05-11 14:28:43,465  - INFO - [clienthelper.py:patch:73] (MainThread) Got status code 404, Resource not found: API: /services/intern-nginx-ingress/status namespace kube-system
2021-05-11 14:28:43,487  - INFO - [clienthelper.py:post:100] (MainThread) Got status code 409, Resource already exists request api: /vips namespace: kube-system, no action needed
2021-05-11 14:28:43,487  - INFO - [kubernetes.py:kubernetes_service_to_nsapps:2991] (MainThread) Configuring Type LoadBalancer Service intern-nginx-ingress:kube-system port params:{'name': 'https', 'protocol': 'tcp', 'port': 443, 'targetPort': 443, 'nodePort': 31755, 'vip': '172.31.203.231', 'com/class': 'intern', 'stylebook': None, 'sslcert': {}, 'range-name': None, 'stylebook_params': {}, 'stylebook_service_params': {}}
2021-05-11 14:28:43,488  - INFO - [kubernetes.py:kubernetes_service_to_nsapps:2997] (MainThread) Updating the LoadBalancer service kube-system:intern-nginx-ingress status with IP:172.31.203.231
2021-05-11 14:28:43,497  - INFO - [clienthelper.py:patch:73] (MainThread) Got status code 404, Resource not found: API: /services/intern-nginx-ingress/status namespace kube-system
2021-05-11 14:28:43,515  - INFO - [clienthelper.py:post:100] (MainThread) Got status code 409, Resource already exists request api: /vips namespace: kube-system, no action needed
2021-05-11 14:28:43,516  - INFO - [nitrointerface.py:configure_ns_cs_app:3614] (MainThread) Configuring csvserver: stg-intern-nginx-ingress_80_kube-system_svc and associated services
2021-05-11 14:28:43,585  - INFO - [nitrointerface.py:_create_nsapp_cs_vserver:2725] (MainThread) csvserver stg-intern-nginx-ingress_80_kube-system_svc is created successfully
2021-05-11 14:28:43,585  - INFO - [referencemanager.py:process_unmanaged_add_event:1013] (MainThread) Adding unmanaged entity: kube-system.csvserver_lbsvc.intern-nginx-ingress - stg-intern-nginx-ingress_80_kube-system_svc
2021-05-11 14:28:43,586  - INFO - [nitrointerface.py:create_entities_for_policy:1834] (MainThread) Processing lbvserver:stg-intern-nginx-ingress_80_lbv_3ptbzysfiwgrnie2hic3wxvozioiloaq for csvserver:stg-intern-nginx-ingress_80_kube-system_svc service type for lbvserver: tcp service type for servicegroup:tcp
2021-05-11 14:28:43,648  - INFO - [nitrointerface.py:_create_nsapp_vserver:1237] (MainThread) lbvserver stg-intern-nginx-ingress_80_lbv_3ptbzysfiwgrnie2hic3wxvozioiloaq is created successfully
2021-05-11 14:28:43,767  - INFO - [nitrointerface.py:_bind_default_cs_policy:3231] (MainThread) csvserver stg-intern-nginx-ingress_80_kube-system_svc binding to lbvserver stg-intern-nginx-ingress_80_lbv_3ptbzysfiwgrnie2hic3wxvozioiloaq as default policy is successful
2021-05-11 14:28:43,839  - INFO - [nitrointerface.py:_create_nsapp_service_group:1452] (MainThread) Servicegroup stg-intern-nginx-ingress_80_sgp_3ptbzysfiwgrnie2hic3wxvozioiloaq is created successfully
2021-05-11 14:28:43,904  - INFO - [nitrointerface.py:_bind_service_group_lb:1536] (MainThread) servicegroup stg-intern-nginx-ingress_80_sgp_3ptbzysfiwgrnie2hic3wxvozioiloaq bind to lbvserver stg-intern-nginx-ingress_80_lbv_3ptbzysfiwgrnie2hic3wxvozioiloaq is successful
2021-05-11 14:28:43,965  - INFO - [nitrointerface.py:_configure_services_nondesired:1735] (MainThread) Binding 172.31.102.128:31219 from servicegroup stg-intern-nginx-ingress_80_sgp_3ptbzysfiwgrnie2hic3wxvozioiloaq is successful
2021-05-11 14:28:44,008  - INFO - [nitrointerface.py:_configure_services_nondesired:1735] (MainThread) Binding 172.31.102.125:31219 from servicegroup stg-intern-nginx-ingress_80_sgp_3ptbzysfiwgrnie2hic3wxvozioiloaq is successful
2021-05-11 14:28:44,009  - INFO - [referencemanager.py:process_unmanaged_add_event:1013] (MainThread) Adding unmanaged entity: kube-system.lbvserver.intern-nginx-ingress - stg-intern-nginx-ingress_80_lbv_3ptbzysfiwgrnie2hic3wxvozioiloaq
2021-05-11 14:28:44,009  - INFO - [nitrointerface.py:configure_ns_cs_app:3654] (MainThread) Finished processing instruction to configure stg-intern-nginx-ingress_80_kube-system_svc app associated with stg-intern-nginx-ingress_80_kube-system_svc csvserver
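The `local variable 'crd_name' referenced before assignment` error in the log above is Python's classic `UnboundLocalError` pattern: a name that is only bound on one branch of the event handler is then used unconditionally. A hypothetical sketch of that failure mode and its usual fix, not the actual controller code:

```python
def handle_vip_event(event):
    """Hypothetical sketch: crd_name is only bound when the lookup
    succeeds, yet the final line always uses it."""
    if event.get("object"):
        crd_name = event["object"]["metadata"]["name"]
    # If the resource was already deleted, event["object"] is missing and
    # this raises UnboundLocalError ("referenced before assignment").
    return "handling %s" % crd_name


def handle_vip_event_fixed(event):
    # Fix: bind the name to a defined fallback before any branching.
    crd_name = event.get("object", {}).get("metadata", {}).get("name", "<unknown>")
    return "handling %s" % crd_name
```

This matches the timeline in the log: the VIP CRD was deleted just before the event was delivered, so the handler hit the unbound branch.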
apoorvak-citrix commented 3 years ago

@lhw We are trying to reproduce the issue and find its root cause. We will get back to you on this.

apoorvak-citrix commented 3 years ago

@lhw Unfortunately, we were not able to reproduce this issue. We would like some more details from you:

  1. Is it possible for you to share the IPAM logs from this period?
  2. Is this the complete CIC log for the given timeframe, or is it filtered?
  3. If you still have the VIP CRD resource in your cluster, can it be shared?

     kubectl get vip --all-namespaces
lhw commented 3 years ago

I recreated the issue for you:

  1. Is it possible for you to share the IPAM logs from this period?

Here is the complete log for the time period: https://gist.github.com/lhw/3e998cda187c17ce8bd08ef3ebf1e09d under cic.log. Luckily it is only around 1900 lines for the minute.

  2. Is this the complete CIC log for the given timeframe, or is it filtered?

It wasn't filtered. The gist link also includes the cic-ipam.log.

  3. If you still have the VIP CRD resource in your cluster, can it be shared?

The gist link also contains the VIP yaml that I was able to grab before it was deleted again.

apoorvak-citrix commented 3 years ago

@lhw Just clarifying a few things:

  1. Is there only one instance of Citrix Ingress Controller and IPAM running in the cluster?
  2. Highly unlikely, but is there a daemon process or workload monitor which might be deleting the VIP resources created by the ingress controller in the kube-system namespace?
lhw commented 3 years ago
  1. Is there only one instance of Citrix Ingress Controller and IPAM running in the cluster?

The issue is present on two clusters, but all of our clusters have more than one ingress controller. The cluster from the log has three. The helm values for all three are here: https://gist.github.com/lhw/5a2c52260620ef6f4106b4a7f75417cb; each has its own service-class, though.

  2. Highly unlikely, but is there a daemon process or workload monitor which might be deleting the VIP resources created by the ingress controller in the kube-system namespace?

Only the CIC roles have access to the VIPs, so nothing else can touch the resource. But no, there is no additional tool interacting with it.

lhw commented 3 years ago

Since you pointed out the other CICs: here is the log of the other CICs from the same time period. https://gist.github.com/lhw/7b9f2e2a2b0992317c1f37155060f9e4

They seem to be reacting to the service even though its service class does not match their configured value.
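The expected behavior would be for each controller to skip any service whose class annotation does not match its own serviceClass list. A minimal sketch of such a filter; the annotation key and object shape are assumptions modeled on the `'com/class'` field in the logs, not the actual CIC source:

```python
# Hypothetical service-class filter; annotation key is an assumption.
CLASS_ANNOTATION = "service.citrix.com/class"


def should_handle(service, my_classes):
    """Return True only if the service's class annotation is one of
    this controller's configured serviceClass values."""
    svc_class = (
        service.get("metadata", {}).get("annotations", {}).get(CLASS_ANNOTATION)
    )
    return svc_class in my_classes


svc = {"metadata": {"annotations": {CLASS_ANNOTATION: "intern"}}}
# The 'intern' CIC should act on this service; the 'extern' and
# 'extern-fbt' CICs should not -- the observed bug is that with ipam
# enabled they acted on it anyway.
```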

lhw commented 3 years ago

After disabling the ipam feature on both the extern and extern-fbt CICs, it works now. So it looks like the CICs are ignoring the service-class when IPAM is enabled.
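The flapping in the `kubectl` output above is consistent with two reconcile loops fighting over one status field. A toy simulation, purely illustrative and not taken from the CIC source: two controllers that both believe they own the service will alternate the external IP between a VIP and `<pending>` indefinitely.

```python
# Toy simulation of two controllers fighting over one Service status.
def reconcile(status, controller_ip):
    """Each controller sets its own VIP; on seeing a foreign VIP, it
    first clears the status (deleting the other's VIP)."""
    if status["external_ip"] not in (controller_ip, "<pending>"):
        status["external_ip"] = "<pending>"      # delete the other's VIP
    else:
        status["external_ip"] = controller_ip    # (re)assign our own VIP


status = {"external_ip": "<pending>"}
history = []
for _ in range(4):  # interleave the two controllers' reconcile loops
    reconcile(status, "172.31.203.225")
    history.append(status["external_ip"])
    reconcile(status, "172.31.203.231")
    history.append(status["external_ip"])
# history alternates between one VIP and <pending>, matching the
# kubectl output in the bug report.
```

Once only one controller (or only one with IPAM enabled) claims the service, the loop converges instead of oscillating, which matches the fix of disabling ipam on the non-matching CICs.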

apoorvak-citrix commented 3 years ago

@lhw Thanks a lot for all the information. We know the root cause now; this will be fixed in the next release.