netscaler / netscaler-k8s-ingress-controller

NetScaler Ingress Controller for Kubernetes:
https://developer-docs.citrix.com/projects/citrix-k8s-ingress-controller/en/latest/

Faulty Netscaler resource garbage collection when Kubernetes API requests fail #665

Closed: simon-wessel closed this issue 2 months ago

simon-wessel commented 2 months ago

Describe the bug

There is a bug where the CIC deletes resources on the NetScaler if certain Kubernetes API requests fail for any reason. When the Kube API requests fail, the CIC wrongly assumes that the content switches are "stale". After deleting the content switches, the CIC is stuck because it cannot find the content switches it just deleted (a KeyError exception is raised).

We also have an open support request, but given the severe impact and our latest technical findings (see additional context below), we wanted to open another communication channel with the developers. Perhaps other affected users here can also provide further insight.

To Reproduce

We have not been able to reproduce this behavior, but it has happened twice now. The order of events is as follows:

  1. Kubernetes API request fails
  2. The CIC assumes the CS vservers are stale and should be deleted
  3. The CIC is unable to start because it cannot find the content switches that should belong to the ingresses in the cluster

We recovered from this state by manually restoring the deleted resources from our backups until the CIC was able to start again, but this is a complicated and lengthy process.

Even after the CIC was able to start again, not all resources were reconciled, and we had to manually detect missing resources such as rewrite and responder policies. We could not restore these from backups because they carry unique IDs in their names. Instead, we had to manually recreate the CRD instances in the cluster so that the CIC recreated them.

  1. Version of the NetScaler Ingress Controller

1.42.12

  2. Version of MPX/VPX/CPX

NS13.1 54.29.nc

  3. Environment variables (minus secrets)

Our environment variables can be seen in the support ticket.

Expected behavior

There should be proper error handling: a failed Kubernetes API request should not result in the assumption that there are no ingresses in the cluster.
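
For illustration only, here is a hedged sketch of what such error handling could look like (the helper name and client interface are assumptions, not the actual NSIC code): a fetch failure is surfaced as an exception instead of being returned as an empty list, so callers cannot mistake "the request failed" for "there are no ingresses".

    # Hypothetical sketch, not the actual NSIC implementation.
    import requests

    class KubeApiError(Exception):
        """Raised when a Kubernetes API request fails; callers must not
        treat this as 'no resources exist'."""

    def get_all_ingresses_raw(client, api="/ingresses"):
        # 'client' is assumed to expose a get(url=...) method similar to the
        # helper quoted in the additional context below.
        try:
            response = client.get(url=api)
        except requests.exceptions.RequestException as e:
            # Propagate the failure instead of returning an empty result, so
            # the sync loop can abort rather than mark every CS vserver stale.
            raise KubeApiError("failed to fetch {}: {}".format(api, e))
        if response.status_code >= 300:
            raise KubeApiError("{} returned HTTP {}".format(api, response.status_code))
        return response.json().get("items", [])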

Logs

We cannot provide our logs here, but they are part of the ongoing support ticket.

Additional context

We have looked at the Python code in the image and have an idea of why this bug happens. We would very much appreciate it if you would look into it.

  1. ADC Sync starts (function configure_cpx_for_all_apps in kubernetes/kubernetes.py)
  2. get_all_ingresses() is invoked from configure_cpx_for_all_apps()
  3. get_all_ingresses_raw() is invoked from get_all_ingresses()
  4. _get() is invoked from get_all_ingresses_raw()
  5. call_K8sClientHelper_method() is invoked from _get() with method K8sClientHelper_GET_METHOD
  6. K8sClientHelper().get() is invoked from call_K8sClientHelper_method()
  7. This function contains the following code, which catches the RequestException and therefore returns False, None:
        try:
            response = self.client.get(url=api, namespace=namespace)
        except requests.exceptions.RequestException as e:
            logger.error('RequestError while calling  {}:{}'.format(api, e))
            return False, None
  8. The log messages from the except block can be seen below:
    • 2024-07-29 20:40:06,441 - ERROR - [clienthelper.py:get:38] (MainThread) RequestError while calling /services:HTTPSConnectionPool(host='10.233.0.1', port=443): Read timed out. (read timeout=10)
    • 2024-07-29 20:40:16,456 - ERROR - [clienthelper.py:get:38] (MainThread) RequestError while calling /ingresses:HTTPSConnectionPool(host='10.233.0.1', port=443): Read timed out. (read timeout=10)
  9. K8sClientHelper().get() returns False, None
  10. call_K8sClientHelper_method() returns False, None
  11. _get() returns False, None
  12. get_all_ingresses_raw() returns an empty list ([]) and None because of these lines:
        success, response = self._get(api) # This function returns False, None
        ingresses = []
        if not success and response and response.status_code >= 300:
            status = response.json()
            if status['reason'] == 'NotFound':
                logger.info("Api %s not found" % api)
        if not success: # Success is false, therefore the function returns here with an empty ingresses list
            return ingresses, None
  13. get_all_ingresses() returns {}
  14. ingresses in configure_cpx_for_all_apps() is initialized with {}
  15. services in configure_cpx_for_all_apps() is initialized with {} for the same reasons
  16. csvs_ingress_association and service_to_nsapps_mapping in configure_cpx_for_all_apps() are initialized with {}
  17. configure_apps_during_sync() is invoked from configure_cpx_for_all_apps() with csvs_ingress_association = {} and service_to_nsapps_mapping = {}
  18. kube_csvs_set in configure_apps_during_sync() is empty because of the passed empty variables
  19. cleanup_ns_cs_apps() is invoked from configure_apps_during_sync() with kube_csvs_set = {}
  20. _cleanup_ns_cs_apps() is invoked from cleanup_ns_cs_apps() with kube_csvs_set = {}
  21. _find_ns_csvs_to_delete() is invoked from _cleanup_ns_cs_apps() with kube_csvs_set = {}
  22. csvs_to_delete_set = ns_csvs_set - kube_csvs_set is executed, and kube_csvs_set is based on the faulty, empty ingress list. Therefore csvs_to_delete_set now contains ALL content switches.
  23. The CIC continues to delete all content switches based on the content of csvs_to_delete_set (see the minimal illustration after this list):
    • 2024-07-29 20:40:33,821 - INFO - [nitrointerface.py:_cleanup_ns_cs_apps:1517] (MainThread) ADC-SYNC: Stale CS Vservers to be deleted: {<Long JSON list of CS - Redacted>}
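
To make the impact of the last two steps concrete, here is a minimal standalone illustration (the vserver names are invented; this is not the actual NSIC code):

    # When the ingress fetch fails, kube_csvs_set ends up empty and the set
    # difference marks every CS vserver on the NetScaler as "stale".
    ns_csvs_set = {"k8s-ingress-a_80_csvs", "k8s-ingress-b_443_csvs"}  # vservers currently on the ADC
    kube_csvs_set = set()  # empty because get_all_ingresses() returned nothing after the timeout

    csvs_to_delete_set = ns_csvs_set - kube_csvs_set
    print(csvs_to_delete_set)  # both vservers are now deletion candidates
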
subashd commented 2 months ago

Hi @simon-wessel, we have fixed the issue where NSIC clears the NetScaler config when there is a communication problem between NSIC and NetScaler; the fix is in NSIC v1.42.12. Could you please check your NSIC version and share the SHA of the image while we try to reproduce the issue? Please find the release notes here.

simon-wessel commented 2 months ago

Hi @subashd, thank you for your quick response. You are correct: we were indeed using 1.41.5. We upgraded to 1.42.12 the day after and ran into another deletion of "stale" CS vservers, but that may have been a consequence of the original incident.

I can see that you added this block to the get helper function:

            if 'Caused by ConnectTimeoutError' in e:
                newResponse = Response()
                newResponse.status_code = HttpRespCodes.TIMEDOUT
                return False, newResponse

I am glad to see that this should indeed help in our connect timeout case. I am wondering whether, rather than handling only ConnectTimeoutError, all errors should be handled. The same issue could still happen for any other possible exception, such as SSLError, Timeout, ... (see the list of exceptions).
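
For example, catching the whole requests exception hierarchy instead of string-matching a single error message would cover all of these in one place. A rough sketch (the synthetic status code is just a placeholder, not the value NSIC actually uses):

    import logging
    import requests
    from requests import Response

    logger = logging.getLogger(__name__)

    def get(client, api, namespace=None):
        # 'client' stands in for the Kubernetes client helper used by NSIC.
        try:
            response = client.get(url=api, namespace=namespace)
        except requests.exceptions.RequestException as e:
            # Covers ConnectTimeout, ReadTimeout, SSLError, ConnectionError, ...
            # so every transport-level failure is reported the same way.
            logger.error('RequestError while calling %s: %s', api, e)
            newResponse = Response()
            newResponse.status_code = 504  # placeholder status code
            return False, newResponse
        return True, response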

subashd commented 2 months ago

Thank you @simon-wessel. We will look into the suggestions and add handling for other possible exceptions in future releases. Closing this issue for now.

simon-wessel commented 2 months ago

Hi @subashd, I am wondering whether this issue should be kept open while the remaining problems are unresolved.

Furthermore, I would like to suggest considering a refactoring of the error handling. With the patch, one error is now handled in a very specific way, but other exceptions and misbehavior might still cause unexpected results. One possible direction is sketched below.
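
As one possible direction (a hypothetical sketch, not based on the actual NSIC internals), the cleanup path itself could refuse to garbage-collect when the Kubernetes state could not be fetched, or when the computed "stale" set would wipe out every CS vserver:

    def find_ns_csvs_to_delete(ns_csvs_set, kube_csvs_set, kube_fetch_ok):
        # kube_fetch_ok would be False whenever any Kubernetes API read failed.
        if not kube_fetch_ok:
            # The view of the cluster is unreliable; skip garbage collection entirely.
            return set()
        csvs_to_delete_set = ns_csvs_set - kube_csvs_set
        if csvs_to_delete_set == ns_csvs_set and ns_csvs_set:
            # Deleting *all* vservers is far more likely to mean "the fetch failed"
            # than "every ingress was removed at once"; treat it as suspicious.
            return set()
        return csvs_to_delete_set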

subashd commented 2 months ago

Hi @simon-wessel, please do not worry; we will review all the errors mentioned in the list and add handling for them.