hi @simon-wessel We have fixed the issue where NSIC clears the NetScaler config when there is a communication problem between NSIC and NetScaler, in NSIC v1.42.12. Could you please check the version of NSIC and share the SHA of the image while we try to reproduce the same? Please find the release notes for the same here
Hi @subashd , thank you for your quick response. You are correct, we were indeed using 1.41.5. We upgraded to 1.42.12 the day after and ran into another deletion of "stale" CS vservers, but that may be a result of the incident.
I can see that you added this block to the `get` helper function:

```python
if 'Caused by ConnectTimeoutError' in e:
    newResponse = Response()
    newResponse.status_code = HttpRespCodes.TIMEDOUT
    return False, newResponse
```
I am glad to see that this should indeed help in our connect timeout case. I am wondering whether, rather than just handling `ConnectTimeoutError`, all errors should be handled? The same issue could still happen for any other possible exception like `SSLError`, `Timeout`, ... (see list of exceptions).
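For illustration, here is a minimal sketch of what broader handling could look like (the `k8s_get` name, its signature, and the `HttpRespCodes` stand-in are my assumptions based on the snippet above, not the actual NSIC code):

```python
import requests
from requests.models import Response

class HttpRespCodes:
    """Stand-in for the constant class used in the snippet above."""
    TIMEDOUT = 408

def k8s_get(session, url, **kwargs):
    # Catch the whole requests exception hierarchy instead of
    # string-matching on a single exception type.
    try:
        return True, session.get(url, **kwargs)
    except requests.exceptions.RequestException:
        # Base class of ConnectTimeout, ReadTimeout, SSLError,
        # ConnectionError, and every other exception requests raises.
        err = Response()
        err.status_code = HttpRespCodes.TIMEDOUT
        return False, err
```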
thank you @simon-wessel We will look into the suggestions and will support other possible exceptions in future releases. Closing this issue for now.
Hi @subashd, I am wondering whether this should be kept open while the remaining problems are unresolved?
Furthermore, I would like to suggest considering a refactor of the error handling. With the patch, one error is now handled in a very specific way, but other exceptions and misbehavior might still cause unexpected results.
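Purely as an illustration of the kind of refactoring I mean (all names here are hypothetical), a single wrapper could funnel every transport-level error through one explicit code path:

```python
import functools
import logging

import requests

logger = logging.getLogger(__name__)

class KubeApiUnavailable(Exception):
    """Signals that the Kubernetes API could not be reached."""

def fail_loudly(func):
    # Hypothetical decorator: convert every transport-level error into one
    # explicit exception so no caller can mistake a failed request for an
    # empty result.
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        try:
            return func(*args, **kwargs)
        except requests.exceptions.RequestException as exc:
            logger.error("Kubernetes API request failed: %s", exc)
            raise KubeApiUnavailable(str(exc)) from exc
    return wrapper
```

Any caller decorated this way would then abort its sync cycle instead of silently continuing with empty data.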
hi @simon-wessel Please do not worry, we will review all the errors that are mentioned in the list and add handling for the same.
Describe the bug
There is a bug where the CIC deletes resources on the NetScaler if specific Kubernetes API requests fail for any reason. When the Kube API requests fail, the CIC wrongly assumes that content switches are "stale". After deleting the content switches, the CIC is stuck because it cannot find the content switches it just deleted (a `KeyError` exception).
We also have an open support request, but due to the huge impact and the latest technical insights (see additional context below), we wanted to open another communication channel to the developers. Maybe there are also other affected users here who can provide further insight.
To Reproduce
We are not able to reproduce this behavior, but it has happened twice now. The order of events is as follows:
We fixed this state by manually restoring the deleted resources from our backups until the CIC was able to start again, but this is a complicated and lengthy process.
Even after the CIC was able to start again, not all resources were reconciled and we had to manually detect missing resources like rewrite and responder policies. We were not able to restore these from backups as they have unique IDs in their names. We had to manually recreate the CRD instances in the cluster so that the CIC recreated them.
NSIC version: 1.42.12
NetScaler version: NS13.1 54.29.nc
Our environment variables can be seen in the support ticket.
Expected behavior
There should be proper error handling: a failed API request should not result in the assumption that there are no ingresses.
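A minimal sketch of the semantics we would expect (function names here are hypothetical, not the NSIC code): an API failure should surface as an error or `None`, never as an empty collection that downstream code treats as "no ingresses exist".

```python
from typing import Callable, Optional

def get_all_ingresses(api_get: Callable) -> Optional[dict]:
    """Return the ingress dict on success, or None when the API call failed.

    Returning None instead of {} keeps "the request failed" distinguishable
    from "the cluster really has no ingresses".
    """
    ok, payload = api_get("/ingresses")
    if not ok or payload is None:
        return None
    return payload

def sync_once(api_get: Callable, reconcile: Callable) -> None:
    ingresses = get_all_ingresses(api_get)
    if ingresses is None:
        # Skip this sync cycle entirely rather than reconciling (and
        # deleting) against a view of the cluster we know is wrong.
        return
    reconcile(ingresses)
```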
Logs
We cannot provide our logs here, but our logs are part of the ongoing support ticket.
Additional context
We have looked at the Python code in the image and have an idea why this bug happens. We would appreciate it very much if you would look into it.
1. `configure_cpx_for_all_apps()` in `kubernetes/kubernetes.py` starts the sync:

   ```
   2024-07-29 20:39:56,427 - INFO - [kubernetes.py:configure_cpx_for_all_apps:4708] (MainThread) ADC-SYNC:STARTED
   ```

2. `get_all_ingresses()` is invoked from `configure_cpx_for_all_apps()`.
3. `get_all_ingresses_raw()` is invoked from `get_all_ingresses()`.
4. `_get()` is invoked from `get_all_ingresses_raw()`.
5. `call_K8sClientHelper_method()` is invoked from `_get()` with method `K8sClientHelper_GET_METHOD`.
6. `K8sClientHelper().get` is invoked from `call_K8sClientHelper_method()`. The requests time out, so the `except` block returns `False, None`; the resulting log lines can be seen below:

   ```
   2024-07-29 20:40:06,441 - ERROR - [clienthelper.py:get:38] (MainThread) RequestError while calling /services:HTTPSConnectionPool(host='10.233.0.1', port=443): Read timed out. (read timeout=10)
   2024-07-29 20:40:16,456 - ERROR - [clienthelper.py:get:38] (MainThread) RequestError while calling /ingresses:HTTPSConnectionPool(host='10.233.0.1', port=443): Read timed out. (read timeout=10)
   ```

7. `K8sClientHelper().get()` returns `False, None`.
8. `call_K8sClientHelper_method()` returns `False, None`.
9. `_get()` returns `False, None`.
10. `get_all_ingresses_raw()` returns `{}, None` because of these lines.
11. `get_all_ingresses()` returns `{}`.
12. `ingresses` in `configure_cpx_for_all_apps()` is initialized with `{}`.
13. `services` in `configure_cpx_for_all_apps()` is initialized with `{}` for the same reasons.
14. `csvs_ingress_association` and `service_to_nsapps_mapping` in `configure_cpx_for_all_apps()` are initialized with `{}`.
15. `configure_apps_during_sync()` is invoked from `configure_cpx_for_all_apps()` with `csvs_ingress_association = {}` and `service_to_nsapps_mapping = {}`.
16. `kube_csvs_set` in `configure_apps_during_sync()` is empty because of the passed empty variables.
17. `cleanup_ns_cs_apps()` is invoked from `configure_apps_during_sync()` with `kube_csvs_set = {}`.
18. `_cleanup_ns_cs_apps()` is invoked from `cleanup_ns_cs_apps()` with `kube_csvs_set = {}`.
19. `_find_ns_csvs_to_delete()` is invoked from `_cleanup_ns_cs_apps()` with `kube_csvs_set = {}`.
20. `csvs_to_delete_set = ns_csvs_set - kube_csvs_set` is executed. Since `kube_csvs_set` is based on the faulty, empty ingress list, `csvs_to_delete_set` now contains ALL content switches:

    ```
    2024-07-29 20:40:33,821 - INFO - [nitrointerface.py:_cleanup_ns_cs_apps:1517] (MainThread) ADC-SYNC: Stale CS Vservers to be deleted: {<Long JSON list of CS - Redacted>}
    ```
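To make the failure mode concrete: with `kube_csvs_set` empty, `ns_csvs_set - kube_csvs_set` evaluates to the whole `ns_csvs_set`, so every CS vserver is marked stale. A hypothetical guard (our own sketch, not the NSIC implementation; the `kube_fetch_ok` flag is an assumption) could refuse such a mass deletion:

```python
def find_ns_csvs_to_delete(ns_csvs_set: set, kube_csvs_set: set,
                           kube_fetch_ok: bool) -> set:
    """Sketch of a guard around the set difference shown in the logs above."""
    if not kube_fetch_ok:
        # The Kubernetes view is unreliable; deleting anything now would
        # act on data we know is wrong.
        return set()
    if ns_csvs_set and not kube_csvs_set:
        # An empty cluster view while the ADC still has CS vservers looks
        # far more like an API failure than a genuine mass removal; refuse
        # to delete everything in one sync cycle.
        return set()
    return ns_csvs_set - kube_csvs_set
```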