Skupper exposed services change ClusterIP constantly

muzcategui1106-gs commented 1 year ago

There seems to be an issue with the skupper-router or the skupper-service-controller that causes the ClusterIPs of exposed services to change constantly. This creates issues down the line, as we have automation that depends on the ClusterIPs of the services.

Issue I am facing: The ClusterIPs of exposed services change constantly (every couple of hours or so).

Kubernetes Distribution: OpenShift 4.10

Versions: skupper-site-controller 1.2.0, skupper-router 2.2.0, skupper-service-controller 1.2.0, config-sync 1.2.0

Environment: The source cluster sits on an internal network, and the destination cluster, where the services are exposed, is in AWS. The internal cluster connects to the AWS cluster through AWS Direct Connect and a transit VPC.

How to replicate: Unfortunately, I am not able to replicate it; it just happens sporadically.

What I think is happening: By definition, the ClusterIP of a Service can never change. This suggests to me that the skupper-service-controller is somehow deleting the services and recreating them. The cause could be a connectivity issue in the link between the Skupper routers.
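One way to check the delete-and-recreate theory: a Service that is deleted and recreated gets a new `metadata.uid` as well as a new ClusterIP. The sketch below compares two snapshots in the shape returned by `kubectl get svc -o json`. The service names, UIDs, and IPs are made up for illustration, and `find_recreated` is a hypothetical helper, not part of Skupper or kubectl.

```python
def find_recreated(before, after):
    """Return (name, old_ip, new_ip) for services whose UID changed.

    Both arguments are dicts shaped like the output of
    `kubectl get svc -o json` (a top-level "items" list).
    """
    old_by_name = {svc["metadata"]["name"]: svc for svc in before["items"]}
    recreated = []
    for svc in after["items"]:
        name = svc["metadata"]["name"]
        previous = old_by_name.get(name)
        if previous is None:
            continue  # service did not exist in the first snapshot
        if previous["metadata"]["uid"] != svc["metadata"]["uid"]:
            recreated.append(
                (name, previous["spec"]["clusterIP"], svc["spec"]["clusterIP"])
            )
    return recreated


# Hypothetical snapshots taken a few hours apart.
snapshot_before = {"items": [
    {"metadata": {"name": "svc-a", "uid": "uid-1"}, "spec": {"clusterIP": "10.0.0.1"}},
    {"metadata": {"name": "svc-b", "uid": "uid-2"}, "spec": {"clusterIP": "10.0.0.2"}},
]}
snapshot_after = {"items": [
    {"metadata": {"name": "svc-a", "uid": "uid-9"}, "spec": {"clusterIP": "10.0.0.9"}},
    {"metadata": {"name": "svc-b", "uid": "uid-2"}, "spec": {"clusterIP": "10.0.0.2"}},
]}

recreated_services = find_recreated(snapshot_before, snapshot_after)
```

If a service shows up here, its ClusterIP changed because the object itself was replaced, which would point at the controller rather than at Kubernetes reassigning the IP.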

Supporting information: I am unable to post this at the moment.

Happy to hear thoughts on things that I can look at. I am happy to provide supporting information through more private channels.

grs commented 1 year ago

Try skupper debug events and kubectl get events in the namespace the problem is occurring in.

How are you selecting the services in question for exposure with skupper? Using the cli? Or annotations (on deployment or service resource)?

muzcategui1106-gs commented 1 year ago

The services are being exposed from an internal cluster to the AWS cluster using the following annotation on the service:

skupper.io/proxy=tcp

Please see the detailed events at the end of this message. Regardless of where the problem lies (network or skupper-router), it is always safe to assume that something will happen and the links will be temporarily interrupted, e.g. by networking blips. To mitigate this, I propose a configurable option that lets users specify a grace period during which the service-controller won't delete services if the Skupper links are interrupted. This would ensure that, even when there are networking blips, problems in the cluster don't last longer than the blip itself because Services were recreated, which could break other services that depend on the exported ones.
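To make the proposal concrete, here is a minimal sketch of the grace-period logic. It is not how the Skupper service-controller is implemented; `GracePeriodTracker`, `observe`, and the 3600-second value are hypothetical names and numbers chosen for illustration. The idea is that a service is only deleted once its binding has been missing continuously for the whole grace period.

```python
class GracePeriodTracker:
    """Decide when a service with a missing binding may be deleted."""

    def __init__(self, grace_seconds):
        self.grace_seconds = grace_seconds
        # service name -> time at which its binding was first seen missing
        self._missing_since = {}

    def observe(self, name, has_binding, now):
        """Record one observation; return True if the service should be deleted.

        `now` is a timestamp in seconds supplied by the caller, which keeps
        the logic easy to test without a real clock.
        """
        if has_binding:
            # Binding is back: forget any earlier outage for this service.
            self._missing_since.pop(name, None)
            return False
        first_missing = self._missing_since.setdefault(name, now)
        return now - first_missing >= self.grace_seconds


tracker = GracePeriodTracker(grace_seconds=3600)
```

With a one-hour grace period, a link that drops and recovers within the hour never triggers a delete, so the Service object and its ClusterIP survive the blip.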

Here are the events from both commands you gave me. Please note that I have removed the names of specific services and replaced them with XXX.

 ./skupper --namespace skupper-services debug events
NAME                         COUNT                                                                                                                                                                                  AGE
ServiceControllerEvent       127                                                                                                                                                                                    14m6s
                             5     service event for skupper-services/XX                                                                                                                             14m6s
                             1     Checking service for: XXX                                                                                                                                            22m32s
                             1     Checking service for: XXX                                                                                                                                                 22m32s
                             1     Checking service for: XXX                                                                                                                                                    22m32s
                             1     Checking service for: XXX                                                                                                                                                    22m32s
DefinitionMonitorEvent       76                                                                                                                                                                                     14m6s
                             5     service event for skupper-services/XXX                                                                                                                             14m6s
                             1     Service definitions have changed                                                                                                                                                 22m34s
                             1     service event for skupper-services/XXX                                                                                                                                    10h25m52s
                             1     service event for skupper-services/XXX                                                                                                                                        10h25m52s
                             1     service event for skupper-services/XXX                                                                                                                                          10h25m52s
ServiceSyncEvent             120                                                                                                                                                                                    22m34s
                             1     Service interface(s) modified XXX                                                                                                                                    22m34s
                             1     Service interface(s) added                                                                                                                                                       10h25m55s
                                   XXX,XXX,XXX,XXX
                             1     Service sync sender connection to                                                                                                                                                10h25m58s
                                   amqps://skupper-router-local.skupper-services.svc.cluster.local:5671
                                   established
                             1     Service sync receiver connection to                                                                                                                                              10h25m58s
                                   amqps://skupper-router-local.skupper-services.svc.cluster.local:5671
                                   established
                             1     Error receiving updates: dial tcp 11.162.177.64:5671:                                                                                                                            10h25m58s
                                   connect: connection timed out
ServiceControllerUpdateEvent 3                                                                                                                                                                                      10h25m54s
                             3     Updating skupper-internal                                                                                                                                                        10h25m54s
SiteQueryError               111                                                                                                                                                                                    10h25m58s
                             1     Error handling requests: Could not get management agent:                                                                                                                         10h25m58s
                                   Failed to create connection: dial tcp 11.162.177.64:5671:
                                   connect: connection timed out
                             108   Error handling requests: Could not get management agent:                                                                                                                         10h28m10s
                                   Failed to create connection: dial tcp 11.162.177.64:5671:
                                   connect: connection refused
                             1     Error handling requests: Error handling request for                                                                                                                              10h28m10s
                                   fa223687-76e0-4e54-ad15-486d1a7a4d29/skupper-site-query:
                                   Failed reading request from
                                   fa223687-76e0-4e54-ad15-486d1a7a4d29/skupper-site-query: EOF
                             1     Failed to get site url: routes.route.openshift.io                                                                                                                                10h38m0s
                                   "skupper-inter-router" not found
ServiceControllerDeleteEvent 38                                                                                                                                                                                     10h26m41s
                             1     Deleting service XXX                                                                                                                                                        10h26m41s
                             1     No service binding found for XXX                                                                                                                                            10h26m41s
                             1     Deleting service XXX                                                                                                                                                          10h26m42s
                             1     No service binding found for XX                                                                                                                                               10h26m42s
                             1     Deleting service XXX                                                                                                                                                    10h26m43s
IpMappingEvent               11                                                                                                                                                                                     10h28m7s
                             1     11.34.21.127 mapped to skupper-router-6c64485986-m6g2j                                                                                                                           10h28m7s
                             1     mapping for 11.34.15.10 deleted                                                                                                                                                  10h28m9s
                             4     11.34.15.10 mapped to skupper-router-6c64485986-rgbjz                                                                                                                            10h28m9s
                             4      mapped to skupper-router-6c64485986-m6g2j                                                                                                                                       10h28m9s
                             1     11.34.14.58 mapped to                                                                                                                                                            10h37m54s
                                   skupper-service-controller-d756fccfd-lwcbx

kubectl -n skupper-services get events
No resources found in skupper-services namespace.
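To line up the events in the `skupper debug events` listing, it helps to convert the AGE column to seconds. The sketch below does that for three AGE values copied from the output above. It assumes AGE is the time elapsed since the event was last seen; `age_to_seconds` is a hypothetical helper, not part of the skupper CLI.

```python
import re

# Matches ages such as "10h28m10s", "22m34s", or "14m6s".
_AGE_RE = re.compile(r"(?:(\d+)h)?(?:(\d+)m)?(?:(\d+)s)?")


def age_to_seconds(age):
    """Convert an AGE string from the event listing to seconds."""
    match = _AGE_RE.fullmatch(age)
    if not age or match is None:
        raise ValueError(f"unrecognized age: {age!r}")
    hours, minutes, seconds = (int(part) if part else 0 for part in match.groups())
    return hours * 3600 + minutes * 60 + seconds


# AGE values copied from the listing above.
refused = age_to_seconds("10h28m10s")       # "connection refused" SiteQueryError
first_delete = age_to_seconds("10h26m41s")  # earliest listed "Deleting service"
reconnected = age_to_seconds("10h25m58s")   # service sync connections established

# A larger age means an older event, so subtract the newer from the older.
refused_to_delete = refused - first_delete
delete_to_reconnect = first_delete - reconnected
```

Under that assumption, the earliest listed delete comes 89 seconds after the connection-refused errors and 43 seconds before the service sync connections are established again, which is consistent with the services being deleted during a short outage.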

grs commented 1 year ago

> What I propose to mitigate this is that there is a configurable option where users can specify a grace period in which the service-controller wont delete services, should the skupper links get interrupted.

I think this is a good idea. It is probably too sensitive at present and should be adjustable.

muzcategui1106-gs commented 1 year ago

Yeah, there might be environments where it needs to be aggressive. But in our case I would set it to something big, like 1 hour or so.

skupperproject / skupper

Skupper exposed services change ClusterIP constantly #972