Open muzcategui1106-gs opened 1 year ago
Try skupper debug events and kubectl get events in the namespace where the problem is occurring.
How are you selecting the services in question for exposure with skupper? Using the CLI, or annotations (on the deployment or service resource)?
The services are being exposed from an internal cluster to the AWS cluster using this annotation on the service:
skupper.io/proxy=tcp
Please see detailed events at the end of this message. Regardless of where the problem lies (network or skupper router), it is always safe to assume that something will happen and the links will get temporarily interrupted, e.g. by networking blips. What I propose to mitigate this is a configurable option where users can specify a grace period during which the service-controller won't delete services if the skupper links get interrupted. This would ensure that, even when there are networking blips, issues in the cluster don't outlast the blip itself because Services are being deleted and recreated, which could cause problems for other services depending on those exported services.
Here are the events from both commands you gave me. Please note I have removed the names of specific services and replaced them with XXX.
./skupper --namespace skupper-services debug events
NAME COUNT AGE
ServiceControllerEvent 127 14m6s
5 service event for skupper-services/XX 14m6s
1 Checking service for: XXX 22m32s
1 Checking service for: XXX 22m32s
1 Checking service for: XXX 22m32s
1 Checking service for: XXX 22m32s
DefinitionMonitorEvent 76 14m6s
5 service event for skupper-services/XXX 14m6s
1 Service definitions have changed 22m34s
1 service event for skupper-services/XXX 10h25m52s
1 service event for skupper-services/XXX 10h25m52s
1 service event for skupper-services/XXX 10h25m52s
ServiceSyncEvent 120 22m34s
1 Service interface(s) modified XXX 22m34s
1 Service interface(s) added 10h25m55s
XXX,XXX,XXX,XXX
1 Service sync sender connection to 10h25m58s
amqps://skupper-router-local.skupper-services.svc.cluster.local:5671
established
1 Service sync receiver connection to 10h25m58s
amqps://skupper-router-local.skupper-services.svc.cluster.local:5671
established
1 Error receiving updates: dial tcp 11.162.177.64:5671: 10h25m58s
connect: connection timed out
ServiceControllerUpdateEvent 3 10h25m54s
3 Updating skupper-internal 10h25m54s
SiteQueryError 111 10h25m58s
1 Error handling requests: Could not get management agent: 10h25m58s
Failed to create connection: dial tcp 11.162.177.64:5671:
connect: connection timed out
108 Error handling requests: Could not get management agent: 10h28m10s
Failed to create connection: dial tcp 11.162.177.64:5671:
connect: connection refused
1 Error handling requests: Error handling request for 10h28m10s
fa223687-76e0-4e54-ad15-486d1a7a4d29/skupper-site-query:
Failed reading request from
fa223687-76e0-4e54-ad15-486d1a7a4d29/skupper-site-query: EOF
1 Failed to get site url: routes.route.openshift.io 10h38m0s
"skupper-inter-router" not found
ServiceControllerDeleteEvent 38 10h26m41s
1 Deleting service XXX 10h26m41s
1 No service binding found for XXX 10h26m41s
1 Deleting service XXX 10h26m42s
1 No service binding found for XX 10h26m42s
1 Deleting service XXX 10h26m43s
IpMappingEvent 11 10h28m7s
1 11.34.21.127 mapped to skupper-router-6c64485986-m6g2j 10h28m7s
1 mapping for 11.34.15.10 deleted 10h28m9s
4 11.34.15.10 mapped to skupper-router-6c64485986-rgbjz 10h28m9s
4 mapped to skupper-router-6c64485986-m6g2j 10h28m9s
1 11.34.14.58 mapped to 10h37m54s
skupper-service-controller-d756fccfd-lwcbx
kubectl -n skupper-services get events
No resources found in skupper-services namespace.
What I propose to mitigate this is that there is a configurable option where users can specify a grace period in which the service-controller won't delete services, should the skupper links get interrupted.
I think this is a good idea. It is probably too sensitive at present and should be adjustable.
Yeah, there might be environments where it needs to be aggressive. But in our case I would set it to something big, like 1 hour or so.
There seems to be an issue with the skupper-router or the skupper-service-controller in which ClusterIPs are constantly changing. This creates issues down the line as we have automation that depends on the ClusterIPs of the services.
Issue I am facing: The ClusterIPs of exposed services are constantly changing (every couple of hours or so).
Kubernetes distribution: OpenShift 4.10
Versions: skupper-site-controller 1.2.0, skupper-router 2.2.0, skupper-service-controller 1.2.0, config-sync 1.2.0
Environment: The source cluster sits on an internal network, and the destination cluster where the services are exposed is in AWS. There is connectivity from the internal cluster to the AWS cluster through AWS Direct Connect and a transit VPC.
How to replicate: Unfortunately, I am not able to replicate it; it just happens sporadically.
What I think is happening: By definition, the ClusterIP of a Service can never change. This suggests to me that the skupper-service-controller is somehow deleting the services and recreating them. The reason could be a connectivity issue in the link between Skupper routers.
Supporting information: I am unable to post this at the moment.
Happy to hear thoughts on things that I can look at. I am happy to provide supporting information through more private channels.