kubernetes-retired / service-catalog

Consume services in Kubernetes using the Open Service Broker API
https://svc-cat.io
Apache License 2.0

Provide delay after performing orphan mitigation when Service Instance provisioning fails #2025

Closed polskikiel closed 6 years ago

polskikiel commented 6 years ago

Hello, we wrote about our issue in #2006. It's about an infinite loop that occurs when Service Instance provisioning fails, e.g. when some condition from our business logic isn't fulfilled.

The loop looks like this:

start provisioning -> fail, requires orphan mitigation -> perform orphan mitigation 
-> start provisioning -> …

In issue #2006, @nilebox suggested using the reconciliation-retry-duration flag, but it didn't help us. That flag defines how much time the Service Catalog has to poll for the last operation. In our case, each step is a separate operation, so the flag doesn't apply. We need a similar limit, but for the whole loop shown above.

These are the logs from the controller-manager when I started provisioning a Service Instance that fails:

08:30:35.840997       1 controller_instance.go:1143] ServiceInstance "production/promotions-service1": Provision request for ServiceInstance in-flight to Broker
08:30:36.044199       1 controller_instance.go:1143] ServiceInstance "production/promotions-service1": The instance is being provisioned asynchronously
08:30:36.053577       1 event.go:218] Event(v1.ObjectReference{Kind:"ServiceInstance", Namespace:"production", Name:"promotions-service1", UID:"3edc6ed3-5363-11e8-83ff-0242ac110022", APIVersion:"servicecatalog.k8s.io", ResourceVersion:"2317", FieldPath:""}): type: 'Normal' reason: 'Provisioning' The instance is being provisioned asynchronously
08:30:37.055727       1 controller_instance.go:1143] ServiceInstance "production/promotions-service1": Provision call failed: Forbidden provisioning instance [3edc6a28-5363-11e8-83ff-0242ac110022] for remote environment [id: ac031e8c-9aa4-4cb7-8999-0d358726ffaa] in namespace: [production]. Reason: [already activated]
08:30:37.055791       1 controller_instance.go:1143] ServiceInstance "production/promotions-service1": Provision call failed: Forbidden provisioning instance [3edc6a28-5363-11e8-83ff-0242ac110022] for remote environment [id: ac031e8c-9aa4-4cb7-8999-0d358726ffaa] in namespace: [production]. Reason: [already activated]
08:30:37.055872       1 controller_instance.go:1143] ServiceInstance "production/promotions-service1": The instance provision call failed with an ambiguous error; attempting to deprovision the instance in order to mitigate an orphaned resource
08:30:37.056042       1 event.go:218] Event(v1.ObjectReference{Kind:"ServiceInstance", Namespace:"production", Name:"promotions-service1", UID:"3edc6ed3-5363-11e8-83ff-0242ac110022", APIVersion:"servicecatalog.k8s.io", ResourceVersion:"2318", FieldPath:""}): type: 'Warning' reason: 'ProvisionCallFailed' Provision call failed: Forbidden provisioning instance [3edc6a28-5363-11e8-83ff-0242ac110022] for remote environment [id: ac031e8c-9aa4-4cb7-8999-0d358726ffaa] in namespace: [production]. Reason: [already activated]
08:30:37.056060       1 event.go:218] Event(v1.ObjectReference{Kind:"ServiceInstance", Namespace:"production", Name:"promotions-service1", UID:"3edc6ed3-5363-11e8-83ff-0242ac110022", APIVersion:"servicecatalog.k8s.io", ResourceVersion:"2318", FieldPath:""}): type: 'Warning' reason: 'StartingInstanceOrphanMitigation' The instance provision call failed with an ambiguous error; attempting to deprovision the instance in order to mitigate an orphaned resource
08:30:37.248135       1 controller_instance.go:1143] ServiceInstance "production/promotions-service1": The instance is being deprovisioned asynchronously
08:30:37.258984       1 event.go:218] Event(v1.ObjectReference{Kind:"ServiceInstance", Namespace:"production", Name:"promotions-service1", UID:"3edc6ed3-5363-11e8-83ff-0242ac110022", APIVersion:"servicecatalog.k8s.io", ResourceVersion:"2319", FieldPath:""}): type: 'Normal' reason: 'Deprovisioning' The instance is being deprovisioned asynchronously
08:30:39.067922       1 controller_instance.go:1143] ServiceInstance "production/promotions-service1": Orphan mitigation was completed successfully
08:30:39.074778       1 event.go:218] Event(v1.ObjectReference{Kind:"ServiceInstance", Namespace:"production", Name:"promotions-service1", UID:"3edc6ed3-5363-11e8-83ff-0242ac110022", APIVersion:"servicecatalog.k8s.io", ResourceVersion:"2320", FieldPath:""}): type: 'Normal' reason: 'OrphanMitigationSuccessful' Orphan mitigation was completed successfully
08:30:39.240388       1 controller_instance.go:1143] ServiceInstance "production/promotions-service1": Provision request for ServiceInstance in-flight to Broker
08:30:39.445902       1 controller_instance.go:1143] ServiceInstance "production/promotions-service1": The instance is being provisioned asynchronously
08:30:39.451549       1 event.go:218] Event(v1.ObjectReference{Kind:"ServiceInstance", Namespace:"production", Name:"promotions-service1", UID:"3edc6ed3-5363-11e8-83ff-0242ac110022", APIVersion:"servicecatalog.k8s.io", ResourceVersion:"2322", FieldPath:""}): type: 'Normal' reason: 'Provisioning' The instance is being provisioned asynchronously
08:30:40.454954       1 controller_instance.go:1143] ServiceInstance "production/promotions-service1": Provision call failed: Forbidden provisioning instance [3edc6a28-5363-11e8-83ff-0242ac110022] for remote environment [id: ac031e8c-9aa4-4cb7-8999-0d358726ffaa] in namespace: [production]. Reason: [already activated]
08:30:40.455026       1 controller_instance.go:1143] ServiceInstance "production/promotions-service1": Provision call failed: Forbidden provisioning instance [3edc6a28-5363-11e8-83ff-0242ac110022] for remote environment [id: ac031e8c-9aa4-4cb7-8999-0d358726ffaa] in namespace: [production]. Reason: [already activated]
08:30:40.455087       1 controller_instance.go:1143] ServiceInstance "production/promotions-service1": The instance provision call failed with an ambiguous error; attempting to deprovision the instance in order to mitigate an orphaned resource
08:30:40.455744       1 event.go:218] Event(v1.ObjectReference{Kind:"ServiceInstance", Namespace:"production", Name:"promotions-service1", UID:"3edc6ed3-5363-11e8-83ff-0242ac110022", APIVersion:"servicecatalog.k8s.io", ResourceVersion:"2323", FieldPath:""}): type: 'Warning' reason: 'ProvisionCallFailed' Provision call failed: Forbidden provisioning instance [3edc6a28-5363-11e8-83ff-0242ac110022] for remote environment [id: ac031e8c-9aa4-4cb7-8999-0d358726ffaa] in namespace: [production]. Reason: [already activated]
08:30:40.455777       1 event.go:218] Event(v1.ObjectReference{Kind:"ServiceInstance", Namespace:"production", Name:"promotions-service1", UID:"3edc6ed3-5363-11e8-83ff-0242ac110022", APIVersion:"servicecatalog.k8s.io", ResourceVersion:"2323", FieldPath:""}): type: 'Warning' reason: 'StartingInstanceOrphanMitigation' The instance provision call failed with an ambiguous error; attempting to deprovision the instance in order to mitigate an orphaned resource
08:30:40.642871       1 controller_instance.go:1143] ServiceInstance "production/promotions-service1": The instance is being deprovisioned asynchronously
08:30:40.654647       1 event.go:218] Event(v1.ObjectReference{Kind:"ServiceInstance", Namespace:"production", Name:"promotions-service1", UID:"3edc6ed3-5363-11e8-83ff-0242ac110022", APIVersion:"servicecatalog.k8s.io", ResourceVersion:"2324", FieldPath:""}): type: 'Normal' reason: 'Deprovisioning' The instance is being deprovisioned asynchronously
08:30:40.765473       1 controller_manager.go:414] Health-checking connection with service-catalog API server
08:30:42.471662       1 controller_instance.go:1143] ServiceInstance "production/promotions-service1": Orphan mitigation was completed successfully
08:30:42.482370       1 event.go:218] Event(v1.ObjectReference{Kind:"ServiceInstance", Namespace:"production", Name:"promotions-service1", UID:"3edc6ed3-5363-11e8-83ff-0242ac110022", APIVersion:"servicecatalog.k8s.io", ResourceVersion:"2325", FieldPath:""}): type: 'Normal' reason: 'OrphanMitigationSuccessful' Orphan mitigation was completed successfully
08:30:42.639328       1 controller_instance.go:1143] ServiceInstance "production/promotions-service1": Provision request for ServiceInstance in-flight to Broker

and so on, forever.

As you can see, the Service Catalog continuously spams our broker with provision and deprovision requests, with no delay and no maxRetries (or maxRetryDuration) limit.

We would expect a flow like this:

(start provisioning -> fail, requires orphan mitigation -> perform orphan mitigation ->  exponential delay) x maxRetries

Our idea is to provide an exponential delay after performing an orphan mitigation to avoid spamming the brokers; a rough sketch of what we mean follows. Let us know what you think about it :) Also let us know if you need the code snippets from the Service Catalog where the whole cycle is executed. Thank you in advance.
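To make the expectation concrete, here is a purely illustrative sketch (all names invented, this is not Service Catalog code) of the behavior we would like to see: a growing delay applied to the whole provision/orphan-mitigation cycle, with an upper bound on retries.

```go
package main

import (
	"fmt"
	"time"
)

const (
	initialRetryDelay = 5 * time.Second
	maxRetries        = 15
)

// provisionAndMitigate stands in for one full provision attempt plus the
// orphan mitigation that follows a failure; it is a placeholder, not real code.
func provisionAndMitigate() bool { return false }

func main() {
	delay := initialRetryDelay
	for attempt := 1; attempt <= maxRetries; attempt++ {
		if provisionAndMitigate() {
			return
		}
		fmt.Printf("attempt %d failed, retrying in %s\n", attempt, delay)
		time.Sleep(delay)
		delay *= 2 // exponential delay applied to the whole cycle, not each step
	}
	fmt.Println("giving up after maxRetries")
}
```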

polskikiel commented 6 years ago

I'm using Service Catalog in version 0.1.17

jianzhangbjz commented 6 years ago

@polskikiel Thank you for your report and suggestion. We also hit this problem. And the version should be v0.1.17, right?

nilebox commented 6 years ago

Thanks for reporting, @polskikiel; the behavior you described suggests that we have a bug in our code.

Our idea is to provide an exponential delay after performing an orphan mitigation to avoid spamming the brokers.

That's exactly how we agreed it should behave, I think. @kibbles-n-bytes, can you confirm? Any ideas what causes this bug? From memory, it sounds like we are confirming some stage of the queue key processing as successfully complete (and not returning the error in the result that would trigger the exponential retry delay)?

polskikiel commented 6 years ago

@jianzhangbjz Yes, sorry I meant v0.1.17.

kibbles-n-bytes commented 6 years ago

@nilebox Yes, that's the intended behavior, so we definitely have a bug. My guess is the successful OM is clearing the rate limiter for the instance queue, so there isn't a backoff when the provision retries, though I haven't looked at the code in a while.

jboyd01 commented 6 years ago

It's easy to reproduce this. I took the ups broker and modified it to return a 205 on instance provisioning. This triggers the tight loop described above.
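Not the actual ups-broker change, but a minimal stand-in that reproduces the same condition: a provision endpoint that always answers 205, so every provision looks like an ambiguous failure and triggers orphan mitigation (paths per the OSB API; everything else here is a simplification).

```go
package main

import "net/http"

func main() {
	mux := http.NewServeMux()
	// OSB provision requests are PUT /v2/service_instances/{instance_id};
	// always answering 205 makes every provision look like an ambiguous
	// failure, which sends the controller into orphan mitigation.
	mux.HandleFunc("/v2/service_instances/", func(w http.ResponseWriter, r *http.Request) {
		if r.Method == http.MethodPut {
			w.WriteHeader(205)
			return
		}
		// let deprovision (DELETE) and everything else succeed
		w.WriteHeader(http.StatusOK)
		w.Write([]byte("{}"))
	})
	http.ListenAndServe(":8080", mux)
}
```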

From what I'm seeing, each cycle basically passes through the controller's service instance work queue three times before it starts all over again:

1) Reconciliation determines it's an add, sets up the status and inProgressProperties, and quits out of the reconciliation loop with success.

2) Reconciliation sees that it's an add, sends the provision request to the broker, gets back the 205 status, and sets orphan mitigation as needed. This run of reconciliation returns with an error, causing the worker to requeue the instance with backoff.

3) Reconciliation sees orphan mitigation is needed and sends the delete request to the broker; this succeeds, updates the status, and returns success to the worker.

The process then starts over again. The only place we do any backoff is in step 2, but the initial backoff is set to 5 milliseconds, and we keep resetting the requeues to zero in all steps but 2. So we never achieve any significant backoff, and we never get higher than 1 on the retry count.

Unless steps 1 and 3 result in a requeue, we're not going to achieve a significant backoff from this queue, and I'm unaware of anything else that would gate how frequently we retry. The behavior in step 1 (returning success to the worker) interrupts any notion of retries. I understand that we must separate status updates out in the reconciliation loop, but it looks to me like this breaks the fundamentals of how the backoff and retry are supposed to work here: we keep resetting any state that indicates we failed an overall operation and need to retry.
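To make the 5 millisecond figure concrete, here's a self-contained snippet of the per-item exponential limiter that client-go's default controller rate limiter builds on (illustrative only; I haven't re-verified exactly which limiter our instance queue is constructed with):

```go
package main

import (
	"fmt"
	"time"

	"k8s.io/client-go/util/workqueue"
)

func main() {
	// Same shape as client-go's default per-item limiter: 5ms base delay,
	// doubling on every failure, capped at the max delay.
	rl := workqueue.NewItemExponentialFailureRateLimiter(5*time.Millisecond, 1000*time.Second)

	key := "production/promotions-service1"
	for i := 0; i < 5; i++ {
		fmt.Println(rl.When(key)) // 5ms, 10ms, 20ms, 40ms, 80ms
	}

	// Forgetting the key resets the failure count; this is effectively what
	// happens when steps 1 and 3 report success to the worker.
	rl.Forget(key)
	fmt.Println(rl.When(key)) // back to 5ms
}
```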

I updated the code so that passes 1 and 3 result in the worker doing a requeue, but the backoff is never significant, and even after we stop requeuing because we exceeded maxRetries, the instance gets put right back into the queue and nothing really changes. I think the watch is what puts it back in the queue?

nilebox commented 6 years ago

I understand that we must separate status updates out in the reconciliation loop, but it looks to me like this breaks the fundamentals around how the backoff and retry is supposed to work here

I agree, and the reason is that we treat three consecutive stages of the reconciliation as parts of one operation to meet OSB requirements, while normally in Kubernetes there are no such stages?

jboyd01 commented 6 years ago

Trying to debug this, I made the instance status update and orphan mitigation code return a failure to the worker so it would requeue with a backoff, and I can see that logging as I'd expect. But I still see a very tight loop with the worker constantly being invoked - there is no backoff between invocations.

The worker() (https://github.com/kubernetes-incubator/service-catalog/blob/master/pkg/controller/controller.go#L304-L335) seems to do the right thing when we indicate it's a failure (I added debugging to https://github.com/kubernetes-incubator/service-catalog/blob/master/vendor/k8s.io/client-go/util/workqueue/default_rate_limiters.go and can see it's using increasing backoffs), but every time we exit the worker, instead of the backoff time elapsing, I see the worker immediately re-invoked for the instance. I believe the problem is that the instance is getting requeued by https://github.com/kubernetes-incubator/service-catalog/blob/master/pkg/controller/controller_instance.go#L103, which is invoked from a thread with this call stack:

vendor/k8s.io/client-go/tools/cache.(*ResourceEventHandlerFuncs).OnUpdate(0xc4206a8ea0, 0x1a8d080, 0xc4201f2b40, 0x1a8d080, 0xc4201f3d40)
vendor/k8s.io/client-go/tools/cache.(*processorListener).run.func1.1(0xc4207480a0, 0x0, 0xc420025300)
vendor/k8s.io/apimachinery/pkg/util/wait.ExponentialBackoff(0x989680, 0x3ff0000000000000, 0x3fb999999999999a, 0x5, 0xc42073adf0, 0x18, 0x8a0980)
vendor/k8s.io/client-go/tools/cache.(*processorListener).run.func1()
vendor/k8s.io/apimachinery/pkg/util/wait.JitterUntil.func1(0xc42002cf68)
vendor/k8s.io/apimachinery/pkg/util/wait.JitterUntil(0xc42073af68, 0xdf8475800, 0x0, 0x1829501, 0xc420746060)
vendor/k8s.io/apimachinery/pkg/util/wait.Until(0xc42002cf68, 0xdf8475800, 0xc420746060)
vendor/k8s.io/client-go/tools/cache.(*processorListener).run(0xc420712100)
vendor/k8s.io/client-go/tools/cache.(*processorListener).(github.com/kubernetes-incubator/service-catalog/vendor/k8s.io/client-go/tools/cache.run)-fm()
vendor/k8s.io/apimachinery/pkg/util/wait.(*Group).Start.func1(0xc42026a680, 0xc42043a000)
created by github.com/kubernetes-incubator/service-catalog/vendor/k8s.io/apimachinery/pkg/util/wait.(*Group).Start

which I think is a watch on the instance kicking in because it sees an update to the instance.

Does this make sense? Should the watch be causing the reconciliation loop to fire immediately like this vs waiting for the backoff?

@kibbles-n-bytes this kind of sounds like #1143

kibbles-n-bytes commented 6 years ago

@jboyd01 I believe there are issues with our rate-limiting of non-async operations. My guess is it has to do with our handlers here: https://github.com/kubernetes-incubator/service-catalog/blob/9678178de2bb7ec29c4be24915d5a4eac2b6f7e0/pkg/controller/controller_instance.go#L96-L114

Notice that on instanceUpdate we are doing a c.instanceQueue.Add(key). I believe this bypasses the rate limiter, so it'll never actually back off. Just a hunch; I haven't tested it.

Could you try changing that line to c.instanceQueue.AddRateLimited(key) and see if the behavior changes?
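For anyone following along, the difference between the two calls, shown with the plain client-go workqueue API (illustrative snippet, not the actual handler code): Add enqueues immediately, while AddRateLimited asks the queue's rate limiter for a delay first, which is what lets the backoff actually take effect when the watch re-enqueues a failing instance.

```go
package main

import "k8s.io/client-go/util/workqueue"

func main() {
	q := workqueue.NewNamedRateLimitingQueue(
		workqueue.DefaultControllerRateLimiter(), "service-instance")

	key := "production/promotions-service1"

	q.Add(key)            // enqueues immediately, no backoff consulted
	q.AddRateLimited(key) // asks the rate limiter for a delay first

	// In the worker, the outcome controls whether backoff keeps growing:
	q.AddRateLimited(key) // on error: requeue with the next backoff step
	q.Forget(key)         // on success: reset the per-item failure count
	q.Done(key)
}
```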

jboyd01 commented 6 years ago

@kibbles-n-bytes that definitely helps - I now actually see some exponential backoff. By the time we get to max requeues (15), we have a 40 second delay. It feels like we could stand to take a second look at the starting delay factor - it's really small at 5 milliseconds with a max of only 15 requeues.

Eventually we hit max retries so the worker doesn't requeue and does a queue.Forget(). But a watch immediately triggers and puts the instance back into the queue and we start all over again resuming reconciliation where we left off which restarts the attempt to provision -> orphan mitigation all over again.

If we want to use max requeues to determine when to stop this cycle of attempt to provision -> orphan mitigation -> reattempt, then I think we'll need some new logic that changes the instance's status to indicate we shouldn't attempt to provision again. I don't know that the queue's NumRequeues(key) is a good mechanism to rely on for this, though; I think perhaps a new provision attempt counter in the status would be better (rough sketch after the events output below). Or perhaps we could tell from the events, i.e., the number of times we've seen ProvisionCallFailed.

Events:
  Type     Reason                             Age                 From                                Message
  ----     ------                             ----                ----                                -------
  Warning  ReferencesNonexistentServiceClass  39m (x12 over 39m)  service-catalog-controller-manager  References a non-existent ClusterServiceClass {ClassExternalName:"user-provided-service"} or there is more than one (found: 0)
  Warning  StartingInstanceOrphanMitigation   37m (x5 over 38m)   service-catalog-controller-manager  The instance provision call failed with an ambiguous error; attempting to deprovision the instance in order to mitigate an orphaned resource
  Normal   OrphanMitigationSuccessful         14m (x49 over 38m)  service-catalog-controller-manager  Orphan mitigation was completed successfully
  Warning  ProvisionCallFailed                4m (x71 over 38m)   service-catalog-controller-manager  Error provisioning ServiceInstance of ClusterServiceClass (K8S: "4f6e6cf6-ffdd-425f-a2c7-3c9258ad2468" ExternalName: "user-provided-service") at ClusterServiceBroker "ups-broker": Status: 205; ErrorMessage: <nil>; Description: <nil>; ResponseError: <nil>

I'm not certain why I have those ReferencesNonexistentServiceClass events in there...
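For what I mean by a status counter, a purely hypothetical sketch (the field and helper names are invented; nothing like this exists in the API today): bump a counter on each failed provision attempt and stop retrying once it crosses a limit, independent of the workqueue's NumRequeues, which resets whenever the key is forgotten.

```go
package main

import "fmt"

// instanceStatus is a stand-in for the real ServiceInstanceStatus; the
// ProvisionAttempts field is hypothetical and does not exist today.
type instanceStatus struct {
	ProvisionAttempts int // bumped on every ProvisionCallFailed
}

const maxProvisionAttempts = 15

func shouldRetryProvision(st instanceStatus) bool {
	return st.ProvisionAttempts < maxProvisionAttempts
}

func main() {
	st := instanceStatus{ProvisionAttempts: 16}
	// false -> mark the instance Failed and stop the provision/OM loop
	fmt.Println(shouldRetryProvision(st))
}
```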

kibbles-n-bytes commented 6 years ago

Eventually we hit max retries so the worker doesn't requeue and does a queue.Forget(). But a watch immediately triggers and puts the instance back into the queue and we start all over again resuming reconciliation where we left off which restarts the attempt to provision -> orphan mitigation all over again.

Hah! That's funny, but makes total sense. It's an unfortunate consequence of having two separate areas where we're dealing with queuing.

What is our desired behavior, and what matches other Kubernetes resource backoffs? The way it was intended, we were supposed to retry with backoff until we hit maxRetries, at which point we stop. And then on the next watch sync, we would restart the retry loop with our rate limiter reset.

We could certainly fix it to do that, but depending on what other resources do and what we desire, we may want to play around more with our queuing code. It may be that we should do quick retries only for things not hitting the broker, and use much longer retry intervals for things that communicate across the wire (like we do for async polling). It just depends what we want.
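Purely as an illustration of that split (made-up numbers, not a proposal for exact values), the broker-bound queue could be constructed with a slower per-item limiter while local-only work stays on the stock fast one:

```go
package main

import (
	"time"

	"k8s.io/client-go/util/workqueue"
)

func main() {
	// Broker-bound work: start at 5s and back off up to 10 minutes.
	brokerLimiter := workqueue.NewItemExponentialFailureRateLimiter(5*time.Second, 10*time.Minute)
	instanceQueue := workqueue.NewNamedRateLimitingQueue(brokerLimiter, "service-instance")

	// Purely local bookkeeping could stay on the stock fast limiter.
	localQueue := workqueue.NewNamedRateLimitingQueue(
		workqueue.DefaultControllerRateLimiter(), "instance-status")

	_, _ = instanceQueue, localQueue
}
```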

ash2k commented 6 years ago

I've just observed this issue with one of our failing ServiceInstances. SC retries without backoff.

jboyd01 commented 6 years ago

@ash2k was this failure on synchronous provisioning? If not, any additional details would be great.

jboyd01 commented 6 years ago

I see that when this scenario is run with synchronous provisioning and deprovisioning, with provisioning made to always fail, we actually invoke the provision call twice in a row and then the orphan mitigation twice in a row.

Log details:

```
I0523 15:19:16.172209 1 controller_instance.go:387] ServiceInstance "myproject/my-service-instance3": Provisioning a new ServiceInstance of ClusterServiceClass (K8S: "4f6e6cf6-ffdd-425f-a2c7-3c9258ad2468" ExternalName: "user-provided-service") at ClusterServiceBroker "ups-broker"
I0523 15:19:16.202978 1 client.go:206] handling failure responses
I0523 15:19:16.203076 1 controller_instance.go:1143] ServiceInstance "myproject/my-service-instance3": Error provisioning ServiceInstance of ClusterServiceClass (K8S: "4f6e6cf6-ffdd-425f-a2c7-3c9258ad2468" ExternalName: "user-provided-service") at ClusterServiceBroker "ups-broker": Status: 205; ErrorMessage: ; Description: ; ResponseError:
I0523 15:19:16.203092 1 controller_instance.go:1144] ServiceInstance "myproject/my-service-instance3": Setting condition "Ready" to False
I0523 15:19:16.203109 1 controller_instance.go:1143] ServiceInstance "myproject/my-service-instance3": Error provisioning ServiceInstance of ClusterServiceClass (K8S: "4f6e6cf6-ffdd-425f-a2c7-3c9258ad2468" ExternalName: "user-provided-service") at ClusterServiceBroker "ups-broker": Status: 205; ErrorMessage: ; Description: ; ResponseError:
I0523 15:19:16.203106 1 event.go:218] Event(v1.ObjectReference{Kind:"ServiceInstance", Namespace:"myproject", Name:"my-service-instance3", UID:"a7bdd210-5e9c-11e8-a6dc-0242ac110008", APIVersion:"servicecatalog.k8s.io", ResourceVersion:"20", FieldPath:""}): type: 'Warning' reason: 'ProvisionCallFailed' Error provisioning ServiceInstance of ClusterServiceClass (K8S: "4f6e6cf6-ffdd-425f-a2c7-3c9258ad2468" ExternalName: "user-provided-service") at ClusterServiceBroker "ups-broker": Status: 205; ErrorMessage: ; Description: ; ResponseError:
I0523 15:19:16.203125 1 controller_instance.go:1144] ServiceInstance "myproject/my-service-instance3": Setting condition "OrphanMitigation" to True
I0523 15:19:16.203144 1 controller_instance.go:1182] ServiceInstance "myproject/my-service-instance3": Setting lastTransitionTime, condition "OrphanMitigation" to 2018-05-23 15:19:16.203103661 +0000 UTC m=+1123.166465631
I0523 15:19:16.203156 1 controller_instance.go:1143] ServiceInstance "myproject/my-service-instance3": The instance provision call failed with an ambiguous error; attempting to deprovision the instance in order to mitigate an orphaned resource
I0523 15:19:16.203164 1 controller_instance.go:1144] ServiceInstance "myproject/my-service-instance3": Setting condition "Ready" to False
I0523 15:19:16.203172 1 controller_instance.go:1211] ServiceInstance "myproject/my-service-instance3": Updating status
I0523 15:19:16.203238 1 event.go:218] Event(v1.ObjectReference{Kind:"ServiceInstance", Namespace:"myproject", Name:"my-service-instance3", UID:"a7bdd210-5e9c-11e8-a6dc-0242ac110008", APIVersion:"servicecatalog.k8s.io", ResourceVersion:"20", FieldPath:""}): type: 'Warning' reason: 'StartingInstanceOrphanMitigation' The instance provision call failed with an ambiguous error; attempting to deprovision the instance in order to mitigate an orphaned resource
I0523 15:19:16.207423 1 controller.go:366] Error syncing ServiceInstance myproject/my-service-instance3: Error provisioning ServiceInstance of ClusterServiceClass (K8S: "4f6e6cf6-ffdd-425f-a2c7-3c9258ad2468" ExternalName: "user-provided-service") at ClusterServiceBroker "ups-broker": Status: 205; ErrorMessage: ; Description: ; ResponseError:
I0523 15:19:16.212631 1 controller_instance.go:358] ServiceInstance "myproject/my-service-instance3": Processing adding event
I0523 15:19:16.212652 1 controller.go:457] ServiceInstance "myproject/my-service-instance3": Creating client for ClusterServiceBroker ups-broker, URL: http://ups-broker.kube-service-catalog.svc.cluster.local
I0523 15:19:16.213939 1 controller_instance.go:387] ServiceInstance "myproject/my-service-instance3": Provisioning a new ServiceInstance of ClusterServiceClass (K8S: "4f6e6cf6-ffdd-425f-a2c7-3c9258ad2468" ExternalName: "user-provided-service") at ClusterServiceBroker "ups-broker"
I0523 15:19:16.244728 1 client.go:206] handling failure responses
I0523 15:19:16.244863 1 controller_instance.go:1143] ServiceInstance "myproject/my-service-instance3": Error provisioning ServiceInstance of ClusterServiceClass (K8S: "4f6e6cf6-ffdd-425f-a2c7-3c9258ad2468" ExternalName: "user-provided-service") at ClusterServiceBroker "ups-broker": Status: 205; ErrorMessage: ; Description: ; ResponseError:
I0523 15:19:16.244876 1 controller_instance.go:1144] ServiceInstance "myproject/my-service-instance3": Setting condition "Ready" to False
I0523 15:19:16.244891 1 controller_instance.go:1143] ServiceInstance "myproject/my-service-instance3": Error provisioning ServiceInstance of ClusterServiceClass (K8S: "4f6e6cf6-ffdd-425f-a2c7-3c9258ad2468" ExternalName: "user-provided-service") at ClusterServiceBroker "ups-broker": Status: 205; ErrorMessage: ; Description: ; ResponseError:
I0523 15:19:16.244900 1 controller_instance.go:1144] ServiceInstance "myproject/my-service-instance3": Setting condition "OrphanMitigation" to True
I0523 15:19:16.244916 1 controller_instance.go:1182] ServiceInstance "myproject/my-service-instance3": Setting lastTransitionTime, condition "OrphanMitigation" to 2018-05-23 15:19:16.244887274 +0000 UTC m=+1123.208249242
I0523 15:19:16.244925 1 controller_instance.go:1143] ServiceInstance "myproject/my-service-instance3": The instance provision call failed with an ambiguous error; attempting to deprovision the instance in order to mitigate an orphaned resource
```

and then Orphan Mitigation twice:

```
I0523 15:19:16.514567 1 controller_instance.go:639] ServiceInstance "myproject/my-service-instance3": Sending deprovision request to broker
I0523 15:19:16.547852 1 controller_instance.go:1100] ServiceInstance "myproject/my-service-instance3": Removing condition "OrphanMitigation"
I0523 15:19:16.547882 1 controller_instance.go:1107] ServiceInstance "myproject/my-service-instance3": Found existing condition "OrphanMitigation": "True"; removing it
I0523 15:19:16.547888 1 controller_instance.go:1143] ServiceInstance "myproject/my-service-instance3": Orphan mitigation was completed successfully
I0523 15:19:16.547894 1 controller_instance.go:1144] ServiceInstance "myproject/my-service-instance3": Setting condition "Ready" to False
I0523 15:19:16.547899 1 controller_instance.go:1211] ServiceInstance "myproject/my-service-instance3": Updating status
I0523 15:19:16.551214 1 controller_instance.go:570] ServiceInstance "myproject/my-service-instance3": Processing deleting event
I0523 15:19:16.551244 1 controller.go:457] ServiceInstance "myproject/my-service-instance3": Creating client for ClusterServiceBroker ups-broker, URL: http://ups-broker.kube-service-catalog.svc.cluster.local
I0523 15:19:16.551298 1 controller_instance.go:639] ServiceInstance "myproject/my-service-instance3": Sending deprovision request to broker
I0523 15:19:16.551263 1 event.go:218] Event(v1.ObjectReference{Kind:"ServiceInstance", Namespace:"myproject", Name:"my-service-instance3", UID:"a7bdd210-5e9c-11e8-a6dc-0242ac110008", APIVersion:"servicecatalog.k8s.io", ResourceVersion:"21", FieldPath:""}): type: 'Normal' reason: 'OrphanMitigationSuccessful' Orphan mitigation was completed successfully
I0523 15:19:16.581732 1 controller_instance.go:1100] ServiceInstance "myproject/my-service-instance3": Removing condition "OrphanMitigation"
I0523 15:19:16.581751 1 controller_instance.go:1107] ServiceInstance "myproject/my-service-instance3": Found existing condition "OrphanMitigation": "True"; removing it
I0523 15:19:16.581757 1 controller_instance.go:1143] ServiceInstance "myproject/my-service-instance3": Orphan mitigation was completed successfully
I0523 15:19:16.581763 1 controller_instance.go:1144] ServiceInstance "myproject/my-service-instance3": Setting condition "Ready" to False
I0523 15:19:16.581769 1 controller_instance.go:1211] ServiceInstance "myproject/my-service-instance3": Updating status
```

jboyd01 commented 6 years ago

Proposed short-term fix: enforce a delay of X between retry attempts when provisioning an instance. X defaults to 10 minutes but can be overridden with a controller parameter if customers have concerns about their brokers being hit too hard.

@pmorie and I have discussed this, and I've got the basics working. @kibbles-n-bytes, does this sound OK to you as a short-term solution? It keeps us from pounding brokers. Obviously we still have to come back in here and fix the real issue.

The workaround looks achievable by maintaining an in-memory map of instances with a retry-no-earlier-than timestamp: set it during orphan mitigation, and add a new check against it before we do an Add (rough sketch below). I'm close to having a working fix.
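A minimal sketch of that in-memory gate, with all names invented (this is not the actual patch): the delay would default to 10 minutes and be overridable by a controller flag, and the gate defers the requeue with AddAfter instead of dropping it.

```go
package retrydelay

import (
	"sync"
	"time"

	"k8s.io/client-go/util/workqueue"
)

// retryGate remembers, per instance key, the earliest time we are willing to
// retry provisioning again.
type retryGate struct {
	mu             sync.Mutex
	notEarlierThan map[string]time.Time
	delay          time.Duration // e.g. 10 * time.Minute, overridable by a controller flag
}

func newRetryGate(delay time.Duration) *retryGate {
	return &retryGate{notEarlierThan: map[string]time.Time{}, delay: delay}
}

// markFailed would be called once orphan mitigation for a failed provision completes.
func (g *retryGate) markFailed(key string) {
	g.mu.Lock()
	defer g.mu.Unlock()
	g.notEarlierThan[key] = time.Now().Add(g.delay)
}

// enqueue adds the key right away, or schedules it for later if it is gated.
func (g *retryGate) enqueue(q workqueue.RateLimitingInterface, key string) {
	g.mu.Lock()
	defer g.mu.Unlock()
	if t, ok := g.notEarlierThan[key]; ok && time.Now().Before(t) {
		// RateLimitingInterface embeds DelayingInterface, so AddAfter is available.
		q.AddAfter(key, time.Until(t))
		return
	}
	q.Add(key)
}
```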