kubernetes-retired / service-catalog

Consume services in Kubernetes using the Open Service Broker API
https://svc-cat.io
Apache License 2.0

ServiceCatalog floods brokers when provisioning is permanently not possible. #2006

Closed mszostok closed 6 years ago

mszostok commented 6 years ago

Hi all :)

In a nutshell, our issue is this: the flow of ServiceInstance provisioning was changed after a refactoring (https://github.com/kubernetes-incubator/service-catalog/pull/1886). Now the Service Catalog runs in an infinite loop:

start provisioning -> fail, requires orphan mitigation -> perform orphan mitigation 
-> start provisioning -> …

It will keep doing so until provisioning finally succeeds.

Details

Scenario - Service Catalog version 0.1.3 - correct flow

  1. The Service Catalog triggers provisioning of the given ClusterServiceClass.
  2. We respond with HTTP code 202 (Accepted) and trigger an asynchronous operation, which fails for some reason.
  3. When the Service Catalog calls the last_operation endpoint, we return HTTP code 200 (OK) with the body (a minimal broker-side sketch of this response follows after this list):
    {
      "state": "failed",
      "description": "Cannot provision instance. <reason>"
    }
  4. The Service Catalog marks the ServiceInstance as failed, sets Deprovision Request to Required and does not perform any additional actions. When we call deprovision on such a failed instance, the Service Catalog also sends a deprovision request to the broker.
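
For illustration only, here is a minimal, hypothetical broker-side handler (not taken from our broker or any real one) showing what step 3 looks like on the wire: the broker answers the last_operation poll with HTTP 200 and expresses the failure in the body, not the status code.

```go
// Hypothetical broker-side last_operation handler, illustrating step 3:
// the asynchronous provisioning has failed, so the poll is answered with
// HTTP 200 (OK) and a "failed" state in the JSON body.
package main

import (
	"encoding/json"
	"log"
	"net/http"
)

type lastOperationResponse struct {
	State       string `json:"state"`       // "in progress", "succeeded" or "failed"
	Description string `json:"description"` // human-readable reason shown to the user
}

func lastOperationHandler(w http.ResponseWriter, r *http.Request) {
	resp := lastOperationResponse{
		State:       "failed",
		Description: "Cannot provision instance. <reason>",
	}
	w.Header().Set("Content-Type", "application/json")
	w.WriteHeader(http.StatusOK) // the failure is in the body, not the status code
	_ = json.NewEncoder(w).Encode(resp)
}

func main() {
	// Route simplified; the real OSB path is
	// /v2/service_instances/:instance_id/last_operation.
	http.HandleFunc("/v2/service_instances/", lastOperationHandler)
	log.Fatal(http.ListenAndServe(":8080", nil))
}
```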

Scenario - Service Catalog version 0.1.11 - broken flow

  1. Steps 1-3 are the same as above.
  2. Step 4: the Service Catalog marks the ServiceInstance as failed, sets Deprovision Request to Not Required and does not perform any additional actions.

When we call deprovision on such a failed instance, the Service Catalog does not send a deprovision request to the Service Broker. You created a ticket to fix this (https://github.com/kubernetes-incubator/service-catalog/issues/1879), but the PR that was supposed to fix it (https://github.com/kubernetes-incubator/service-catalog/pull/1886) caused something more: in versions above 0.1.11 the Service Catalog now has an infinite loop

start provisioning -> fail, requires orphan mitigation -> perform orphan mitigation 
-> start provisioning -> …

This loop "spams" our brokers because there are no delays and max retries. In some cases such approach breaks our flow. We can also consider an example when someone has a Azure Service Broker and wants to provision a db service. Instance can failed e.g. because the quota in given namespace was exhausted. So repating describe action over and over does not make sense.

The question is: why are you performing orphan mitigation when the Service Broker responded with a 200 (OK) status code?

In the spec (https://github.com/openservicebrokerapi/servicebroker/blob/v2.13/spec.md#orphans) we can find that in such a case you shouldn't do that.

Could you elaborate on your new approach to provisioning? I cannot find any documentation about it in the Service Catalog. What's more, in my opinion you should restore the previous behaviour (from version 0.1.3): when the Service Broker responds with a 200 status code on the last_operation endpoint with the body

{
  "state": "failed",
  "description": "Cannot provision instance. <reason>"
}

you should only mark this ServiceInstance as failed and stop polling. Then, when the user sends a deprovision request, you should call the Service Broker. Thanks to that, the given Service Broker will perform a deprovision action (a kind of garbage collection) for the failed instance.
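
To make the proposal concrete, here is a rough sketch of the behaviour described above, using made-up types and function names rather than the actual Service Catalog controller code:

```go
// Sketch of the proposed behaviour (illustrative only, not Service Catalog code):
// a failed last_operation poll marks the instance as failed and stops polling;
// the broker is contacted again only when the user asks for deprovisioning.
package reconcile

// Instance is a simplified stand-in for ServiceInstance status fields.
type Instance struct {
	Failed              bool
	DeprovisionRequired bool
}

// Broker is a simplified stand-in for an OSB client.
type Broker interface {
	Deprovision(instanceID string) error
}

// OnLastOperationFailed handles `"state": "failed"`: no retry, no orphan
// mitigation, but remember that deprovision must still reach the broker.
func OnLastOperationFailed(in *Instance) {
	in.Failed = true
	in.DeprovisionRequired = true
}

// OnUserDeprovision forwards the user's deprovision request to the broker,
// letting the broker do its own cleanup ("garbage collection") for the
// failed instance.
func OnUserDeprovision(in *Instance, b Broker, instanceID string) error {
	if !in.DeprovisionRequired {
		return nil
	}
	return b.Deprovision(instanceID)
}
```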

Thank you in advance

nilebox commented 6 years ago

@mszostok thanks for reporting this with so much detail; there seems to be something to clarify in the docs, or to fix/alter in the Service Catalog behavior.

TL;DR

  1. The retry loop in the Service Catalog (for both sync and async OSB operations) was introduced to improve UX and make it as close as possible to the Kubernetes reconciliation-loop UX: retry with exponential backoff.
  2. Orphan mitigation is potentially needed for both sync and async operations based on the discussions we had (see the detailed description below).

This loop "spams" our brokers because there are no delays and max retries.

This doesn't seem right. The retry loop should have an exponential backoff delay; if it doesn't, that's a bug, so please report an issue, preferably with a detailed log/example. As for max retries, there is a maximum retry duration, which is 7 days by default but can be tuned via the reconciliation-retry-duration flag, see https://github.com/kubernetes-incubator/service-catalog/blob/master/cmd/controller-manager/app/options/options.go#L119
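
For context, here is a self-contained sketch of the kind of behaviour described above: retries with an exponentially growing delay, abandoned once a total retry window has elapsed. It is illustrative only, not the Service Catalog implementation; in the real controller the window corresponds to the reconciliation-retry-duration setting mentioned above.

```go
// Illustrative retry loop with exponential backoff and a capped total
// retry window (not the actual Service Catalog implementation).
package main

import (
	"errors"
	"fmt"
	"time"
)

func retryWithBackoff(op func() error, initial, maxDelay, window time.Duration) error {
	deadline := time.Now().Add(window)
	delay := initial
	for {
		err := op()
		if err == nil {
			return nil
		}
		if time.Now().After(deadline) {
			return fmt.Errorf("giving up after %s: %w", window, err)
		}
		time.Sleep(delay)
		delay *= 2 // exponential backoff
		if delay > maxDelay {
			delay = maxDelay
		}
	}
}

func main() {
	attempts := 0
	err := retryWithBackoff(func() error {
		attempts++
		if attempts < 3 {
			return errors.New("provisioning failed")
		}
		return nil
	}, 100*time.Millisecond, 2*time.Second, time.Minute)
	fmt.Println("attempts:", attempts, "err:", err)
}
```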

> The question is: why are you performing orphan mitigation when the Service Broker responded with a 200 (OK) status code?

First of all, I would like to provide more context. Initially, when I introduced the retry loop in #1765, there was no orphan mitigation for async operations. Then it was reported as a bug, see the discussion thread https://github.com/kubernetes-incubator/service-catalog/pull/1765/files#r177482025. So in order to fix this bug I sent #1886 to perform orphan mitigation for failed async operations.

So to answer why orphan mitigation might be needed for async operations, let's first reiterate why orphan mitigation is part of the OSB spec at all. In an ideal world, it would be the OSB broker's concern whether to perform a cleanup or not. There is general agreement that in the next major release of the OSB spec (aka v3, which doesn't exist yet) we will remove the orphan mitigation requirement from the spec completely. But in v2, as CloudFoundry legacy, the OSB spec inherited the burden of making cleanup the platform's problem (i.e. the Service Catalog's or CloudFoundry's), not the broker's.

That implies that a broker doesn't have to have a retry and cleanup loop; it could be just a "simple" proxy that reacts to the platform poking it via the REST API. Which effectively means that, as a result of failed deprovisioning, the service managed by the OSB broker could end up in an "unknown", abandoned state, with previously created resources potentially left wasting users' money.

Now back to your question. A "simple stateless broker" that supports the async API could also lack a retry loop, just like a sync one. And because the side effects of failed provisioning are unknown per OSB spec v2 (both sync and async), the platform needs to perform orphan mitigation "just in case". If the broker has nothing to clean up, it can simply implement deprovisioning of a non-existing instance as a no-op.
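
To illustrate that last point, a hypothetical "simple" broker could answer a deprovision (orphan mitigation) call for an instance it never actually created as a no-op. The sketch below is not from any real broker; per OSB v2 a missing instance is typically answered with 410 Gone.

```go
// Hypothetical broker-side deprovision handler: if nothing was ever
// created for the instance, orphan mitigation is effectively a no-op.
package main

import (
	"fmt"
	"log"
	"net/http"
	"strings"
	"sync"
)

var (
	mu        sync.Mutex
	instances = map[string]bool{} // instance_id -> provisioned
)

func deprovisionHandler(w http.ResponseWriter, r *http.Request) {
	// Simplified routing: everything after the prefix is the instance ID.
	id := strings.TrimPrefix(r.URL.Path, "/v2/service_instances/")

	mu.Lock()
	_, exists := instances[id]
	delete(instances, id)
	mu.Unlock()

	if !exists {
		// Nothing to clean up: answer the cleanup request with 410 Gone.
		w.WriteHeader(http.StatusGone)
		fmt.Fprint(w, "{}")
		return
	}
	w.WriteHeader(http.StatusOK)
	fmt.Fprint(w, "{}")
}

func main() {
	http.HandleFunc("/v2/service_instances/", deprovisionHandler)
	log.Fatal(http.ListenAndServe(":8080", nil))
}
```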

> you should only mark this ServiceInstance as failed and stop polling.

What the retry loop in the Service Catalog is addressing is improving the UX, making it as close as possible to the Kubernetes-native UX. What if the failed provisioning was just the result of a temporary network blip and it will succeed on retry? That's how Kubernetes works: it keeps retrying to deploy a Pod even if the specified Docker image doesn't exist yet, because it might be present at the next retry. Having just one chance to provision an instance doesn't sound like the Kubernetes way, or like good UX in general.

nilebox commented 6 years ago

Also, to clarify the scope of the issue: "when provisioning is permanently not possible" in the title implies that the platform should either treat any async provisioning failure as permanent, or be able to determine whether a failure is temporary or permanent.

Unfortunately, with OSB API v2 it is impossible for the platform (Service Catalog) to know whether a failure is temporary or permanent. The OSB broker just returns state: failed in the response to last_operation; there is no error categorization. This is definitely something we should address in v3 (or as a backward-compatible optional field in v2), but for now we have to deal with what we have.
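
To illustrate what the platform has to work with, here is the approximate shape of a v2 last_operation response as Go types (illustrative names, not the actual client library definitions); note the absence of any temporary-vs-permanent error field:

```go
// Approximate shape of an OSB v2 last_operation response (illustrative types).
package osb

type LastOperationState string

const (
	StateInProgress LastOperationState = "in progress"
	StateSucceeded  LastOperationState = "succeeded"
	StateFailed     LastOperationState = "failed"
)

type LastOperationResponse struct {
	State       LastOperationState `json:"state"`
	Description string             `json:"description,omitempty"`
	// No "retryable"/"permanent" categorization exists in v2; it could only
	// be added as an optional field in v2 or as part of a future v3.
}
```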

In 0.1.3 the Service Catalog treated any failure as permanent. There was general agreement that this was a flaw in the Service Catalog code, justified only by our focus on a beta release that supported just the "happy path" well. Since beta was released, we have invested a lot of effort into improving coverage of the "non-happy path"; see, for example, issues filtered by the non-happy-path label: https://github.com/kubernetes-incubator/service-catalog/issues?utf8=%E2%9C%93&q=label%3Anon-happy-path+

I hope that helps. Please do report any bugs you might find.

mszostok commented 6 years ago

@nilebox Thank you for your reply. It helps a lot!

Now we know that this behaviour is by design. My previous comment was based on version 0.1.12, but I can see that v0.1.17 is now the newest one, so we will validate the new flow one more time. If we find the same problems, we will look into the code and report an issue with the details.

PS. Sorry for the late response, but last week I was at a conference and on vacation.