openshiftio / openshift.io

Red Hat OpenShift.io is an end-to-end development environment for planning, building and deploying modern applications.
https://openshift.io

A vicious Idler feedback loop #2401

Open hferentschik opened 6 years ago

hferentschik commented 6 years ago

Period: 23/02/2018
System: OpenShift Online
Affected: All OpenShift.io users

Story

As part of the Jenkins Proxy/Idler roll-out ([1],[2]), we decided on the 22nd of February to put Proxy/Idler into production. We deployed the latest versions in production and then enabled the Idler using the Unleash feature toggle. The Idler started to do its job, and within a few minutes we went from 700 running Jenkins instances to around 150. One thing we noticed almost immediately was that there were repeating idle requests for a few namespaces. We created an issue [3] to investigate this and to make sure that the Idler would back off after a specified number of idle attempts. We also discussed the issue on Mattermost with various people and agreed that someone needed to investigate why the affected instances would not idle. No further action was taken at this stage.

By the 23rd of February, the OpenShift cluster started to misbehave. Initial symptoms were Jenkins instances not starting due to persistent volume mounting issues. This was investigated and, at first, no connection to the Idler was made. In the end, the cluster became more and more unusable, with cluster nodes dropping their network stacks.

Eventually, the Idler was determined to be the source of the network issues. It was stuck in the already-recognized feedback loop, repeatedly trying to idle a few Jenkins pods.

At this stage, the OpenShift cluster was restarted and the Idler feature was disabled in Unleash.

Takeaways

Don't underestimate potential "small" issues

When working at scale, small problems like seemingly harmless repeated calls to a service can have a big impact. Thinking about how the application can be a good citizen is essential. Interaction with services needs to be protected: concepts like bulkheads, circuit breakers and rate limiting need to be taken into account for every interaction with another service.
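As an illustration only (not the actual Proxy/Idler code), a minimal Go sketch of guarding outbound calls with a client-side rate limiter could look like the following; the limits and the placeholder call are assumptions:

```go
// Minimal sketch: protect calls to another service with a client-side
// rate limiter so a bug cannot flood the cluster with requests.
package main

import (
	"context"
	"errors"
	"fmt"
	"time"

	"golang.org/x/time/rate"
)

// limiter allows at most 5 requests per second with a burst of 10
// (illustrative values, not the real Idler configuration).
var limiter = rate.NewLimiter(rate.Limit(5), 10)

// guardedCall drops the request when the rate budget is exhausted,
// instead of queueing up an unbounded number of retries.
func guardedCall(ctx context.Context, call func(context.Context) error) error {
	if !limiter.Allow() {
		return errors.New("rate limit exceeded, dropping request")
	}
	return call(ctx)
}

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), time.Second)
	defer cancel()

	err := guardedCall(ctx, func(ctx context.Context) error {
		fmt.Println("idle request sent") // placeholder for the real API call
		return nil
	})
	if err != nil {
		fmt.Println("skipped:", err)
	}
}
```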

Make clear action items when issues occur

As mentioned, the issue was discovered almost immediately. We decided to investigate the broken deployment configuration; however, no one really took ownership of this. Once the issue was identified, clear action items should have been created specifying who follows up on what.

Improve design of Idler

The way the Idler works at the moment is prone to these vicious feedback loops. The Idler watches OpenShift for build and deployment config changes. These events feed into the algorithm that decides whether a Jenkins instance should be idled. At the same time, the Idler changes the Jenkins deployment config to actually idle the instance. This means the actual idle "request" triggers another change event which is in turn processed by the Idler. The obvious solution is to introduce rate limiting for idle/un-idle requests. However, we should take further steps as well. Watching all OpenShift events creates a lot of traffic; we are really only interested in a specific subset of events and, where possible, should filter on just those. See also [4].
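As a sketch of what such rate limiting could look like (the cooldown value and names are illustrative and not taken from the Idler code base), a per-namespace cooldown would ensure that a change event triggered by our own idle call cannot immediately cause another idle request:

```go
// Minimal sketch of per-namespace throttling of idle requests.
package main

import (
	"fmt"
	"sync"
	"time"
)

type idleThrottle struct {
	mu       sync.Mutex
	lastIdle map[string]time.Time // namespace -> time of last idle request
	cooldown time.Duration
}

func newIdleThrottle(cooldown time.Duration) *idleThrottle {
	return &idleThrottle{lastIdle: map[string]time.Time{}, cooldown: cooldown}
}

// allow reports whether an idle request for the namespace may be sent now.
func (t *idleThrottle) allow(namespace string) bool {
	t.mu.Lock()
	defer t.mu.Unlock()
	if last, ok := t.lastIdle[namespace]; ok && time.Since(last) < t.cooldown {
		return false
	}
	t.lastIdle[namespace] = time.Now()
	return true
}

func main() {
	throttle := newIdleThrottle(10 * time.Minute)
	for _, ns := range []string{"user1-jenkins", "user1-jenkins"} {
		if throttle.allow(ns) {
			fmt.Println("idling", ns)
		} else {
			fmt.Println("skipping", ns, "(within cooldown)")
		}
	}
}
```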

Also, the Idler could be more careful when processing deployment config change events. It could determine that the Jenkins instance is already marked for idling (based on its annotations) and hence not issue another idle request at all.
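A rough sketch of that check, assuming the idled state is visible via the idling.alpha.openshift.io/idled-at annotation that `oc idle` sets (if the Idler uses a different marker, the same idea applies):

```go
// Minimal sketch: skip the idle request when the resource is already
// annotated as idled, breaking the event -> idle -> event loop.
package main

import "fmt"

const idledAtAnnotation = "idling.alpha.openshift.io/idled-at"

// alreadyIdled checks the deployment config's annotations before issuing
// another idle request.
func alreadyIdled(annotations map[string]string) bool {
	_, ok := annotations[idledAtAnnotation]
	return ok
}

func main() {
	dcAnnotations := map[string]string{
		idledAtAnnotation: "2018-02-22T10:00:00Z",
	}
	if alreadyIdled(dcAnnotations) {
		fmt.Println("instance already marked for idling, skipping idle request")
	} else {
		fmt.Println("issuing idle request")
	}
}
```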

External systems

Even though the Idler flooded the cluster with requests, it seems odd that a single rogue service can bring down the whole cluster. I would expect OpenShift to have some way of dealing with this type of situation as well. Also, the monitoring of OpenShift.io needs improvement. It should be possible to pinpoint the potential source of a problem faster. For example, I can imagine a monitor for each service measuring basic key metrics (CPU load, network traffic, etc.). If these metrics exceed a given threshold, an alarm should be raised.
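As a rough sketch of such a threshold-based alarm (the window, threshold and simulated traffic are made up for illustration; in practice this would live in the monitoring stack rather than in the service itself):

```go
// Minimal sketch: count requests per time window and raise an alarm
// when the rate crosses a threshold.
package main

import (
	"fmt"
	"sync/atomic"
	"time"
)

const (
	window       = 5 * time.Second
	maxPerWindow = 50 // ~10 req/s; illustrative threshold
)

func main() {
	var count int64

	// Simulated request traffic (~20 req/s).
	go func() {
		for {
			atomic.AddInt64(&count, 1)
			time.Sleep(50 * time.Millisecond)
		}
	}()

	// Check the rate once per window and raise an alarm on breach.
	ticker := time.NewTicker(window)
	defer ticker.Stop()
	for range ticker.C {
		n := atomic.SwapInt64(&count, 0)
		if n > maxPerWindow {
			fmt.Printf("ALERT: %d requests in the last %s (limit %d)\n", n, window, maxPerWindow)
		}
	}
}
```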

[1] https://github.com/fabric8-services/fabric8-jenkins-proxy/projects/1
[2] https://github.com/fabric8-services/fabric8-jenkins-idler/projects/1
[3] https://github.com/fabric8-services/fabric8-jenkins-idler/issues/133
[4] https://github.com/fabric8-services/fabric8-jenkins-idler/issues/130

hferentschik commented 6 years ago

To quantify the number of requests made:

~90 req/s GET requests on deploymentconfigs
~60 req/s PATCH requests on endpoints

See also https://github.com/fabric8-services/fabric8-jenkins-idler/issues/133#issuecomment-368135408