kubernetes-retired / service-catalog

Consume services in Kubernetes using the Open Service Broker API
https://svc-cat.io
Apache License 2.0

Memory leak in Controller Manager when a malfunctioning broker is registered #2276

Closed: aszecowka closed this issue 5 years ago

aszecowka commented 6 years ago

Bug Report

What happened: The Controller Manager pod restarts constantly after I registered a malfunctioning broker (the broker responds with HTTP 500 on the "/v2/catalog" endpoint). Memory usage for the Controller Manager climbs to 30MB and then the pod is restarted, probably because of the configured memory limit. To magnify this problem, I created ~80 malfunctioning brokers, and within 35 minutes the Controller Manager was restarted 8 times.

What you expected to happen: The Controller Manager should not restart.

How to reproduce it (as minimally and precisely as possible): Install Service Catalog on minikube:

minikube start --extra-config=apiserver.authorization-mode=RBAC
helm init
helm repo add svc-cat https://svc-catalog-charts.storage.googleapis.com
helm install svc-cat/catalog --name catalog --namespace catalog

Write a malfunctioning broker:

package main

import (
    "log"
    "net/http"
)

func main() {
    http.HandleFunc("/v2/catalog", getCatalog)
    log.Fatal(http.ListenAndServe(":8080", nil))
}

func getCatalog(w http.ResponseWriter, r *http.Request) {
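    // Always respond with HTTP 500 to simulate a malfunctioning broker.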
    w.WriteHeader(http.StatusInternalServerError)
}

Create a Dockerfile:

FROM alpine:3.7
COPY ./servicebroker-linux /root/broker

ENTRYPOINT ["/root/broker"]

Build the broker image:

eval $(minikube docker-env)
GOOS=linux GOARCH=amd64 CGO_ENABLED=0 go build -o servicebroker-linux --ldflags="-s" .
docker build -t "nasty-app:1.0.0" .

Create deployment: kubectl create -f deploy.yaml

apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  name: nasty-deploy
spec:
  replicas: 1
  selector:
    matchLabels:
      app: nasty-app
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 0
  template:
    metadata:
      labels:
        app: nasty-app
    spec:
      containers:
      - name: broker
        image: "nasty-app:1.0.0"
        imagePullPolicy: "IfNotPresent"
        ports:
        - containerPort: 8080

Create service: kubectl create -f service.yaml

apiVersion: v1
kind: Service
metadata:
  name: nasty-service
spec:
  type: NodePort
  selector:
    app: nasty-app
  ports:
    - name: broker
      port: 80
      targetPort: 8080
      protocol: TCP

Create ClusterServiceBroker: kubectl create -f charts/broker.yaml

apiVersion: servicecatalog.k8s.io/v1beta1
kind: ClusterServiceBroker
metadata:
  name: nasty-broker
spec:
  url: http://nasty-service.default.svc.cluster.local

Broker status: kubectl describe clusterservicebroker nasty-broker

...
Status:
  Conditions:
    Last Transition Time:  2018-08-14T11:55:11Z
    Message:               Error fetching catalog.Error getting broker catalog: Status: 500; ErrorMessage: <nil>; Description: <nil>; ResponseError: unexpected end of JSON input
    Reason:                ErrorFetchingCatalog
    Status:                False
    Type:                  Ready

Here you can see that the pod is restarting: kubectl get pods -n catalog

NAME                                                  READY     STATUS             RESTARTS   AGE
catalog-catalog-apiserver-6bccbd5786-4grsg            2/2       Running            0          35m
catalog-catalog-controller-manager-794cc648b9-9wlrl   0/1       CrashLoopBackOff   8          35m
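
One way to watch the memory usage climb directly, assuming pod metrics are available in the cluster (e.g. via heapster/metrics-server on minikube):

kubectl top pod -n catalog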

Anything else we need to know?:

Environment:

piotrmiskiewicz commented 6 years ago

It seems the brokerClient is created on every call to the service broker; see controller_clusterservicebroker. The function is defined here. As I can see in the body, the function creates a new client on every call. It uses osb.NewClient, which creates a new Go http client every time. This is the cause of the problem. The http.Client should be shared.
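
To illustrate the difference, here is a minimal sketch with hypothetical names (not the actual service-catalog code): a fresh http.Client per call means a fresh Transport per call, each with its own pool of cached TCP connections, while a shared client reuses one pool.

package catalogclient

import "net/http"

// Anti-pattern (illustrative): a new http.Client per call creates a new
// Transport per call, each holding its own pool of cached TCP connections.
func fetchCatalogLeaky(url string) (*http.Response, error) {
    client := &http.Client{}
    return client.Get(url)
}

// Shared client (illustrative): the Transport and its connection pool are
// reused across calls, as the net/http documentation recommends.
var sharedClient = &http.Client{}

func fetchCatalogShared(url string) (*http.Response, error) {
    return sharedClient.Get(url)
}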

piotrmiskiewicz commented 6 years ago

A possible solution: create a map that holds an osbClient instance for every service broker. Instead of creating an osbClient on every call, just take it from the map. Thanks to that we can still have a separate configuration (Basic Auth or TLS) per service broker, with no need to create a new client every time. The same change must also be applied to all other operations, such as provisioning an instance, bind requests, etc.
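
A rough sketch of that idea (hypothetical names, not the real controller code): a mutex-guarded map from broker name to its client, populated lazily on first use.

package clientcache

import (
    "net/http"
    "sync"
)

// cache keeps one HTTP client per broker, so per-broker settings
// (Basic Auth, TLS) can differ while each client itself is reused.
type cache struct {
    mu      sync.Mutex
    clients map[string]*http.Client
}

func newCache() *cache {
    return &cache{clients: make(map[string]*http.Client)}
}

// get returns the broker's cached client, creating it on first use.
// A real implementation would also rebuild the client whenever the
// broker's auth or TLS configuration changes.
func (c *cache) get(brokerName string) *http.Client {
    c.mu.Lock()
    defer c.mu.Unlock()
    if client, ok := c.clients[brokerName]; ok {
        return client
    }
    client := &http.Client{}
    c.clients[brokerName] = client
    return client
}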

piotrmiskiewicz commented 6 years ago

I've created a POC of the service catalog changes. It is a quick fix and must not be merged. We have run it: memory usage is not growing, and there are no restarts. See:

https://github.com/piotrmiskiewicz/service-catalog/tree/shared-http-client-for-getcatalog

How the quick fix works: the getCatalog path reuses a single shared http.Client instead of creating a new one for each call (hence the branch name).

This branch must not be merged because it does not handle multiple TLS configs, and metrics won't work.

jboyd01 commented 6 years ago

It sounds to me like there is a slow but constant memory leak even when running without errors, but with the broker error we have a tight retry loop that constantly calls getCatalog() and exacerbates the leak. Do you agree? If so, then over enough time even a non-erroring deployment is going to grow in memory use and eventually exceed its memory allotment. I'll give this some further review.

jboyd01 commented 6 years ago

Looks to me like there are several issues we should fix here, perhaps more than these:

1. Review the interaction with the HTTP client and address the memory leak. It looks to me like we don't close the broker response in all places, which could be an issue. Perhaps we also need to reuse the connections, as you've indicated.
2. Our retry loop on getCatalog() is too tight; we constantly pound the broker with retries. There should be a better backoff (see the sketch below).
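
As a sketch of what a better backoff could look like (illustrative only; fetch stands in for the real getCatalog call, and the delays are made up):

package retry

import (
    "fmt"
    "time"
)

// fetchWithBackoff retries a failing catalog fetch with exponential
// backoff instead of a tight loop, capping the delay between attempts.
func fetchWithBackoff(fetch func() error, maxAttempts int) error {
    delay := time.Second
    const maxDelay = 5 * time.Minute
    var err error
    for attempt := 1; attempt <= maxAttempts; attempt++ {
        if err = fetch(); err == nil {
            return nil
        }
        time.Sleep(delay)
        delay *= 2
        if delay > maxDelay {
            delay = maxDelay
        }
    }
    return fmt.Errorf("catalog fetch failed after %d attempts: %v", maxAttempts, err)
}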

piotrmiskiewicz commented 6 years ago

The official net/http documentation contains the sentence: "The Client's Transport typically has internal state (cached TCP connections), so Clients should be reused instead of created as needed. Clients are safe for concurrent use by multiple goroutines." If you create one http client per minute, the GC should do the work and the application could run for weeks, but if you create a client every second, it is a problem. I'm looking at the http response processing and I do not see a response.Body.Close() call as described in the official example.
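
For reference, the pattern the docs describe looks roughly like this (an illustrative sketch, not the actual client code): close the body on all paths, even on non-2xx responses, and read it fully so the TCP connection can be returned to the pool for reuse.

package catalogfetch

import (
    "io/ioutil"
    "net/http"
)

// getCatalogBody closes the response body in all cases (deferred Close);
// reading the body to EOF lets the underlying connection be reused.
func getCatalogBody(client *http.Client, url string) ([]byte, error) {
    resp, err := client.Get(url)
    if err != nil {
        return nil, err
    }
    defer resp.Body.Close()
    return ioutil.ReadAll(resp.Body)
}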

jboyd01 commented 6 years ago

I agree about the lack of Body.Close(). Last week I spent some time cleaning that up and testing; it didn't remedy the OOM issue, but I agree the body should always be closed. I created https://github.com/pmorie/go-open-service-broker-client/pull/131 for it.