knative / pkg

Knative common packages
Apache License 2.0

webhook EOF errors #1509

Open mattmoor opened 4 years ago

mattmoor commented 4 years ago

/area API /area test-and-release /kind bug

Expected Behavior

When we run our e2e tests with chaos there are no failures due to the webhook shutting down.

Actual Behavior

We intermittently see failures like this: Post https://eventing-webhook.knative-eventing-qh1fjbnng8.svc:443/resource-conversion?timeout=30s: EOF ever since we enabled webhook chaos.

mattmoor commented 4 years ago

cc @vaikas

mattmoor commented 4 years ago

I already made two changes to try and mitigate this:

  1. Bumped the terminationGracePeriodSeconds on our webhook pods to 300 to give the webhook itself greater control over the drain duration.
  2. Bumped network.DefaultDrainTimeout (https://github.com/knative/pkg/pull/1501) to give the API Server more time to observe the endpoints change.
mattmoor commented 4 years ago

From the Go http.Server Shutdown documentation:

When Shutdown is called, Serve, ListenAndServe, and ListenAndServeTLS immediately return ErrServerClosed. Make sure the program doesn't exit and waits instead for Shutdown to return.

I wondered if this might be our problem, but we call ListenAndServeTLS here: https://github.com/knative/pkg/blob/f1b82401dc8a5951d6034be93a3a2423556a3104/webhook/webhook.go#L207-L210 and we seem to properly check for ErrServerClosed.
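For reference, here's a minimal sketch of the pattern those docs describe (illustrative only; the certificate paths, address, and timeout are placeholders, not our actual webhook wiring):

package main

import (
    "context"
    "errors"
    "log"
    "net/http"
    "os"
    "os/signal"
    "syscall"
    "time"
)

func main() {
    srv := &http.Server{Addr: ":8443"}

    done := make(chan struct{})
    go func() {
        sig := make(chan os.Signal, 1)
        signal.Notify(sig, syscall.SIGTERM)
        <-sig

        // Shutdown only returns once in-flight requests have drained (or ctx expires).
        ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
        defer cancel()
        if err := srv.Shutdown(ctx); err != nil {
            log.Printf("Shutdown: %v", err)
        }
        close(done)
    }()

    // ErrServerClosed only means Shutdown was called, not that draining finished,
    // so the program must not exit until the goroutine above signals completion.
    if err := srv.ListenAndServeTLS("tls.crt", "tls.key"); !errors.Is(err, http.ErrServerClosed) {
        log.Fatalf("ListenAndServeTLS: %v", err)
    }
    <-done
}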

mattmoor commented 4 years ago

I started a thread in #sig-api-machinery here: https://kubernetes.slack.com/archives/C0EG7JC6T/p1594910493127400

mattmoor commented 4 years ago

Alright, so one of the things I have been wondering about is Keep-Alives, and whether they might be a reason the API server takes so long to realize an endpoint is no longer good. I noticed the following comment on Go's SetKeepAlivesEnabled:

SetKeepAlivesEnabled controls whether HTTP keep-alives are enabled. By default, keep-alives are always enabled. Only very resource-constrained environments or servers in the process of shutting down should disable them.

... emphasis mine.

Now it turns out that Go's http logic is already smart about disabling keep-alive when it is shutting down:

func (s *Server) doKeepAlives() bool {
    return atomic.LoadInt32(&s.disableKeepAlives) == 0 && !s.shuttingDown()
}

However, when we lame duck we aren't yet .shuttingDown() so I believe keep-alives will continue throughout our network.DefaultDrainTimeout until Shutdown() is actually called.

I believe testing this should be as simple as calling server.SetKeepAlivesEnabled(false) here: https://github.com/knative/pkg/blob/25f2aa6443803868e6562dca5e57626f2309302d/webhook/webhook.go#L221
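If it does help, the change roughly amounts to this sketch (lameDuck, drainTimeout, and sigTerm are illustrative names, not actual knative/pkg API):

package sketch

import (
    "context"
    "net/http"
    "time"
)

// lameDuck disables keep-alives as soon as we start draining: during the drain
// window the server is not yet shuttingDown(), so Go would otherwise keep reusing
// existing connections and the API server would take longer to notice the dying
// endpoint. Only after the drain window do we call Shutdown.
func lameDuck(ctx context.Context, server *http.Server, drainTimeout time.Duration, sigTerm <-chan struct{}) error {
    <-sigTerm
    server.SetKeepAlivesEnabled(false) // Go only does this automatically once Shutdown is called.
    select {
    case <-time.After(drainTimeout): // keep serving while the Endpoints change propagates.
    case <-ctx.Done():
    }
    return server.Shutdown(ctx)
}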

However, if this works then we should probably consider doing something similar across our various dataplane components as well. cc @tcnghia

mattmoor commented 4 years ago

We believe that @vaikas is still seeing some webhook failures on builds that are synced past the above change, so we aren't out of the woods yet.

mattmoor commented 4 years ago

The new "Drainer" handler in the linked PR uses a dynamic drain timeout where it waits for at least network.DefaultDrainTimeout after the last request before terminating the webhook. With that constant set to 45s, let's see if there are any failures... 🤞
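For illustration, a simplified sketch of that dynamic-drain behavior (just the shape of the idea, not the actual network.Drainer implementation):

package sketch

import (
    "net/http"
    "sync"
    "time"
)

// drainer resets a quiet-period timer on every request; Drain only returns once at
// least quietPeriod has elapsed since the last request was seen.
type drainer struct {
    mu          sync.Mutex
    inner       http.Handler
    quietPeriod time.Duration
    timer       *time.Timer
    drained     chan struct{}
    once        sync.Once
}

func newDrainer(inner http.Handler, quietPeriod time.Duration) *drainer {
    return &drainer{
        inner:       inner,
        quietPeriod: quietPeriod,
        drained:     make(chan struct{}),
    }
}

// ServeHTTP pushes out the drain deadline before delegating to the real handler.
func (d *drainer) ServeHTTP(w http.ResponseWriter, r *http.Request) {
    d.resetTimer()
    d.inner.ServeHTTP(w, r)
}

func (d *drainer) resetTimer() {
    d.mu.Lock()
    defer d.mu.Unlock()
    if d.timer != nil {
        d.timer.Stop()
    }
    d.timer = time.AfterFunc(d.quietPeriod, func() {
        d.once.Do(func() { close(d.drained) })
    })
}

// Drain blocks until a full quietPeriod passes with no traffic; only then does the
// caller proceed to http.Server.Shutdown.
func (d *drainer) Drain() {
    d.resetTimer()
    <-d.drained
}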

mattmoor commented 4 years ago

One of the two flakes in eventing yesterday was this:

TestChannelDataPlaneSuccess/InMemoryChannel-messaging.knative.dev/v1/full-event_encoding_binary_version_1.0: creation.go:108: Failed to create subscription "full-event-binary-10-sub": Internal error occurred: failed calling webhook "webhook.eventing.knative.dev": Post https://eventing-webhook.knative-eventing-sfqaktzdqa.svc:443/defaulting?timeout=2s: EOF

I need to track down whether that CI run had the drainer stuff in yet.

mattmoor commented 4 years ago

Hmm looks like it was at 239f7fc, which includes the drainer.

Link to the full run: https://prow.knative.dev/view/gcs/knative-prow/logs/ci-knative-eventing-continuous/1284652424807583744

vaikas commented 4 years ago

Tomorrow I'm going to add a workaround for this by using retry loops, just like we do when pod creates (or configmap operations) fail with very specific error cases. Hopefully this workaround means tests won't fail when they hit this condition. Reconcilers should be resilient to this, but tests that use create*OrFail will not be, so this will at least cut down some noise. I'll make sure to log these so at least we can see how often this happens.
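Something along the lines of this sketch (illustrative only, not the actual eventing test helper; the error matching is an assumption to be tuned to what we actually observe):

package sketch

import (
    "log"
    "strings"
    "time"

    "k8s.io/apimachinery/pkg/util/wait"
)

// createOrRetry retries a create call, but only when it fails with the transient
// webhook errors from this issue, and logs each retry so we can count occurrences.
func createOrRetry(create func() error) error {
    return wait.PollImmediate(time.Second, 30*time.Second, func() (bool, error) {
        err := create()
        if err == nil {
            return true, nil
        }
        if isTransientWebhookError(err) {
            log.Printf("Retrying after transient webhook error: %v", err)
            return false, nil // keep polling
        }
        return false, err // any other error fails immediately
    })
}

// isTransientWebhookError matches the failure modes seen in this issue; the string
// checks are assumptions, not an exhaustive list.
func isTransientWebhookError(err error) bool {
    msg := err.Error()
    return strings.Contains(msg, ": EOF") ||
        strings.Contains(msg, "failed calling webhook")
}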

markusthoemmes commented 4 years ago

Still happening in serving too I believe: https://prow.knative.dev/view/gcs/knative-prow/logs/ci-knative-serving-continuous/1285454712115564545

mattmoor commented 4 years ago

It definitely is. Open to suggestions on how we might keep chipping away at this / instrument / debug / ...

mattmoor commented 4 years ago

I had an idea chatting with @markusthoemmes a bit on slack.

Right now there's so much machinery standing between us and the EOFs that they're harder to debug than they should be. The thought is: can we have a more dedicated probing process that we can use to reproduce this (maybe more consistently)?

markusthoemmes commented 4 years ago

I think we should work on getting a small reproducer of this. Create a small webhook, create a ton of requests and churn through them quickly, or something.

mattmoor commented 4 years ago

So I created a tiny little Go binary that uses vegeta to load the Serving resource-validation webhook, and found something interesting that may be related.

If I have vegeta send 1 request every 20ms for 3 minutes, then everything completes successfully, and there's an average latency of 18ms (typically). This remains true even when run under chaos.

If I have vegeta send 1 request every 10ms for 3 minutes, then latency climbs (pretty quickly), and there's an average latency of ~17.5 seconds (the peak is >30s). I have been playing with values to push this latency even higher and trigger an upstream EOF, but haven't managed to yet.

Here's the simple little load-generator:

package main

import (
    "bytes"
    "encoding/json"
    "log"
    "net/http"
    "time"

    vegeta "github.com/tsenart/vegeta/lib"
    admissionv1 "k8s.io/api/admission/v1"
    corev1 "k8s.io/api/core/v1"
    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    "k8s.io/apimachinery/pkg/runtime"
    "knative.dev/pkg/ptr"
    v1 "knative.dev/serving/pkg/apis/serving/v1"
)

func main() {
    ksvc := &v1.Service{
        TypeMeta: metav1.TypeMeta{
            APIVersion: "serving.knative.dev/v1",
            Kind:       "Service",
        },
        ObjectMeta: metav1.ObjectMeta{
            Namespace: "default",
            Name:      "no-matter",
        },
        Spec: v1.ServiceSpec{
            ConfigurationSpec: v1.ConfigurationSpec{
                Template: v1.RevisionTemplateSpec{
                    Spec: v1.RevisionSpec{
                        PodSpec: corev1.PodSpec{
                            Containers: []corev1.Container{{
                                Image: "busybox",
                            }},
                        },
                    },
                },
            },
            RouteSpec: v1.RouteSpec{
                Traffic: []v1.TrafficTarget{{
                    LatestRevision: ptr.Bool(true),
                    Percent:        ptr.Int64(100),
                }},
            },
        },
    }

    gvk := v1.SchemeGroupVersion.WithKind("Service")
    review := admissionv1.AdmissionReview{
        Request: &admissionv1.AdmissionRequest{
            Operation: admissionv1.Create,
            Kind: metav1.GroupVersionKind{
                Group:   gvk.Group,
                Version: gvk.Version,
                Kind:    gvk.Kind,
            },
            Resource: metav1.GroupVersionResource{
                Group:    gvk.Group,
                Version:  gvk.Version,
                Resource: "services",
            },
            Object: runtime.RawExtension{
                Object: ksvc,
            },
        },
    }

    body := new(bytes.Buffer)
    if err := json.NewEncoder(body).Encode(&review); err != nil {
        log.Fatalf("Error serializing admission request: %v", err)
    }

    rate := vegeta.Rate{Freq: 1, Per: 10 * time.Millisecond} // 10 fails, 20 succeeds.
    targeter := vegeta.NewStaticTargeter(vegeta.Target{
        Method: http.MethodPost,
        URL:    "https://webhook.knative-serving.svc.cluster.local/resource-validation",
        Body:   body.Bytes(),
        Header: http.Header{
            "Content-Type": []string{"application/json"},
        },
    })
    // Responses should be fast, but we set a high timeout to see
    // if we can provoke a server-side termination.
    attacker := vegeta.NewAttacker(vegeta.Timeout(3*time.Minute), vegeta.KeepAlive(false))

    log.Print("Starting the attack.")
    success, error := 0, 0
    var totalLatency time.Duration
    for result := range attacker.Attack(targeter, rate, 3*time.Minute, "webhook-probe") {
        if result.Code != http.StatusOK {
            log.Printf("FAILURE[%v: %d: %s] %s", result.Timestamp, result.Code, result.Error, string(result.Body))
            error++
        } else {
            totalLatency += result.Latency
            log.Printf("SUCCESS[%v] %v", result.Timestamp, result.Latency)
            success++
        }
    }
    var avgLatency int64
    if success > 0 {
        // Guard against dividing by zero if every request failed.
        avgLatency = totalLatency.Milliseconds() / int64(success)
    }

    if error > 0 {
        log.Fatalf("Completed with %d/%d errors (average latency: %dms).", error, error+success, avgLatency)
    } else {
        log.Printf("Completed %d requests without errors (average latency: %dms).", success, avgLatency)
    }
}

Here's the simple job.yaml to launch it:

apiVersion: batch/v1
kind: Job
metadata:
  name: webhook-prober
  namespace: default
spec:
  parallelism: 1
  backoffLimit: 0
  template:
    metadata:
      labels:
        app: webhook-prober
    spec:
      containers:
      - name: webhook-prober
        # Point me to where the above is saved!
        image: ko://knative.dev/serving/test/test_images/webhook-prober
        resources:
          requests:
            cpu: 100m
      restartPolicy: Never

cc @vagababov @dprotaso

mattmoor commented 4 years ago

Interestingly, without chaos I see a higher average/max latency (28s / 59s respectively), probably in part because under chaos the pod shutdown is graceful and we briefly get an additional pod of capacity 🤔

vagababov commented 4 years ago

/assign

I'll look next week to see what it is.

mattmoor commented 4 years ago

At 1 request per 5ms for 10 minutes:

2020/07/26 17:39:02 FAILURE[2020-07-26 17:38:55.523011525 +0000 UTC m=+59.225520149: 0: Post "https://webhook.knative-serving.svc.cluster.local/resource-validation": EOF]

Not sure if this is what we're seeing, but it's the best lead I've seen 😅

mattmoor commented 4 years ago

That's with chaos running btw.

Also cc @tcnghia as he was chasing webhook timeouts at one point, and this finding is related 😉

dprotaso commented 3 years ago
Operation: admissionv1.Create,

FYI - this triggers PodSpec dry-run which is enabled by default in serving

dprotaso commented 3 years ago

Do you see timeouts?

2020/07/29 03:30:25 FAILURE[2020-07-29 03:29:54.078907339 +0000 UTC m=+168.179666060: 0: Post "https://webhook.knative-serving.svc.cluster.local/resource-validation": dial tcp 0.0.0.0:0->10.59.245.84:443: connect: connection timed out]
dprotaso commented 3 years ago

Some data from me

1) PodSpec dry-run disabled:
   2020/07/29 03:30:25 Completed with 2832/36000 errors (average latency: 11433ms).
   2020/07/29 03:37:56 Completed with 3029/36000 errors (average latency: 12085ms).

2) PodSpec dry-run enabled:
   2020/07/29 03:20:16 Completed with 29582/36000 errors (average latency: 87396ms).
   2020/07/29 03:09:51 Completed with 28229/36000 errors (average latency: 78468ms).

I bumped the webhook memory to 1Gi and allocated 1 CPU (1000m). I also disabled readiness/liveness probes since failures there would cause restarts

vagababov commented 3 years ago

The profile I had mostly shows usage in the TLS client, so I guess that is in line with what you see. Maybe we need to cache those connections.

dprotaso commented 3 years ago

To avoid timeouts completely I needed to bump my CPU limit even higher - memory was flat

[Screenshot: Screen Shot 2020-07-29 at 9 20 04 AM]

The result: 2020/07/29 13:09:41 Completed 36000 requests without errors (average latency: 12ms).

Turning on PodSpec dry-run resulted in memory climbing and eventually the webhook getting OOMKilled. Then I started to see EOFs for the first time.

[Screenshot: Screen Shot 2020-07-29 at 9 21 35 AM]

2m23s       Warning   OOMKilling   node/gke-dave-knative-default-pool-0f1da796-39vh   Memory cgroup out of memory: Kill process 1340269 (webhook) score 1997 or sacrifice child
Killed process 1340269 (webhook) total-vm:1717176kB, anon-rss:1020068kB, file-rss:30936kB, shmem-rss:0kB
54s         Warning   OOMKilling   node/gke-dave-knative-default-pool-0f1da796-39vh   Memory cgroup out of memory: Kill process 1353620 (webhook) score 1997 or sacrifice child
Killed process 1353620 (webhook) total-vm:1717176kB, anon-rss:1020264kB, file-rss:30040kB, shmem-rss:0kB
dprotaso commented 3 years ago

I don't think this explains eventing's use case though - I'll change the runner to do conversions

markusthoemmes commented 3 years ago

It doesn't explain Serving's either, does it? Otherwise we'd see these errors without the chaos too, I'd think.

dprotaso commented 3 years ago

I was referring to eventing not performing PodSpec dry-runs

Otherwise we'd see these errors without the chaos too, I'd think.

What do you mean?

I'm actually just testing with a single webhook instance.

vagababov commented 3 years ago

So I replaced the prober to probe Routes rather than Services.

Probing Routes: Completed 18000 requests without errors (average latency: 12ms).
Probing Services: Completed 6000 requests without errors (average latency: 28112ms), with the last requests taking about 58s to complete.

I am getting more and more convinced that it is the dry-run validation client's rate limiting.

/cc @whaught

whaught commented 3 years ago

If dry-run is implicated but not well understood, we should move the feature to Disabled. With Beta features the standard should be to roll-back first and ask questions later.

vagababov commented 3 years ago

Logging the time it takes to call err = c.client.Post().Namespace(c.ns).Resource("pods").VersionedParams(&opts, scheme.ParameterCodec).Body(pod).Do().Into(result) definitely shows linear growth in how long the call takes. Route, which doesn't require this call, is pretty consistently constant even at higher loads.

vagababov commented 3 years ago

The other thing is that 100QPS requires CPU~600m (56X top shows) even for simple requests.

evankanderson commented 3 years ago

If dry-run is implicated but not well understood, we should move the feature to Disabled. With Beta features the standard should be to roll-back first and ask questions later.

As an additional point, since this is behind a feature flag, it should be very easy to flip the flag to disabled and see if it changes the shape of the curve and either confirm or rule out the feature as a culprit. (The culprit still could be in the scaffolding code.)

markusthoemmes commented 3 years ago

While this seems like an issue in itself, I'm still not seeing how it's related to the chaos duck, unless the maximum latency surpasses 300s, which is the time the webhook is allowed to take before it is forcibly killed. The drain time shouldn't matter, as that's only a timer during which we keep accepting new connections. Existing connections should be just fine and handled by the Shutdown call :thinking:.

That's also what I was referring to above @dprotaso. My main question is: Why is this a problem under chaos but not in a system without chaos?

vagababov commented 3 years ago

It might not be related to the chaosduck per se. But with this fix out we can re-try the chaos duck.

dprotaso commented 3 years ago

Webhook timeouts have happened in serving prior to chaos, and we bumped the request limits: https://github.com/knative/serving/pull/8048

I think the eventing webhook is even more under-provisioned. I'll make a PR

dprotaso commented 3 years ago

To clarify I don't disagree with

Why is this a problem under chaos but not in a system without chaos?

Have we seen EOF flakes in serving?

vagababov commented 3 years ago

I have under load.

mattmoor commented 3 years ago

Have we seen EOF flakes in serving?

Yes, which is why we disabled the chaosduck on the webhook.

While this seems like an issue in itself, I'm still not seeing how it's related to the chaos duck, unless the maximum latency surpasses 300s, which is the time the webhook is allowed to take before harshly killed

Hmm, you are right. I was attempting to drive the latency up to 45s (network.DefaultDrainTimeout), but for my theory to hold water it should have taken until 300s, and the queued calls should have drained gracefully via the shutdown... 🤔

dprotaso commented 3 years ago

I've got a few runs where I see the EOF in eventing here: https://github.com/knative/eventing/pull/3758 if anyone cares to take a look.

I didn't see anything in the webhook logs that would suggest why the EOF happened. It'd be nice to have API server logs

dprotaso commented 3 years ago

I'll poke at manually hitting the eventing webhook

dprotaso commented 3 years ago

Running things manually, the eventing webhook seems to fall over due to liveness/readiness probe latency being larger than the timeout (similar to serving).

When I disabled the probes it ran fine - this is with the limits raised to 500m & 500Mi.

[Screenshot: Screen Shot 2020-08-04 at 8 51 43 PM]

2020/08/05 00:46:58 Completed 18000 requests without errors (average latency: 10842ms).

github-actions[bot] commented 3 years ago

This issue is stale because it has been open for 90 days with no activity. It will automatically close after 30 more days of inactivity. Reopen the issue with /reopen. Mark the issue as fresh by adding the comment /remove-lifecycle stale.

mattmoor commented 3 years ago

/lifecycle frozen

dprotaso commented 3 years ago

/unassign @vagababov

dprotaso commented 3 years ago

Interestingly, one way to reproduce this consistently is to panic in the webhook when handling a request (i.e. in the defaulting logic).

Go's http server recovers these panics and logs an error, but the caller's request still fails abruptly.

We should potentially recover ourselves so we can return an 'internal server error' type response.
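Something like this hedged sketch (not existing knative/pkg code):

package sketch

import (
    "log"
    "net/http"
    "runtime/debug"
)

// withRecovery recovers panics from the wrapped admission handler and answers with
// a 500, instead of letting net/http's built-in recovery log the panic and abort
// the response, which the caller sees as an abrupt failure.
func withRecovery(inner http.Handler) http.Handler {
    return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
        defer func() {
            if p := recover(); p != nil {
                log.Printf("panic serving %s: %v\n%s", r.URL.Path, p, debug.Stack())
                http.Error(w, "internal server error", http.StatusInternalServerError)
            }
        }()
        inner.ServeHTTP(w, r)
    })
}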

dprotaso commented 3 years ago

As part of https://github.com/knative/serving/issues/11225 I encountered EOFs / context deadline exceeded errors. After adding some tracing I've seen our webhooks respond in <10ms, but then the API server still returns a timeout.

So I don't think this is isolated to just our webhooks.

My next step is to start testing with a non-managed k8s service to be able to get API server logs.