knative / serving

Kubernetes-based, scale-to-zero, request-driven compute
https://knative.dev/docs/serving/
Apache License 2.0

State of a failing revision is not carried forward to the CR status and the logs #10267

Closed Shashankft9 closed 3 years ago

Shashankft9 commented 3 years ago

/area API

What version of Knative?

I have tested the behavior in v0.14.0 and v0.19.0

Expected Behavior

When a revision fails, for example when the pod goes into the Error state and requests get a 502 Bad Gateway response, that failure should be captured in the controller pod's logs (see https://github.com/knative/serving/blob/master/pkg/reconciler/configuration/configuration.go#L111) and carried forward into the status of the Revision, Configuration, and ksvc CRs (for example via https://github.com/knative/serving/blob/master/pkg/reconciler/configuration/configuration.go#L113).
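For reference, this is roughly the status I would expect to see on the Configuration (and carried forward to the Revision and ksvc) once the revision fails; the layout and message here are illustrative, the RevisionFailed reason is the one set by the reconciler code linked above:

```yaml
status:
  conditions:
    - type: Ready
      status: "False"
      reason: RevisionFailed
      message: 'Revision "failing-func-v1" failed with ...'
```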

Actual Behavior

I will distinguish the behavior in the two versions I have tested:

v0.14.0 - Here the logs and the status are captured correctly, but only once the queue-proxy container is ready (I am not sure this is the actual cause, but that is what I observe). When the ksvc instance is scaled down to zero and the first request comes in, even though the response is a 502 and the pod goes into Error, no log is emitted and the CR statuses never report RevisionFailed; the pod goes from 0/2 to 1/2 and then into Error. But after a few seconds, when the pod comes back up at 2/2 with a new user-container (note: the ksvc hasn't scaled down to zero) and I send a request, the pod goes into Error and both the logs and the CR statuses are captured correctly. So essentially, the failure is not reported the first time, i.e. until the pod has reached the 2/2 stage once.

v0.19.0 - Here even the above behavior is absent: I can't see the relevant log, and the status stays True for all the CRs at all times.

Steps to Reproduce the Problem

  1. A serving workload that deliberately fails when a request comes in. My workload has a problem with unmarshalling because the variable `result` has the wrong data type:
package main

import (
  "fmt"
  "log"
  "net/http"
  "os"
  "io/ioutil"
  "encoding/json"
  "github.com/golang/glog"
)

func handler(w http.ResponseWriter, r *http.Request) {
  log.Print("helloworld: received a request")
  target := os.Getenv("TARGET")
  if target == "" {
    target = "World"
  }
  fmt.Fprintf(w, "Hello %s!\n", target)

  var result string
  data, err := ioutil.ReadAll(r.Body)
  if err != nil {
    glog.Fatalf("Failed to read the request body: %v", err)
  }
  glog.Infof("Data sent with the request: %s", string(data))
  // result is a string, so unmarshalling a JSON object or array fails here,
  // and glog.Fatal kills the whole process.
  if err = json.Unmarshal(data, &result); err != nil {
    glog.Fatal(err)
  }
}

func main() {
  log.Print("helloworld: starting server...")

  http.HandleFunc("/", handler)

  port := os.Getenv("PORT")
  if port == "" {
    port = "8080"
  }

  log.Printf("helloworld: listening on port %s", port)
  log.Fatal(http.ListenAndServe(fmt.Sprintf(":%s", port), nil))
}
  2. Deploying it:
    apiVersion: serving.knative.dev/v1
    kind: Service
    metadata:
      name: failing-func
      namespace: serverless
    spec:
      template:
        metadata:
          name: failing-func-v1
          namespace: serverless
          annotations:
            autoscaling.knative.dev/window: "180s"
            autoscaling.knative.dev/metric: "rps"
            autoscaling.knative.dev/target: "100"
        spec:
          containers:
            - image: xxx
              imagePullPolicy: Always
              env:
                - name: TARGET
                  value: "Go Sample v1"
              ports:
                - containerPort: 8000
  3. Sending a request:
    curl http://xxx -H "Host: failing-func.serverless.xxx.com" -vvv
    * Rebuilt URL to: http://xxx/
    *   Trying xxx...
    * TCP_NODELAY set
    * Connected to xxx (xxx) port 80 (#0)
    > GET / HTTP/1.1
    > Host: failing-func.serverless.xxx.com
    > User-Agent: curl/7.61.1
    > Accept: */*
    >
    < HTTP/1.1 502 Bad Gateway
    < content-length: 4
    < content-type: text/plain; charset=utf-8
    < date: Fri, 04 Dec 2020 09:12:01 GMT
    < x-content-type-options: nosniff
    < x-envoy-upstream-service-time: 4769
    < server: istio-envoy
    <
    EOF
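For completeness, the CR statuses can be checked with something like the following (resource names as in the manifest above); on v0.19.0 the Ready condition stays True on all three CRs even after the pod has gone into Error:

```sh
kubectl get ksvc failing-func -n serverless \
  -o jsonpath='{.status.conditions[?(@.type=="Ready")].status}{"\n"}'
kubectl get configuration failing-func -n serverless \
  -o jsonpath='{.status.conditions[?(@.type=="Ready")].status}{"\n"}'
kubectl get revision failing-func-v1 -n serverless \
  -o jsonpath='{.status.conditions[?(@.type=="Ready")].status}{"\n"}'
```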
markusthoemmes commented 3 years ago

This is indeed a hole in our readiness state. A revision that was once considered Ready will currently never become unready.

Shashankft9 commented 3 years ago

Do you think this known hole also covers the behavior I noticed in v0.14.0, i.e. the status never reporting an unready state when the first request hits the scaled-to-zero ksvc? I can also try v0.18.0 to verify; I remember Dave wanted to test this on v0.18.0 here: https://gist.github.com/dprotaso/56c1fd920291eff29cce48f0501732a0

ref: https://knative.slack.com/archives/CA4DNJ9A4/p1606458112460300

Shashankft9 commented 3 years ago

Yep, same with v0.18.0, I just tested the flow. The status remains True.

mattmoor commented 3 years ago

I'm wondering how seriously we should take this, for a few reasons:

  1. log.Fatal as a failure mode seems extreme; even a panic in Go is generally handled more gracefully by the http server, IIRC. The term I have used for requests that kill the server is a "query of death".
  2. The default readiness check (what's used here) is our built-in tcp probe, which is clearly inadequate to establish this. If you add an http get probe that issues a malformed request, it would never become ready in the first place IIUC.
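For example, a sketch of what that could look like on the Service from the repro (the probe path is illustrative; note that with this particular workload even a plain GET hits the fatal unmarshal path):

```yaml
spec:
  template:
    spec:
      containers:
        - image: xxx
          readinessProbe:
            httpGet:
              path: /
```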

I think that there are two scenarios worth spelling out:

  1. Occasional "query of death", which we will have a very hard time protecting against.
  2. All "real" traffic is a "query of death", which we might be able to protect against post-facto.

I think that if we can manage 2., then bad changes rolled out via latestRevision: true (with or without the new rollout work @vagababov has been doing) could theoretically be rolled back, which would be a nice protection to offer folks, but we need to work out the details very carefully.
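As a rough sketch of what the rollback would amount to on the Service (the revision name here is hypothetical): stop following the latest revision and pin traffic back to the last known-good one:

```yaml
spec:
  traffic:
    - revisionName: failing-func-v0  # hypothetical known-good revision
      percent: 100
```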

mattmoor commented 3 years ago

cc @dprotaso

dprotaso commented 3 years ago

slack thread discussion: https://knative.slack.com/archives/CA4DNJ9A4/p1606458112460300

evankanderson commented 3 years ago

/area API

Is there more that needs to be done here? @dprotaso, can you put a priority/hint here?

evankanderson commented 3 years ago

/triage needs-user-input

Shashankft9 commented 3 years ago

Hey, I just noticed the label, is there any more info I could provide?

dprotaso commented 3 years ago

I think Evan was looking for input from me. I'll surface what I mentioned in Slack.

We can't automatically distinguish when a runtime failure should gate readiness of the revision. We need hints from the user via readiness/liveness probes.

Matt's scenario 2), in my opinion, should be handled by some higher-level continuous delivery tool. That would capture not only runtime errors (counting HTTP status codes) but also performance regressions (changes in request latency).
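(Purely to illustrate the kind of signal such a tool would act on, not anything Knative provides itself: a small middleware that tallies 5xx responses and request latency, which a CD system could compare across revisions before deciding to roll back.)

```go
package main

import (
	"log"
	"net/http"
	"sync/atomic"
	"time"
)

// statusRecorder captures the status code written by the wrapped handler.
type statusRecorder struct {
	http.ResponseWriter
	status int
}

func (r *statusRecorder) WriteHeader(code int) {
	r.status = code
	r.ResponseWriter.WriteHeader(code)
}

// watchedHandler tallies the signals a CD tool would typically evaluate
// before promoting or rolling back a revision: 5xx counts and latency.
type watchedHandler struct {
	next      http.Handler
	total     atomic.Int64
	errors    atomic.Int64
	latencyNs atomic.Int64
}

func (h *watchedHandler) ServeHTTP(w http.ResponseWriter, r *http.Request) {
	rec := &statusRecorder{ResponseWriter: w, status: http.StatusOK}
	start := time.Now()
	h.next.ServeHTTP(rec, r)

	h.total.Add(1)
	h.latencyNs.Add(int64(time.Since(start)))
	if rec.status >= 500 {
		h.errors.Add(1)
	}
	log.Printf("requests=%d errors=%d avg_latency=%s",
		h.total.Load(), h.errors.Load(),
		time.Duration(h.latencyNs.Load()/h.total.Load()))
}

func main() {
	hello := http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		w.Write([]byte("Hello\n"))
	})
	log.Fatal(http.ListenAndServe(":8080", &watchedHandler{next: hello}))
}
```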

I'm going to close this issue out and recommend looking at Spinnaker and other CD tools.

dprotaso commented 3 years ago

I made a follow-up issue to confirm that, if we have the right hints (readiness/liveness probes failing), we behave properly.

https://github.com/knative/serving/issues/11531