Revisions not always fail when pods in crashloopbackoff

knative / serving

Kubernetes-based, scale-to-zero, request-driven compute

https://knative.dev/docs/serving/

Apache License 2.0

5.57k stars 1.16k forks source link

Revisions not always fail when pods in crashloopbackoff #10194

Closed kkwekkeboom closed 3 years ago

kkwekkeboom commented 3 years ago

In what area(s)?

/area autoscale

What version of Knative?

0.17.1

Expected Behavior

When creating a revision with a crashing user-container the revision should not become ready.

Actual Behavior

The revision sometimes becomes ready, sometimes not.

Steps to Reproduce the Problem

Use https://knative.dev/docs/serving/samples/hello-world/helloworld-python/ and edit app.py such that app.py has invalid python syntax. Deploy the service with

        autoscaling.knative.dev/maxScale: "1"
        autoscaling.knative.dev/minScale: "1"

8/10 revisions get ready, 2/10 RevisionMissing

My suspicion is that overly aggressive probing of the user container leads to the false conclusion the service is ready. In fact gunicorn is listening on port 8080 for a little moment before crashing

julz commented 3 years ago

Hi @kkwekkeboom what ReadinessProbe (if any) do you have set on your ksvc? Have you tried setting up a ReadinessProbe with a successThreshold greater than 0, and if so did/does this help?

kkwekkeboom commented 3 years ago

Using a readinessprobe does not make a difference

apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: helloworld-python8
  namespace: default
spec:
  template:
    metadata:
      annotations:
        autoscaling.knative.dev/maxScale: "1"
        autoscaling.knative.dev/minScale: "1"
    spec:
      containers:
        - image: dev.local/hello-world-vcrash
          imagePullPolicy: IfNotPresent
          env:
            - name: TARGET
              value: "Python Sample v1"
          readinessProbe:
            initialDelaySeconds: 10
            periodSeconds: 10
            timeoutSeconds: 10
            successThreshold: 3
            failureThreshold: 1

Upgrading to v0.19 does not make a difference

markusthoemmes commented 3 years ago

Our default readiness probe is fairly "dumb" in that it only asserts a TCP connection can be established, which you have found can cause races in cases like yours.

If you specify a readiness probe that actually checks the server responds correctly, you should get the desired behavior.

kkwekkeboom commented 3 years ago

Thanks

readinessProbe:
            httpGet:
              path: /status
            initialDelaySeconds: 0
            periodSeconds: 1
            timeoutSeconds: 10
            successThreshold: 1
            failureThreshold: 1

works quite well