knative / serving

Kubernetes-based, scale-to-zero, request-driven compute
https://knative.dev/docs/serving/
Apache License 2.0
5.56k stars 1.16k forks source link

Not possible to define livenessProbes or readinessProbes on multi-container #12615

Closed IBMRob closed 2 years ago

IBMRob commented 2 years ago

What version of Knative?

0.9.x 0.10.x 0.11.x Output of git describe --dirty Red Hat OpenShift Serverless 1.20

Expected Behavior

Be able to create a multi-container service which has probes on both the containers such. Being unable to specify probes on the additional containers could result in the container returning ready when some of the containers are not fully started meaning requests would fail

Actual Behavior

Unable to define service as it returns admission webhook "validation.webhook.serving.knative.dev" denied the request: validation failed: must not set the field(s): spec.template.spec.containers[1].livenessProbe, spec.template.spec.containers[1].livenessProbe.failureThreshold, spec.template.spec.containers[1].livenessProbe.initialDelaySeconds, spec.template.spec.containers[1].livenessProbe.periodSeconds, spec.template.spec.containers[1].livenessProbe.successThreshold, spec.template.spec.containers[1].livenessProbe.timeoutSeconds, spec.template.spec.containers[1].readinessProbe, spec.template.spec.containers[1].readinessProbe.failureThreshold, spec.template.spec.containers[1].readinessProbe.initialDelaySeconds, spec.template.spec.containers[1].readinessProbe.periodSeconds, spec.template.spec.containers[1].readinessProbe.successThreshold,

Steps to Reproduce the Problem

Attempt to deploy the following service

apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: helloworld-go-2
spec:
  template:
    metadata:
      creationTimestamp: null
    spec:
      containerConcurrency: 0
      containers:
        - image: 'jimmysong/knative-samples-helloworld-go:latest'
          livenessProbe:
            initialDelaySeconds: 20
            periodSeconds: 10
            timeoutSeconds: 5
            failureThreshold: 1
            successThreshold: 1
            tcpSocket:
              port: 0
          name: user-container
          ports:
            - containerPort: 8080
              name: http1
              protocol: TCP
          readinessProbe:
            initialDelaySeconds: 2
            periodSeconds: 10
            timeoutSeconds: 5
            failureThreshold: 1
            successThreshold: 1
            tcpSocket:
              port: 0
          resources: {}
        - image: 'jimmysong/knative-samples-helloworld-go:latest'
          livenessProbe:
            initialDelaySeconds: 20
            periodSeconds: 10
            timeoutSeconds: 5
            failureThreshold: 1
            successThreshold: 1
            tcpSocket:
              port: 0
          name: user-container
          readinessProbe:
            initialDelaySeconds: 2
            periodSeconds: 10
            timeoutSeconds: 5
            failureThreshold: 1
            successThreshold: 1
            tcpSocket:
              port: 0
          resources: {}
      enableServiceLinks: false
      timeoutSeconds: 300
  traffic:
    - latestRevision: true
      percent: 100
skonto commented 2 years ago

Hey @IBMRob related discussion on this long standing issue here, this is can considered as duplicate. Some issues blocking this described here. @julz @dprotaso should we move forward with this (I can work on this)? Do we have enough test coverage to start refactoring this part? How about starting with relaxing validation for multi-containers and then continue with a design doc to capture semantics for readiness/liveness probes for all cases? This area needs improvement as we are hitting other issues like in https://github.com/knative/serving/issues/12571#issuecomment-1035648108.

As a temporary workaround you could try use the readiness/liveness probe of the user container in order to verify the status of the other containers and reply back accordingly (do local requests).

IBMRob commented 2 years ago

Hi @skonto - Yeah I am currently trying to update our ready checks in the primary container to also check the secondary container. Unfortunately run into another problem. Happy to close this pending the existing discussions.

github-actions[bot] commented 2 years ago

This issue is stale because it has been open for 90 days with no activity. It will automatically close after 30 more days of inactivity. Reopen the issue with /reopen. Mark the issue as fresh by adding the comment /remove-lifecycle stale.