knative / serving

Kubernetes-based, scale-to-zero, request-driven compute
https://knative.dev/docs/serving/
Apache License 2.0

Revision stays in ContainerMissing condition forever after a temporary failure of digest resolution #15466

Open. maschmid opened this issue 3 months ago.

maschmid commented 3 months ago

/area reconciler

What version of Knative?

1.14

Expected Behavior

A temporary digest-resolution error sets the ContainerHealthy condition to False with reason ContainerMissing. Once digest resolution eventually succeeds, ContainerHealthy should return to True.

Actual Behavior

After a temporary digest-resolution error, the Revision stays in the following inconsistent state even after digest resolution eventually succeeds:

status:
  actualReplicas: 1
  conditions:
  - lastTransitionTime: "2024-08-12T22:30:16Z"
    severity: Info
    status: "True"
    type: Active
  - lastTransitionTime: "2024-08-12T22:28:04Z"
    message: 'Unable to fetch image "image-registry.openshift-image-registry.svc:5000/ocf-qe-images/receiverhttp":
      failed to resolve image to digest: GET https://image-registry.openshift-image-registry.svc:5000/openshift/token?scope=repository%3Aocf-qe-images%2Freceiverhttp%3Apull&service=:
      unexpected status code 401 Unauthorized'
    reason: ContainerMissing
    status: "False"
    type: ContainerHealthy
  - lastTransitionTime: "2024-08-12T22:28:04Z"
    message: 'Unable to fetch image "image-registry.openshift-image-registry.svc:5000/ocf-qe-images/receiverhttp":
      failed to resolve image to digest: GET https://image-registry.openshift-image-registry.svc:5000/openshift/token?scope=repository%3Aocf-qe-images%2Freceiverhttp%3Apull&service=:
      unexpected status code 401 Unauthorized'
    reason: ContainerMissing
    status: "False"
    type: Ready
  - lastTransitionTime: "2024-08-12T22:30:12Z"
    status: "True"
    type: ResourcesAvailable
  containerStatuses:
  - imageDigest: image-registry.openshift-image-registry.svc:5000/ocf-qe-images/receiverhttp@sha256:e915478407c5c882346c4fc72078007fd2511d9e1796345db1873facafddf836
    name: user-container
  desiredReplicas: 1
  observedGeneration: 1

Notice that containerStatuses shows the resolved image digest and the deployments are ready (ResourcesAvailable is True), yet ContainerHealthy and Ready are still False with the original digest-resolution error.
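One way to spot this state on other Revisions is to look for a resolved digest on every container combined with ContainerHealthy=False and reason ContainerMissing. The Go sketch below assumes the Revision's `status` object is available as JSON (for example from `kubectl get revision <name> -o json`). It uses hypothetical local types that mirror only the fields shown above. It is a diagnostic heuristic, not part of Knative.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Hypothetical local types mirroring only the status fields shown above.
type condition struct {
	Type   string `json:"type"`
	Status string `json:"status"`
	Reason string `json:"reason"`
}

type containerStatus struct {
	Name        string `json:"name"`
	ImageDigest string `json:"imageDigest"`
}

type revisionStatus struct {
	Conditions        []condition       `json:"conditions"`
	ContainerStatuses []containerStatus `json:"containerStatuses"`
}

// isStaleContainerMissing reports whether every container already has a
// resolved digest while ContainerHealthy is still False/ContainerMissing.
func isStaleContainerMissing(st revisionStatus) bool {
	if len(st.ContainerStatuses) == 0 {
		return false
	}
	for _, cs := range st.ContainerStatuses {
		if cs.ImageDigest == "" {
			return false // resolution genuinely still pending or failed
		}
	}
	for _, c := range st.Conditions {
		if c.Type == "ContainerHealthy" {
			return c.Status == "False" && c.Reason == "ContainerMissing"
		}
	}
	return false
}

func main() {
	// Trimmed from the status in this issue, converted to JSON.
	raw := `{
	  "conditions": [
	    {"type": "ContainerHealthy", "status": "False", "reason": "ContainerMissing"},
	    {"type": "ResourcesAvailable", "status": "True"}
	  ],
	  "containerStatuses": [
	    {"name": "user-container", "imageDigest": "image-registry.openshift-image-registry.svc:5000/ocf-qe-images/receiverhttp@sha256:e915478407c5c882346c4fc72078007fd2511d9e1796345db1873facafddf836"}
	  ]
	}`
	var st revisionStatus
	if err := json.Unmarshal([]byte(raw), &st); err != nil {
		panic(err)
	}
	fmt.Println(isStaleContainerMissing(st)) // true
}
```

The check returns false when any container lacks a digest, so a Revision whose resolution is still legitimately failing is not flagged.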

Steps to Reproduce the Problem

There is currently no reproducer; the problem was noticed during a long-running test.

knative-prow[bot] commented 3 months ago

@maschmid: The label(s) area/reconciler cannot be applied, because the repository doesn't have them.

ReToCode commented 3 months ago

cc @dprotaso @skonto

skonto commented 2 months ago

@dprotaso gentle ping. I tried to reproduce this locally but had no luck.

maschmid commented 2 months ago

https://github.com/knative/serving/issues/15487 may be a similar issue.

skonto commented 1 month ago

#15503 fixes this one too, correct, @maschmid?