knative / serving

Kubernetes-based, scale-to-zero, request-driven compute
https://knative.dev/docs/serving/
Apache License 2.0

Backups with velero have issues #10926

Open dprotaso opened 3 years ago

dprotaso commented 3 years ago

What version of Knative?

0.21.0

Expected Behavior

Backing up and restoring a Knative service with https://velero.io/ should succeed - or at a minimum have clearer error messages.

Actual Behavior

Restore fails with confusing messages

time="2021-03-09T11:11:38Z" level=info msg="error restoring helloworld-python: admission webhook \"validation.webhook.serving.knative.dev\" denied the request: validation failed: missing field(s): metadata.labels.serving.knative.dev/service" 

In fact the labels are present, but they don't match the owner references, because Velero strips owner references on restore. The error message should be clearer. Another instance of this is the "X does not own Y" error.

We can fix the label error message pretty easily. But we should figure out whether we're doing things that break other backup tools, and decide whether there's a mitigation to be had.
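
The mismatch can be sketched as follows. This is a hypothetical illustration, not Knative's actual webhook code: the `validate_revision` helper and the restored-object dict are made up for the example. It shows why the "missing field" message is misleading when the label is present but the controller owner reference it is validated against has been stripped.

```python
# Hypothetical sketch (NOT Knative's actual webhook code) of why the
# "missing field" message is misleading: the label exists, but validation
# compares it against ownerReferences, which Velero strips on restore.

def validate_revision(metadata):
    """Return validation errors for a restored object's metadata."""
    errors = []
    label = metadata.get("labels", {}).get("serving.knative.dev/service")
    owner = next(
        (ref for ref in metadata.get("ownerReferences", [])
         if ref.get("controller")),
        None,
    )
    # With no controller owner reference there is nothing to match the
    # label against, so the check fails even though the label is present.
    if owner is None or label != owner.get("name"):
        errors.append(
            "missing field(s): metadata.labels.serving.knative.dev/service")
    return errors

# What Velero hands back: labels kept, ownerReferences stripped.
restored = {
    "labels": {"serving.knative.dev/service": "helloworld-python"},
    "ownerReferences": [],
}
print(validate_revision(restored))
```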

Steps to Reproduce the Problem

senthilnathan commented 3 years ago

Steps to Reproduce the Problem

  1. Deploy a simple helloworld-python ksvc and make sure it runs fine
  2. Take velero backup
    velero backup create test1
  3. Delete helloworld-python ksvc
    kubectl delete ksvc helloworld-python
  4. Restore velero backup
    velero restore create --from-backup test1
  5. Verify that the helloworld-python ksvc is running
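
Step 1 assumes a minimal ksvc; for example (name, namespace, and image follow the Knative `helloworld-python` sample and are illustrative):

```yaml
apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: helloworld-python
  namespace: test-ns-1
spec:
  template:
    spec:
      containers:
        - image: gcr.io/knative-samples/helloworld-python  # sample image
          env:
            - name: TARGET
              value: "World"
```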

I observed the following errors in the velero restore log (Step 4).

time="2021-03-09T11:11:38Z" level=info msg="error restoring helloworld-python: admission webhook \"validation.webhook.serving.knative.dev\" denied the request: validation failed: missing field(s): metadata.labels.serving.knative.dev/service" logSource="pkg/restore/restore.go:1170" restore=velero/test1-20210309164041
time="2021-03-09T11:11:57Z" level=info msg="error restoring helloworld-python-00001: admission webhook \"validation.webhook.serving.knative.dev\" denied the request: validation failed: missing field(s): metadata.labels.serving.knative.dev/configuration" logSource="pkg/restore/restore.go:1170" restore=velero/test1-20210309164041
time="2021-03-09T11:12:01Z" level=info msg="error restoring helloworld-python: admission webhook \"validation.webhook.serving.knative.dev\" denied the request: validation failed: missing field(s): metadata.labels.serving.knative.dev/service" logSource="pkg/restore/restore.go:1170" restore=velero/test1-20210309164041

After this, the ksvc went to 'RevisionMissing' status.

$ kubectl describe ksvc helloworld-python
.....
Message:                     Revision "helloworld-python-00001" failed with message: There is an existing PodAutoscaler "helloworld-python-00001" that we do not own.
$ kubectl describe revision helloworld-python-00001
.....
Events:
  Type     Reason         Age                  From                 Message
  ----     ------         ----                 ----                 -------
  Warning  InternalError  3m54s (x9 over 90m)  revision-controller  revision: "helloworld-python-00001" does not own PodAutoscaler: "helloworld-python-00001"
$ kubectl describe podautoscaler helloworld-python-00001
......
Events:
  Type     Reason         Age                     From                      Message
  ----     ------         ----                    ----                      -------
  Warning  InternalError  6m10s (x55 over 7h24m)  podautoscaler-controller  error reconciling Metric: PA: helloworld-python-00001 does not own Metric: helloworld-python-00001
evankanderson commented 3 years ago

Figuring out how to put everything back (ala Velero) on a clean cluster may require some way to put the validation webhooks into a "paused" state and then fix things afterwards.

Probably needs a feature track design proposal, given how much defaulting and human-behavior steering we've baked in.

/kind feature-request /size XL

evankanderson commented 3 years ago

/triage accepted /remove-kind bug

/area API

(This is a reasonable request, but it's possible no one will get to it quickly)

senthilnathan commented 3 years ago

One more observation: if we back up only the top-level ksvc, the restore succeeds, since the Knative controller recreates the child resources. The corresponding flag for the 'velero backup' command is '--include-resources service.serving.knative.dev'. However, this flag won't help when we have a mixture of ksvcs and plain Kubernetes services/deployments to back up.

senthilnathan commented 2 years ago

Same/related issue filed in the Velero project: https://github.com/vmware-tanzu/velero/issues/2547

senthilnathan commented 2 years ago

Is there any reason why we don't allow the resources to adopt orphaned child resources? Kubernetes does this for native resources: e.g. a Deployment adopts its ReplicaSets.

senthilnathan commented 2 years ago

$ kubectl tree service.serving.knative.dev helloworld-python
NAMESPACE  NAME                                  READY  REASON           AGE
test-ns-1  Service/helloworld-python             False  RevisionMissing  3h31m
test-ns-1  ├─Configuration/helloworld-python     False  RevisionFailed   3h31m
test-ns-1  │ └─Revision/helloworld-python-00001  False  NotOwned         3h31m
test-ns-1  └─Route/helloworld-python             False  RevisionMissing  3h31m

After restore, the ownership is maintained all the way down to revision. The only missing ownership is the revision adopting the deployment.

senthilnathan commented 2 years ago

Please review the design proposal for this feature: https://docs.google.com/document/d/1xYok0UEKCJPgrl21Cr9QWSsuFq0g6IIWwxudMS8ed48/edit?usp=sharing

dprotaso commented 4 months ago

Is there any reason why we don't allow the resources to adopt the orphaned child resources? It seems Kubernetes does this for the native resources

We specifically have these integrity checks to avoid conflicts with other user-created resources. E.g. we will not take ownership of resources we do not control, because that could break users' environments and trigger an outage.
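
The integrity check behind the "X does not own Y" errors can be sketched like this. This is a hypothetical illustration, not the actual Knative reconciler code; `owns` and `reconcile_child` are invented names, and the dicts stand in for real Kubernetes objects. The point is that a controller only treats a child as its own if the child's controller owner reference points back at it, and refuses to adopt otherwise.

```python
# Hypothetical sketch (NOT the actual Knative reconciler) of the ownership
# integrity check: a controller only touches a child whose controller
# ownerReference points back at it, and surfaces an error otherwise.

def owns(parent, child):
    """True if `child` has a controller ownerReference matching `parent`."""
    for ref in child["metadata"].get("ownerReferences", []):
        if ref.get("controller") and ref.get("uid") == parent["metadata"]["uid"]:
            return True
    return False

def reconcile_child(parent, child):
    if not owns(parent, child):
        # Refuse to adopt: the child may belong to a user-created resource,
        # and taking it over could break a live environment.
        return (f'{parent["kind"].lower()}: "{parent["metadata"]["name"]}" '
                f'does not own {child["kind"]}: "{child["metadata"]["name"]}"')
    return "ok"

revision = {"kind": "Revision",
            "metadata": {"name": "helloworld-python-00001", "uid": "uid-new"}}
# After a Velero restore the PodAutoscaler exists, but its ownerReferences
# were stripped, so the check fails and reconciliation errors out.
pa = {"kind": "PodAutoscaler",
      "metadata": {"name": "helloworld-python-00001", "ownerReferences": []}}
print(reconcile_child(revision, pa))
```

The printed message has the same shape as the `does not own` events in the restore logs above.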

Ideally Velero would restore owner references - I created a k/k issue a long time ago to make this easier for them - https://github.com/kubernetes/kubernetes/issues/102810

dprotaso commented 4 months ago

The corresponding flag to be used with the 'velero backup' command is '--include-resources service.serving.knative.dev'.

This will only work to restore the 'latest' revision. If you have a Knative Service with traffic splitting it will be broken.

dprotaso commented 3 months ago

I was fooling around with a backup poc tool that does owner reference restoration.

https://github.com/dprotaso/knative-backup-poc

I noticed that when restoring the Configuration the generation gets reset, and the Knative controllers don't expect that, so there's some minor work we can do to improve that.
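
The core idea of owner-reference restoration can be sketched as follows. This is a hypothetical illustration, not the POC tool's actual code; `rebuild_owner_ref` and the UID values are made up. The key detail is that an object's UID changes across backup/restore, so the backed-up owner references are useless as-is and must be re-pointed at the parent's post-restore UID.

```python
# Hypothetical sketch (NOT the actual POC code) of owner-reference
# restoration: after a restore, re-point each child's controller
# ownerReference at its parent's *new* UID, which changes across restore.

def rebuild_owner_ref(parent, child):
    """Attach a controller ownerReference to `child` pointing at `parent`."""
    child["metadata"]["ownerReferences"] = [{
        "apiVersion": parent["apiVersion"],
        "kind": parent["kind"],
        "name": parent["metadata"]["name"],
        "uid": parent["metadata"]["uid"],  # must be the post-restore UID
        "controller": True,
        "blockOwnerDeletion": True,
    }]
    return child

configuration = {"apiVersion": "serving.knative.dev/v1",
                 "kind": "Configuration",
                 "metadata": {"name": "helloworld-python",
                              "uid": "uid-after-restore"}}
revision = {"kind": "Revision",
            "metadata": {"name": "helloworld-python-00001",
                         "ownerReferences": []}}  # stripped by the restore
fixed = rebuild_owner_ref(configuration, revision)
print(fixed["metadata"]["ownerReferences"][0]["uid"])
```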

/assign @dprotaso