knative / serving

Kubernetes-based, scale-to-zero, request-driven compute
https://knative.dev/docs/serving/
Apache License 2.0

Installing knative-serving yamls second time without deleting the knative-serving namespace doesn't populate the webhook certs. #12980

Open rachitchauhan43 opened 2 years ago

rachitchauhan43 commented 2 years ago

What version of Knative?

1.4.0

Expected Behavior

This is what I am doing, and what I expect to happen:

  1. Create a new namespace.
  2. Run kubectl apply -f on serving-crds.yaml, serving-core.yaml, and net-istio.yaml.
  3. Everything works fine.
  4. Run kubectl delete -f on serving-crds.yaml, serving-core.yaml, and net-istio.yaml.
  5. Once everything is cleaned up, run kubectl apply -f again on these three YAMLs.
  6. Everything should work fine.

Actual Behavior

  1. Create a new namespace.
  2. Run kubectl apply -f on serving-crds.yaml, serving-core.yaml, and net-istio.yaml.
  3. Everything works fine.
  4. Run kubectl delete -f on serving-crds.yaml, serving-core.yaml, and net-istio.yaml.
  5. Once everything is cleaned up, run kubectl apply -f again on these three YAMLs.
  6. This time, the webhooks won't run because their certs are not populated.

Unless the namespace is deleted completely, re-applying knative-serving a second time fails because the webhook certs are not populated.

Steps to Reproduce the Problem

  1. Create a new namespace.
  2. Run kubectl apply -f on serving-crds.yaml, serving-core.yaml, and net-istio.yaml.
  3. Everything works fine.
  4. Run kubectl delete -f on serving-crds.yaml, serving-core.yaml, and net-istio.yaml.
  5. Once everything is cleaned up, run kubectl apply -f again on these three YAMLs.
  6. This time, the webhooks won't run because their certs are not populated.
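
The steps above can be written as a small script. This is only a sketch: it assumes the three manifests sit in the current directory under the file names used in this report, and that the namespace is created separately (step 1) so the delete step leaves it in place. The function names install_serving and uninstall_serving are hypothetical, and deleting in reverse order is a choice made here, not something the report specifies.

```shell
#!/bin/sh
# Manifest file names as given in this issue (assumed to be local files).
MANIFESTS="serving-crds.yaml serving-core.yaml net-istio.yaml"

# Steps 2 and 5: apply every manifest in the listed order.
install_serving() {
  for f in $MANIFESTS; do
    kubectl apply -f "$f" || return 1
  done
}

# Step 4: delete every manifest, last-applied first.
uninstall_serving() {
  reversed=""
  for f in $MANIFESTS; do
    reversed="$f $reversed"
  done
  for f in $reversed; do
    kubectl delete -f "$f" --ignore-not-found || return 1
  done
}
```

Calling install_serving, then uninstall_serving, then install_serving again corresponds to steps 2, 4, and 5.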

psschwei commented 2 years ago

Is there a reason why you're unable to delete the namespace?

rachitchauhan43 commented 2 years ago

@psschwei: At our org, k8s is a managed service run by a central team. It is possible to delete the namespace and re-create it, but that makes the whole process tedious, as I have to step outside the kustomize framework and use their own CLI/tools to do so.

psschwei commented 2 years ago

Just to add a little more detail here: when the serving-core.yaml file is applied, it initially creates an empty secret for the webhook certs. Then, as part of the reconciliation loop, the certs are populated into the secret once the leader election lease is acquired.

In the situation described in this issue (installing, deleting everything but the namespace, and then reinstalling), it looks like the leader lease is never acquired. As a result, the certs never get populated into the secret, hence the failures being seen.

I would need to dig into it a bit more to determine whether leader election failing in this scenario is expected or a bug...
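
One way to look at the lease state described above is to list each Lease together with its current holder. This is a sketch, not a confirmed diagnostic: the helper name list_lease_holders is hypothetical, and the default namespace knative-serving is taken from this issue.

```shell
#!/bin/sh
# Print every Lease in the namespace alongside the identity that holds it.
# The holder column can then be compared with the pods that are actually running.
list_lease_holders() {
  ns="${1:-knative-serving}"
  kubectl get lease -n "$ns" \
    -o custom-columns=NAME:.metadata.name,HOLDER:.spec.holderIdentity
}
```

Run as list_lease_holders, or pass another namespace as the first argument.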

rachitchauhan43 commented 2 years ago

@psschwei: Can this issue be triaged for the next release? Or do we know whether this is the expected behavior?

tshak commented 2 years ago

We just ran into what is probably the same issue on Serving 1.3.2 and Operator 1.5.3 hosted in Azure (AKS). We had to perform a cluster certificate rotation. Afterwards, all of the Knative Serving pods were in CrashLoopBackOff due to invalid certificates. We tried deleting all of the -certs secrets. They were recreated, but with metadata only. We waited for more than 5 minutes, which should be long enough for any leader-election-related issue to resolve. Deleting the namespace was the only workaround that we could find.
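
The "metadata only" state can be checked with a short loop over the secrets whose names end in -certs. This is a sketch under assumptions: the helper name empty_cert_secrets is hypothetical, and an empty .data field is treated here as meaning the certs were never populated.

```shell
#!/bin/sh
# Print the names of secrets ending in "-certs" that carry no data.
empty_cert_secrets() {
  ns="${1:-knative-serving}"
  kubectl get secret -n "$ns" -o name | grep -e '-certs$' | while read -r s; do
    # jsonpath prints nothing when the secret has no .data field.
    data="$(kubectl get "$s" -n "$ns" -o jsonpath='{.data}')"
    if [ -z "$data" ]; then
      echo "$s"
    fi
  done
}
```

An empty result would mean every matching secret has been populated.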

github-actions[bot] commented 1 year ago

This issue is stale because it has been open for 90 days with no activity. It will automatically close after 30 more days of inactivity. Reopen the issue with /reopen. Mark the issue as fresh by adding the comment /remove-lifecycle stale.

knative-prow-robot commented 1 year ago

This issue or pull request is stale because it has been open for 90 days with no activity.

/lifecycle stale

antiClocke commented 1 year ago

"it looks like the leader lease is never acquired": I found this as well. The workaround is to delete the webhook leases:

  kubectl get lease -n knative-serving | grep webhook | awk '{print $1}' | xargs kubectl delete lease -n knative-serving

After that, it looks like the lease can be acquired again. But why does this happen, @psschwei?
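
The same workaround can be wrapped in a function with a dry-run switch, so the matching leases can be reviewed before anything is deleted. This is a sketch of the commenter's approach, not a confirmed fix: the name delete_webhook_leases and the DRY_RUN variable are hypothetical.

```shell
#!/bin/sh
# Delete every Lease in the namespace whose name contains "webhook".
# Set DRY_RUN=1 to print the matches without deleting them.
delete_webhook_leases() {
  ns="${1:-knative-serving}"
  kubectl get lease -n "$ns" -o name | grep webhook | while read -r lease; do
    if [ "${DRY_RUN:-0}" = "1" ]; then
      echo "would delete $lease"
    else
      kubectl delete "$lease" -n "$ns"
    fi
  done
}
```

Running DRY_RUN=1 delete_webhook_leases first shows what a real run would remove.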

antiClocke commented 1 year ago

/reopen