kubernetes-sigs / gateway-api

Repository for the next iteration of composite service (e.g. Ingress) and load balancing APIs.
https://gateway-api.sigs.k8s.io
Apache License 2.0
1.69k stars 446 forks

Create a simple k8s job that can install or upgrade Gateway API CRDs #2678

Closed robscott closed 1 month ago

robscott commented 7 months ago

What would you like to be added: We could create a simple Kubernetes Job that implementations bundle to install Gateway API CRDs if they don't already exist. This Job would have the following configuration:

This would need to have the following logic for each Gateway API CRD:

  1. If the Gateway API CRD exists:
     a. Skip or error if the existing CRD is from a different release channel or does not have the expected bundle version or release channel labels
     b. Upgrade to the configured bundle version if the existing CRD has an older version
     c. Skip if the existing CRD's version is >= the version configured by the Job
  2. If the Gateway API CRD does not exist in the cluster, install it.
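The per-CRD decision logic above could be sketched roughly as follows (this is an illustration, not an agreed design: the `decide` helper and its inputs are hypothetical, and a real Job would read the existing version and channel from the `gateway.networking.k8s.io/bundle-version` and `gateway.networking.k8s.io/channel` annotations that Gateway API releases apply to their CRDs):

```shell
# is_older OLD NEW: returns 0 if OLD sorts strictly before NEW
# using version-aware sort.
is_older() {
    [ "$1" != "$2" ] &&
        [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n1)" = "$1" ]
}

# decide EXISTING_VERSION EXISTING_CHANNEL DESIRED_VERSION DESIRED_CHANNEL
# Echoes the action the Job would take for a single CRD.
# Pass an empty EXISTING_VERSION if the CRD is absent from the cluster.
decide() {
    existing_version=$1
    existing_channel=$2
    desired_version=$3
    desired_channel=$4

    if [ -z "$existing_version" ]; then
        echo install   # CRD absent: install it
    elif [ "$existing_channel" != "$desired_channel" ]; then
        echo error     # different channel (or missing labels): skip/error
    elif is_older "$existing_version" "$desired_version"; then
        echo upgrade   # older bundle version: upgrade to configured version
    else
        echo skip      # same or newer version: leave it alone
    fi
}
```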

All of this could theoretically be built with the registry.k8s.io/kubectl image.
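As a rough sketch of what such a Job might look like (the Job name, ServiceAccount, and image tag here are all assumptions, not an agreed design, and a real Job would implement the version/channel checks above rather than a blind apply):

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: gateway-api-crd-install        # hypothetical name
spec:
  backoffLimit: 2
  template:
    spec:
      restartPolicy: Never
      # Would need a ServiceAccount bound to RBAC allowing
      # get/create/patch on customresourcedefinitions.
      serviceAccountName: gateway-api-crd-installer
      containers:
        - name: install-crds
          image: registry.k8s.io/kubectl:v1.30.0   # tag is an assumption
          command: ["kubectl", "apply", "-f",
            "https://github.com/kubernetes-sigs/gateway-api/releases/download/v1.0.0/standard-install.yaml"]
```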

Why this is needed: Many implementations want an easy way to bundle CRDs with their installation, but they also don't want to conflict with other installations of Gateway API in the cluster. This could provide a reasonably safe mechanism to ensure that CRDs are present and at a minimum version. It could also be bundled in a Helm chart (https://github.com/kubernetes-sigs/gateway-api/issues/1590) to bypass some of the limitations of including CRDs directly in a Helm chart.

Note: This is not ready to work on yet. We first need to get some feedback on this idea to ensure that it actually makes sense before starting any development.

danehans commented 7 months ago

Since the CRDs are shared resources, what safeguards does this approach provide to ensure the Job does not cause breakage among different implementations? For instance, implementation A runs the Job to install version X of the CRDs and later implementation B runs the Job to install version Y of the CRDs. If the schema changes between X and Y versions, a conversion will need to take place, correct?

robscott commented 7 months ago

You're completely right, @danehans. To make this safe, we'd need to establish some guardrails that could be fairly limiting. I think the only way to provide safe installation and upgrades would be to limit this to installing newer versions of CRDs included in the standard channel. If an experimental CRD were present, an upgrade could result in a breaking change.

I think the MVP for this would need to be limited to the standard channel, since it provides strong backwards-compatibility guarantees.

In the future, we'd probably want to extend this to experimental, but that would require more advanced logic, including:

k8s-triage-robot commented 4 months ago

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

You can:

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

networkhermit commented 3 months ago

Taken from my comment in https://github.com/kubernetes-sigs/gateway-api/pull/2951#issuecomment-2043491641

I'm not sure installing the Gateway API CRDs via a Job is possible before the CNI is ready, and that's exactly the situation when bootstrapping Cilium to use its Gateway API support. I'm testing different implementations to better learn Gateway API.

robscott commented 3 months ago

I'd always assumed that Cilium's Envoy-based Gateway API implementation was deployed separately from CNI, @sayboras can you confirm if this approach would be problematic for Cilium?

sayboras commented 3 months ago

> I'd always assumed that Cilium's Envoy-based Gateway API implementation was deployed separately from CNI

Yes, you are correct. The Gateway API provisioning part is part of Cilium Operator, which is separated from Cilium Agent or Cilium CNI components.

> Can you confirm if this approach would be problematic for Cilium?

I don't think there will be any problem due to the reasons mentioned above.

networkhermit commented 3 months ago

@sayboras Hello!

https://github.com/cilium/cilium/blob/d913b6298123064f51a8b97495f956b5ebbe62b7/install/kubernetes/cilium/templates/cilium-gateway-api-class.yaml#L1-L11

When users use the Helm chart to bootstrap the Cilium CNI with gatewayAPI.enabled in a new cluster, is the default GatewayClass cilium the only missing resource if the Gateway API CRDs were not installed beforehand?

I currently use a multi-step installation process:

  1. install cilium with gatewayAPI support disabled
  2. use fluxcd to install the gateway api crds
  3. update cilium helm values to enable gatewayAPI, completing the gateway api support

Is it equivalent to the following approach?

  1. install cilium with gatewayAPI support enabled in the first run
  2. use fluxcd to install the gateway api crds and the cilium GatewayClass
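(For context, the Helm toggle in question is just a values fragment like the following, assuming the current Cilium chart layout:)

```yaml
# Cilium Helm values fragment; gatewayAPI.enabled is the chart's flag
# for turning on Gateway API support in the operator.
gatewayAPI:
  enabled: true
```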

> I'd always assumed that Cilium's Envoy-based Gateway API implementation was deployed separately from CNI

> Yes, you are correct. The Gateway API provisioning part is part of Cilium Operator, which is separated from Cilium Agent or Cilium CNI components.

> Can you confirm if this approach would be problematic for Cilium? I don't think there will be any problem due to the reasons mentioned above.

@robscott More specifically, does this mean that in the future the Cilium Helm installation method would embed the Gateway API CRD bootstrap/upgrade Job?

sayboras commented 3 months ago

> Is it equivalent to the following approach?

Not really equivalent; however, once https://github.com/cilium/cilium/issues/29207 is done, the installation process will be easier (though you might still need to provision the Cilium GatewayClass outside of the Helm chart).
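For anyone provisioning it by hand, the GatewayClass from the template linked earlier amounts to roughly this manifest (the apiVersion may differ depending on the Gateway API version installed; apply it only after the CRDs exist):

```yaml
apiVersion: gateway.networking.k8s.io/v1
kind: GatewayClass
metadata:
  name: cilium
spec:
  controllerName: io.cilium/gateway-controller
```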

networkhermit commented 3 months ago

> Is it equivalent to the following approach?
>
> Not really equivalent; however, once cilium/cilium#29207 is done, the installation process will be easier (though you might still need to provision the Cilium GatewayClass outside of the Helm chart).

I see. If we use a Kubernetes Job (as discussed in this issue) to install the Gateway API CRDs, then once https://github.com/cilium/cilium/issues/29207 is done, the Job and the Cilium Helm bootstrap can basically be started in parallel and will both eventually complete, without leaving the Job in a pending state. Is my understanding correct?

sayboras commented 3 months ago

> I see. If we use a Kubernetes Job (as discussed in this issue) to install the Gateway API CRDs, then once https://github.com/cilium/cilium/issues/29207 is done, the Job and the Cilium Helm bootstrap can basically be started in parallel and will both eventually complete, without leaving the Job in a pending state. Is my understanding correct?

The Gateway API provisioning is part of Cilium Operator, which is separated from Cilium Agent or Cilium CNI components. So any pod will be scheduled regardless of Gateway API CRD installation. The work mentioned in https://github.com/cilium/cilium/issues/29207 is to improve user experience and avoid manual Cilium Operator restart.

networkhermit commented 3 months ago

> I see. If we use a Kubernetes Job (as discussed in this issue) to install the Gateway API CRDs, then once cilium/cilium#29207 is done, the Job and the Cilium Helm bootstrap can basically be started in parallel and will both eventually complete, without leaving the Job in a pending state. Is my understanding correct?

> The Gateway API provisioning is part of Cilium Operator, which is separated from Cilium Agent or Cilium CNI components. So any pod will be scheduled regardless of Gateway API CRD installation. The work mentioned in cilium/cilium#29207 is to improve user experience and avoid manual Cilium Operator restart.

Thanks for the above and previous clarification.

k8s-triage-robot commented 2 months ago

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

You can:

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

k8s-triage-robot commented 1 month ago

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

You can:

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

k8s-ci-robot commented 1 month ago

@k8s-triage-robot: Closing this issue, marking it as "Not Planned".

In response to [this](https://github.com/kubernetes-sigs/gateway-api/issues/2678#issuecomment-2156271041):

> The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.
>
> This bot triages issues according to the following rules:
> - After 90d of inactivity, `lifecycle/stale` is applied
> - After 30d of inactivity since `lifecycle/stale` was applied, `lifecycle/rotten` is applied
> - After 30d of inactivity since `lifecycle/rotten` was applied, the issue is closed
>
> You can:
> - Reopen this issue with `/reopen`
> - Mark this issue as fresh with `/remove-lifecycle rotten`
> - Offer to help out with [Issue Triage](https://www.kubernetes.dev/docs/guide/issue-triage/)
>
> Please send feedback to sig-contributor-experience at [kubernetes/community](https://github.com/kubernetes/community).
>
> /close not-planned

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes-sigs/prow](https://github.com/kubernetes-sigs/prow/issues/new?title=Prow%20issue:) repository.