Open FredrikAugust opened 1 month ago
Hey @FredrikAugust, thanks for raising this! This is definitely a thing we want to do, though I don't have a timeline at this point. We'll keep this issue updated.
Thanks @kflynn. We're in a bit of limbo right now: Traefik and Argo Rollouts (thanks to this plugin) have both moved to the stable channel, but we can't upgrade until linkerd2 supports it, so it would be great to see this in a release. If there's anything I can do to help, let me know.
@FredrikAugust I believe you should be able to upgrade Linkerd as long as you provide the Gateway CRDs yourself. When installing Linkerd, the CRDs chart supports a flag to omit managing the gateway resources: https://github.com/linkerd/linkerd2/blob/2c1c266582567d6693467f36e3199cf929d400d5/charts/linkerd-crds/values.yaml#L1
Note that there may be some complexity in migrating these resources to no longer be Helm-managed; but if you are able to install the Linkerd CRDs without the Gateway resources, and you are able to provide the gateway resources externally, Linkerd should be able to read v1alpha2 (etc) resource versions when the cluster has newer v1 versions.
I'd recommend trying all of this first in a non-production environment, as CRD changes can be risky.
@olix0r Thanks, that's more or less what we've been doing. We're running a custom build of the argo rollouts plugin, but didn't want to upgrade to the latest version (using v1 GRPCRoute) before getting some confirmation that linkerd2 would support it. As mentioned, Traefik and the plugin both use the stable channel.
Linkerd seems to look for a specific version of the CRDs in the linkerd-destination/policy container. This is what I got when I tried with Gateway API 1.1:
2024-09-23T08:56:29.231903Z WARN kube_client::client: Unsuccessful data error parse: 404 page not found
2024-09-23T08:56:29.231915Z DEBUG kube_client::client: Unsuccessful: ErrorResponse { status: "404 Not Found", message: "\"404 page not found\\n\"", reason: "Failed to parse error data", code: 404 } (reconstruct)
2024-09-23T08:56:29.232703Z DEBUG tower::buffer::worker: buffer closing; waking pending tasks
thread 'main' panicked at policy-controller/src/main.rs:466:10:
Failed to list API group resources: Api(ErrorResponse { status: "404 Not Found", message: "\"404 page not found\\n\"", reason: "Failed to parse error data", code: 404 })
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
This is on 24.8.2, at least.
This probably indicates that the 1.1 gateway release has stopped shipping the older versions of the CRD that Linkerd reads. Linkerd hasn't upgraded because Google Cloud continues to ship 0.7.0 in some versions. (And they still don't yet ship 1.1 anywhere.)
GKE 1.24 to 1.27.10, 1.28.4, 1.29.0: 0.7.0
GKE 1.27.10 and later, 1.28.4 and later, 1.29.0 to 1.29.2: 0.8.1
GKE 1.29.3-gke.1282001, 1.30.0-gke.1000000 and later: 1.0.0
This version skew in the Gateway API--specifically the break in backwards compatibility--is obviously unfortunate; but it looks like we'll have to drop support for these clusters sooner rather than later.
Actually, on further inspection, it looks like the CRDs you link to include HTTPRoute v1beta1 and GRPCRoute v1alpha2, so I would expect Linkerd to be able to start properly. We'll do some more digging to identify the source of the incompatibility.
This appears to be a bug in the Gateway API CRD:
:; kubectl get crd grpcroutes.gateway.networking.k8s.io -o json | jq -r '.spec.versions[] | .name + " served=" + (.served | tostring)'
v1 served=true
v1alpha2 served=false
While the v1alpha2 CRD is provided in the 1.1 spec, it is configured to not be served by the API server.
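The same check can be sketched in Python against the CRD's JSON, for cases where jq is not at hand. The sample document below is a hand-written stand-in trimmed to the fields the command above reads; it is not a copy of the real 1.1 manifest, and the function name is illustrative.

```python
import json

# Hand-written stand-in for the output of
# `kubectl get crd grpcroutes.gateway.networking.k8s.io -o json`,
# trimmed to the fields inspected here (an assumption, not the full manifest).
SAMPLE_CRD = json.loads("""
{
  "spec": {
    "versions": [
      {"name": "v1", "served": true},
      {"name": "v1alpha2", "served": false}
    ]
  }
}
""")


def served_versions(crd: dict) -> dict[str, bool]:
    """Map each version name in a CRD to its `served` flag."""
    return {v["name"]: bool(v.get("served", False)) for v in crd["spec"]["versions"]}


if __name__ == "__main__":
    # Print one line per version, in the same shape as the jq output above.
    for name, served in served_versions(SAMPLE_CRD).items():
        print(f"{name} served={str(served).lower()}")
```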
As a workaround, it is probably suitable to change the value of served to "true" in the 1.1 API spec.
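One way to express that workaround is as a JSON Patch (RFC 6902) against the CRD. The sketch below only builds the patch document; it assumes the CRD has the shape shown in the jq output, the helper name is hypothetical, and nothing here has been tested against a live cluster.

```python
import json


def served_patch(crd: dict, version: str, served: bool = True) -> list[dict]:
    """Build a JSON Patch that sets `served` on one version of a CRD."""
    for index, entry in enumerate(crd["spec"]["versions"]):
        if entry["name"] == version:
            # JSON Patch addresses list items by index, so the path depends
            # on the order of versions in the object being patched.
            return [{
                "op": "replace",
                "path": f"/spec/versions/{index}/served",
                "value": served,
            }]
    raise KeyError(f"version {version!r} not found in CRD")


if __name__ == "__main__":
    # Trimmed, hand-written CRD shape (assumption): only the fields the patch needs.
    crd = {"spec": {"versions": [{"name": "v1", "served": True},
                                  {"name": "v1alpha2", "served": False}]}}
    print(json.dumps(served_patch(crd, "v1alpha2")))
```

The printed document could in principle be passed to `kubectl patch crd grpcroutes.gateway.networking.k8s.io --type=json -p '<patch>'`. Because the path is index-based, it should be generated from the live object, and re-applying the upstream CRD manifest would presumably revert the change.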
Ah, so it appears that the v1alpha2 is only served on the experimental channel of the Gateway API.
The release notes for 1.1 call this out:
If you are already using the experimental version GRPCRoute, we recommend holding off on upgrading to the standard channel version of GRPCRoute until the controllers you're using have been updated to support GRPCRoute v1. Until then, it is safe to upgrade to the experimental channel version of GRPCRoute in v1.1 that includes both v1alpha2 and v1 API versions.
We will not be able to upgrade to support the standard channel (i.e. to read v1) until we can expect a majority of GCP clusters to have the 1.1 CRDs. And given that they don't ship 1.1 at all yet, this is probably not going to be soon.
We should probably update our documentation to call this out explicitly.
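The compatibility rule discussed in the last few comments can be sketched as a small check: a controller can only read a resource if the API version it asks for is served by the installed CRDs. The channel contents below are taken from this thread for GRPCRoute in Gateway API 1.1 (standard serves only v1; experimental serves v1 and v1alpha2). The names are illustrative, and the set of versions Linkerd reads is an assumption based on the comments above.

```python
# GRPCRoute versions served per Gateway API 1.1 channel, as described in this thread.
GRPCROUTE_SERVED = {
    "standard": {"v1"},
    "experimental": {"v1", "v1alpha2"},
}


def missing_versions(required: set[str], served: set[str]) -> set[str]:
    """Return the versions a controller reads that the API server does not serve."""
    return required - served


if __name__ == "__main__":
    # Assumption from the thread: Linkerd (as of 24.8.2) reads GRPCRoute v1alpha2.
    linkerd_reads = {"v1alpha2"}
    for channel, served in GRPCROUTE_SERVED.items():
        gap = missing_versions(linkerd_reads, served)
        status = "ok" if not gap else f"missing {sorted(gap)}"
        print(f"{channel}: {status}")
```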
Should this be filed as a bug in Gateway API?
"Bug" was probably a little premature. I think the situation is the following:
This is an unfortunate situation, but I do believe that it is effectively working as intended.
@FredrikAugust are you patching the "served" thing you mentioned above to work around this, or doing something else?
@genebean Hey, we are currently not patching the served CRD, but rather using the CRDs from linkerd-crds and traefik. It's a little bit sub-optimal, but it works for us.
So, just letting both install their gateway CRDs? Does ArgoCD complain about that?
@genebean Well, it's a little finicky. Since there are different versions we have to do it semi-manually.
We let linkerd-crds install the GRPCRoute CRD, but not HTTPRoute. The HTTPRoute CRD we get from Traefik with the experimental channel enabled (a Helm values parameter in the chart). So it requires a little bit of partial syncing.
So linkerd-crds and traefik are both marked as OutOfSync, as each is missing either GRPCRoute or HTTPRoute, which kind of sucks, but at least it seems to function well.
I find it a little hard to wrap my head around all the different channels and versions so if you have a better suggestion I'd be happy to hear it!
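The split described in this comment amounts to giving each overlapping CRD exactly one owner. A minimal sketch of that bookkeeping follows; the chart contents are illustrative placeholders based on the comment, not the charts' real CRD lists, and `plan_ownership` is a hypothetical helper rather than anything ArgoCD or Helm provides.

```python
def plan_ownership(charts: dict[str, set[str]], preferred: dict[str, str]) -> dict[str, str]:
    """Assign every CRD kind to exactly one chart.

    charts: chart name -> CRD kinds it can install.
    preferred: CRD kind -> chart that should own it when several charts ship it.
    """
    owners: dict[str, str] = {}
    for kind in sorted(set().union(*charts.values())):
        candidates = sorted(name for name, kinds in charts.items() if kind in kinds)
        if len(candidates) == 1:
            # Only one chart ships this kind, so there is no conflict.
            owners[kind] = candidates[0]
        elif preferred.get(kind) in candidates:
            owners[kind] = preferred[kind]
        else:
            raise ValueError(f"{kind} is shipped by {candidates} but has no preferred owner")
    return owners


if __name__ == "__main__":
    # Illustrative only: both charts are assumed to ship both kinds.
    charts = {
        "linkerd-crds": {"GRPCRoute", "HTTPRoute"},
        "traefik": {"GRPCRoute", "HTTPRoute"},
    }
    # The choice described in the comment above.
    preferred = {"GRPCRoute": "linkerd-crds", "HTTPRoute": "traefik"}
    for kind, owner in plan_ownership(charts, preferred).items():
        print(f"{kind}: {owner}")
```

Making the owner explicit per kind mirrors the manual partial sync: whichever chart is not the owner will still report the CRD as missing, which matches the OutOfSync status mentioned above.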
@FredrikAugust ...how do you have linkerd-crds install GRPCRoute but not HTTPRoute?
@kflynn we first sync the chart in ArgoCD, which installs the CRD from linkerd-crds, and then I don't remember whether we deleted the CRD and installed the CRDs from Traefik, or just installed Traefik and let ArgoCD handle the conflict for us. (That's why I mentioned partial syncing: it's how we do this using ArgoCD, but I suppose you could do it just fine manually by applying and deleting CRDs with kubectl.)
We did essentially the same for HTTPRoute, as both ship that CRD, but here we let traefik install it. I don't remember why, to be honest, but it might be due to version requirements from the aforementioned argo-rollouts-gateway-api plugin.
A merge request was created on the Helm repo to split Traefik and its CRDs into separate charts, with the ability to disable the Gateway CRDs, so that should help with the conflicts in ArgoCD.
What problem are you trying to solve?
Hello! I saw that the Traefik Helm chart has been updated to use v1 of the stable channel for GRPCRoutes, and was wondering whether linkerd2 supports that or plans to introduce it soon. We would love to upgrade :)
How should the problem be solved?
Add the v1 CRD of GRPCRoutes.
Any alternatives you've considered?
Not really.
How would users interact with this feature?
No response
Would you like to work on this feature?
None