Closed codex70 closed 2 years ago
First, are you using the kind: MeshService backend? The ResolvedRefs status condition you're seeing seems to indicate a failure to resolve a Kubernetes Service named test-service in the test namespace, rather than a Consul service. The standard kind: Service Route backend will only find Kubernetes Services in the same Kubernetes cluster, not Consul services outside the Kubernetes cluster to which Consul API Gateway is deployed.
While this doesn't seem to be documented, I believe the functionality of forwarding traffic to Consul services in other datacenters is not yet supported. Consul service resolution from MeshService uses findCatalogService and doesn't specify a Datacenter parameter for api.QueryOptions, which I believe would limit results to Consul services registered in the same datacenter as the Consul agent serving the API request. If you're trying to reach a service from a different Kubernetes cluster registered in the same Consul datacenter though, this may work, but I haven't tested to confirm. https://github.com/hashicorp/consul-api-gateway/blob/145bcc9bf009a21b2170f7c27928bcbdca856c9a/internal/k8s/service/resolver.go#L382-L384
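To illustrate the datacenter scoping described above: Consul's HTTP catalog endpoint takes an optional dc query parameter, and when it is omitted the agent answers for its own datacenter. The sketch below only builds the request URL. The agent address, service name, and datacenter name are hypothetical, and this is not the resolver code from consul-api-gateway.

```python
from typing import Optional
from urllib.parse import urlencode


def catalog_service_url(agent_addr: str, service: str,
                        datacenter: Optional[str] = None) -> str:
    """Build the URL for Consul's GET /v1/catalog/service/<name> endpoint.

    When `datacenter` is None no `dc` parameter is sent, so the agent
    resolves the service in its own datacenter only.
    """
    url = f"{agent_addr.rstrip('/')}/v1/catalog/service/{service}"
    if datacenter:
        url += "?" + urlencode({"dc": datacenter})
    return url


# Hypothetical agent address and names, for illustration only.
local = catalog_service_url("http://127.0.0.1:8500", "test-service")
remote = catalog_service_url("http://127.0.0.1:8500", "test-service", "dc2")
print(local)
print(remote)
```

The first URL corresponds to the behaviour described in the comment (no Datacenter set); the second shows what an explicit cross-datacenter lookup would look like at the HTTP level.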
If using Consul Enterprise, the Consul namespace will be inferred from the connectInject.consulNamespaces configuration; for Consul OSS deployments it will be the default namespace.
I'm not quite sure what would be causing the TLS error when attempting to deploy an API Gateway in a secondary datacenter, but I believe that functionality is likewise not yet supported.
Thanks for getting back to me about this, it definitely helps explain what's going on. I did try MeshService, but it complained about the type (will check the error message, but I suspect I need to apply the following: https://github.com/hashicorp/consul-api-gateway/blob/main/config/crd/bases/api-gateway.consul.hashicorp.com_meshservices.yaml)
I will investigate this in more detail tomorrow and let you know how I get on. I have two options: one is the Single Consul Datacenter in Multiple Kubernetes Clusters (https://www.consul.io/docs/k8s/installation/deployment-configurations/single-dc-multi-k8s) and the other is Federation Between Kubernetes Clusters (https://www.consul.io/docs/k8s/installation/multi-cluster/kubernetes). I have managed to get each option working with varying degrees of success for cross-cluster and service mesh communication.
Anyway, I will do more testing and update the thread tomorrow.
A missing CRD would definitely explain not being able to use MeshService. Make sure you're installing the CRDs as described at https://www.consul.io/docs/api-gateway/consul-api-gateway-install#installation to get Consul API Gateway's custom CRDs (such as MeshService) in addition to the upstream Gateway API CRDs.
Definitely let us know about anything you manage to get working, and we'll consider proper support for federated services as a feature for our roadmap.
@mikemorris, I was hoping to have a look at this, but realised that, whatever configuration changes I have made, the cross-cluster service mesh connection through the mesh gateway is now broken for Kafka. I was running Kafka inside the service mesh and it was working. I've tried to roll back my changes but can't get it working again, and I'm finding the issue difficult to debug. Is it worth mentioning here, should I open another ticket, or is there a better place to seek support for the mesh gateway?
By the way, I checked the CRDs: I had installed them, but for a previous version, so perhaps updating them will fix some of the issues. As for the Kafka problem, I've opened a separate issue as it's something very different: https://github.com/hashicorp/consul/issues/14125 I will get back to you about this as soon as the Kafka issue is fixed.
Looks like https://github.com/hashicorp/consul-k8s/issues/1344 is tracking the issue currently preventing creation of a Gateway in secondary datacenters in a WAN-federated Consul deployment.
Thanks @mikemorris, as you can see I've added my comment there as well. I've also fixed the issue I had with implementing kafka which now frees me up to do some more testing on the API gateway
@mikemorris I've now been able to do some more testing. If I add in kind: MeshService, I get the following error when looking at the route's status:
"parents": [
{
"conditions": [
{
"lastTransitionTime": "2022-08-17T10:33:01Z",
"message": "1 error occurred:\n\t* route is in an invalid state and cannot bind\n\n",
"observedGeneration": 2,
"reason": "BindError",
"status": "False",
"type": "Accepted"
},
{
"lastTransitionTime": "2022-08-17T10:33:01Z",
"message": "unsupported reference type",
"observedGeneration": 2,
"reason": "Errors",
"status": "False",
"type": "ResolvedRefs"
}
],
"controllerName": "hashicorp.com/consul-api-gateway-controller",
"parentRef": {
"group": "gateway.networking.k8s.io",
"kind": "Gateway",
"name": "api-gateway",
"namespace": "consul"
}
}
More importantly though, is there a way of debugging an HttpRoute? I've currently only got one route that's working, the second route looks like everything is correct, but when I try to curl the endpoint, it returns a 404 error. I can't see anything in any of the logs to tell me where the error is.
More importantly though, is there a way of debugging an HttpRoute?
How you've been doing it so far is correct - first checking the route status field, then controller logs - if something isn't implemented correctly it may be helpful to dump the actual applied Envoy config, but this should be enough to debug most cases (and when it's not, we could likely benefit from contributions improving status messages, logs, or docs).
A route is only "applied/in effect" when its type: Accepted condition has status: True (hence the 404 for no match), and would only successfully route to a backend when type: ResolvedRefs also has status: True.
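As a rough illustration of the check described above, the following sketch walks a route's status and reports, per parent Gateway, which of the two conditions is not True. The helper names are hypothetical, and the sample data is trimmed from the status posted earlier in this thread.

```python
from typing import List


def condition_is_true(conditions: List[dict], cond_type: str) -> bool:
    """Return True only if a condition of the given type has status "True"."""
    return any(
        c.get("type") == cond_type and c.get("status") == "True"
        for c in conditions
    )


def route_problems(route_status: dict) -> List[str]:
    """List, per parent Gateway, which of Accepted/ResolvedRefs is not True."""
    problems = []
    for parent in route_status.get("parents", []):
        name = parent.get("parentRef", {}).get("name", "<unknown>")
        for cond_type in ("Accepted", "ResolvedRefs"):
            if not condition_is_true(parent.get("conditions", []), cond_type):
                problems.append(f"{name}: {cond_type} is not True")
    return problems


# Status trimmed from the route shown earlier in this thread.
status = {
    "parents": [
        {
            "conditions": [
                {"type": "Accepted", "status": "False", "reason": "BindError"},
                {"type": "ResolvedRefs", "status": "False", "reason": "Errors"},
            ],
            "parentRef": {"kind": "Gateway", "name": "api-gateway", "namespace": "consul"},
        }
    ]
}
problems = route_problems(status)
print(problems)
```

An empty list would mean the route is both accepted and has resolved backends; anything else points at the condition whose message is worth reading first.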
if I add in kind: MeshService I get the following error when looking at the route's status:
"message": "unsupported reference type", "status": "False", "type": "ResolvedRefs"
In addition to specifying kind: MeshService, it would also be necessary to set group: api-gateway.consul.hashicorp.com in that BackendRef, as Group will default to the core API group of kind: Service if unspecified. The mismatch is causing the unsupported reference type error message: it's looking for a MeshService kind in the core API group, where it doesn't exist (if the CRD was installed, it should exist in our implementation-specific group).
This is documented in the Routes configuration docs, but should probably be mentioned in MeshService too.
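The defaulting described above can be sketched as a small lint over a backendRef. This is only an illustration of the group/kind mismatch, not the controller's resolution logic; the function name and the sample backendRef names are hypothetical.

```python
from typing import Optional

CORE_GROUP = ""  # the core Kubernetes API group, where kind: Service lives
MESH_SERVICE_GROUP = "api-gateway.consul.hashicorp.com"


def check_backend_ref(backend_ref: dict) -> Optional[str]:
    """Return a hint if a MeshService backendRef would resolve in the wrong group."""
    # Gateway API defaults: kind falls back to Service, group to the core group.
    kind = backend_ref.get("kind", "Service")
    group = backend_ref.get("group", CORE_GROUP)
    if kind == "MeshService" and group != MESH_SERVICE_GROUP:
        return (f"kind: MeshService resolved in group {group!r}; "
                f"set group: {MESH_SERVICE_GROUP}")
    return None


# Hypothetical backendRefs: the first omits group, the second sets it.
broken = {"kind": "MeshService", "name": "test-service"}
fixed = {"group": MESH_SERVICE_GROUP, "kind": "MeshService", "name": "test-service"}
print(check_backend_ref(broken))
print(check_backend_ref(fixed))
```

The first backendRef reproduces the situation behind the unsupported reference type message; the second is the shape suggested in the comment above.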
@codex70 @manobi I recorded a demo yesterday pulling together the 3 related PRs that will be included across the upcoming consul-k8s v0.49.0 and consul-api-gateway v0.5.0 releases to support Gateway per cluster in a federated setup:
Note: This adds support for a Gateway in the secondary datacenter routing to services within the same datacenter. This does not add support for routing from a Gateway in one datacenter to services in another datacenter. This is now reflected in our docs, which will be updated again when the releases referenced above are completed.
https://user-images.githubusercontent.com/3476400/193070791-541d526e-2606-4560-84a4-1136f12c56f4.mp4
@nathancoleman I'll try this soon, thank you for sharing.
@nathancoleman I've tried with consul-k8s (0.49.0) and hashicorppreview/consul-api-gateway:0.5-dev but still:
2022-10-02T00:09:03.658Z [ERROR] consul/certmanager.go:257: consul-api-gateway-server.cert-manager: error grabbing leaf certificate: error="Unexpected response code: 403 (rpc error making call: rpc error making call: Permission denied: token with AccessorID 'REDACTED' lacks permission 'service:write' on \"consul-api-gateway-controller\")"
This is what it looks like in the Consul UI on "DC2" (AccessorIDs and datacenter name have been redacted):
PS: my DC1 is still running consul-k8s v0.48.0 and has many federated datacenters connected (31), each on a different version.
Hi @manobi :wave:
I was able to get everything working w/ fresh clusters/datacenters using 0.48.0 for the primary dc and 0.49.0 for the secondary dc. I do notice though that the role for the controller in my case has a policy attached where yours does not. I'm looking into how this could have come to be in your case. Does an analogous policy (api-gateway-controller-policy-<dc_name>) exist in your UI and just isn't attached to the role, or does the policy not exist at all?
PS: any chance you could share your values.yaml files? Also curious if you did an upgrade with the Gateway already existing in your K8s cluster from when you had consul-k8s 0.48.0 installed, or did you recreate it after installing 0.49.0?
Hi @nathancoleman
The policy does exist, and when the secondary datacenter was created there was already a registered Gateway in the primary dc (v0.48.0).
apiGateway:
  enabled: true
  image: hashicorppreview/consul-api-gateway:0.5-dev
  managedGatewayClass:
    copyAnnotations:
      service:
        annotations: |
          - service.beta.kubernetes.io/aws-load-balancer-backend-protocol
          - service.beta.kubernetes.io/aws-load-balancer-name
          - service.beta.kubernetes.io/aws-load-balancer-nlb-target-type
          - service.beta.kubernetes.io/aws-load-balancer-scheme
          - service.beta.kubernetes.io/aws-load-balancer-type
          - service.beta.kubernetes.io/aws-load-balancer-ssl-cert
client:
  extraConfig: |
    {
      "leave_on_terminate": true,
      "advertise_reconnect_timeout": "60s",
      "limits": {
        "http_max_conns_per_client": 65535
      }
    }
  priorityClassName: heaviest
  resources:
    limits:
      cpu: 100m
      memory: 350Mi
    requests:
      cpu: 20m
      memory: 200Mi
connectInject:
  default: false
  enabled: true
  metrics:
    defaultEnableMerging: false
    defaultEnabled: false
  resources:
    limits:
      cpu: 50m
      memory: 180Mi
    requests:
      cpu: 50m
      memory: 180Mi
  sidecarProxy:
    resources:
      limits:
        cpu: 100m
        memory: 100Mi
      requests:
        cpu: 13m
        memory: 81Mi
controller:
  enabled: true
  resources:
    limits:
      cpu: 100m
      memory: 50Mi
    requests:
      cpu: 100m
      memory: 50Mi
global:
  acls:
    createReplicationToken: false
    manageSystemACLs: true
    replicationToken:
      secretKey: replicationToken
      secretName: consul-consul-federation
  consulAPITimeout: 5m
  datacenter: qa-ecommerce
  enableGatewayMetrics: true
  federation:
    enabled: true
    k8sAuthMethodHost: <REDACTED>
    primaryDatacenter: dc1
  metrics:
    agentMetricsRetentionTime: 1m
    baseURL: http://prometheus-kube-prometheus-prometheus.monitoring.svc.cluster.local:9090
    enableGatewayMetrics: true
    enabled: true
  tls:
    caCert:
      secretKey: caCert
      secretName: consul-consul-federation
    caKey:
      secretKey: caKey
      secretName: consul-consul-federation
    enabled: true
ingressGateways:
  defaults:
    service:
      annotations: |
        "service.beta.kubernetes.io/aws-load-balancer-name": "qa-ecommerce-consul-ingress-gate"
        "service.beta.kubernetes.io/aws-load-balancer-nlb-target-type": "ip"
        "service.beta.kubernetes.io/aws-load-balancer-scheme": "internal"
        "service.beta.kubernetes.io/aws-load-balancer-ssl-cert": ""
        "service.beta.kubernetes.io/aws-load-balancer-type": "nlb-ip"
      ports:
      - nodePort: null
        port: 443
      type: LoadBalancer
  enabled: false
  gateways:
  - name: ingress-gateway
    resources:
      limits:
        cpu: 400m
        memory: 150Mi
      requests:
        cpu: 160m
        memory: 100Mi
meshGateway:
  enabled: true
  replicas: 1
  resources:
    limits:
      cpu: 300m
      memory: 100Mi
    requests:
      cpu: 100m
      memory: 100Mi
  service:
    annotations: |
      "service.beta.kubernetes.io/aws-load-balancer-backend-protocol": "ssl"
      "service.beta.kubernetes.io/aws-load-balancer-internal": "true"
      "service.beta.kubernetes.io/aws-load-balancer-name": "qa-ecommerce-consul-mesh-gateway"
      "service.beta.kubernetes.io/aws-load-balancer-nlb-target-type": "ip"
      "service.beta.kubernetes.io/aws-load-balancer-scheme": "internal"
      "service.beta.kubernetes.io/aws-load-balancer-type": "nlb-ip"
server:
  extraConfig: |
    {
      "ui_config": {
        "enabled": true,
        "metrics_provider": "prometheus",
        "metrics_proxy": {
          "base_url": "http://prometheus-kube-prometheus-prometheus.monitoring.svc.cluster.local:9090"
        },
        "dashboard_url_templates": {
          "service": "<redacted>"
        }
      }
    }
  extraVolumes:
  - items:
    - key: serverConfigJSON
      path: config.json
    load: true
    name: consul-consul-federation
    type: secret
  nodeSelector: ""
  priorityClassName: heavy
  resources:
    limits:
      cpu: 500m
      memory: 700Mi
    requests:
      cpu: 250m
      memory: 400Mi
ui:
  metrics:
    baseURL: http://prometheus-kube-prometheus-prometheus.monitoring.svc.cluster.local:9090
    enabled: true
    provider: prometheus
@manobi if you apply that policy to the role analogous to the one I screenshotted, does everything work for you standing up a Gateway in the secondary dc?
@nathancoleman From the UI it's not working; the browser crashes while loading the policy options. Maybe there are too many roles/policies and the same error happens during token bootstrap?
consul acl policy list -token=<redacted> | grep ID | wc -l
252
consul acl role update -id=16382188-2b3f-a628-a434-af342bf2f97e -policy-id=d1acd2a4-bffc-7ddf-63b5-14af3f338417 -token=<redacted>
After that the consul-api-gateway-controller seems to be running, but how can I make sure it will work the next time I upgrade?
@manobi I'm hoping to understand why it failed in this case. Any chance you have the logs from the consul-api-gateway-controller pod's api-gateway-controller-acl-init container when this failed? It seems like the logic to bind the policy to the role here failed.
Even after the manual attachment, the api-gateway-controller-acl-init container failed twice before it started running, with the following logs:
2022-10-03T20:14:33.393Z [INFO] Consul login complete
2022-10-03T20:14:33.393Z [INFO] Checking that the ACL token exists when reading it in the stale consistency mode
2022-10-03T20:14:33.394Z [ERROR] Unable to read ACL token; retrying: err="Unexpected response code: 403 (ACL not found)"
2022-10-03T20:14:33.497Z [ERROR] Unable to read ACL token; retrying: err="Unexpected response code: 403 (ACL not found)"
2022-10-03T20:14:33.598Z [ERROR] Unable to read ACL token; retrying: err="Unexpected response code: 403 (ACL not found)"
2022-10-03T20:14:33.701Z [ERROR] Unable to read ACL token; retrying: err="Unexpected response code: 403 (ACL not found)"
2022-10-03T20:14:33.803Z [ERROR] Unable to read ACL token; retrying: err="Unexpected response code: 403 (ACL not found)"
2022-10-03T20:14:33.905Z [ERROR] Unable to read ACL token; retrying: err="Unexpected response code: 403 (ACL not found)"
2022-10-03T20:14:34.008Z [ERROR] Unable to read ACL token; retrying: err="Unexpected response code: 403 (ACL not found)"
2022-10-03T20:14:34.110Z [ERROR] Unable to read ACL token; retrying: err="Unexpected response code: 403 (ACL not found)"
2022-10-03T20:14:34.214Z [ERROR] Unable to read ACL token; retrying: err="Unexpected response code: 403 (ACL not found)"
2022-10-03T20:14:34.316Z [ERROR] Unable to read ACL token; retrying: err="Unexpected response code: 403 (ACL not found)"
2022-10-03T20:14:34.418Z [ERROR] Unable to read ACL token; retrying: err="Unexpected response code: 403 (ACL not found)"
2022-10-03T20:14:34.520Z [ERROR] Unable to read ACL token; retrying: err="Unexpected response code: 403 (ACL not found)"
2022-10-03T20:14:34.623Z [ERROR] Unable to read ACL token; retrying: err="Unexpected response code: 403 (ACL not found)"
2022-10-03T20:14:34.725Z [ERROR] Unable to read ACL token; retrying: err="Unexpected response code: 403 (ACL not found)"
2022-10-03T20:14:34.827Z [ERROR] Unable to read ACL token; retrying: err="Unexpected response code: 403 (ACL not found)"
I've noticed a similar behaviour with the mesh-gateway and controller components as well.
After your direction and the UI crashing, I'm starting to believe it is somehow skipping the binding rules list when there are many items to process.
It might not be related to api-gateway but to some consul-k8s bug.
@manobi that would make sense as the possible cause. That scale is the main difference between my temporary setups and your own. I'll be traveling most of this week but will see if I can find out anything once I'm back.
The 403 (ACL not found) errors look like they could be a manifestation of https://github.com/hashicorp/consul-k8s/pull/887
@nathancoleman could we maybe implement the same workaround as consul-ecs did in https://github.com/hashicorp/consul-ecs/pull/79 until Consul adds "read your writes" support for an improved consul login UX (without the performance overhead of switching to consistent reads)?
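For readers unfamiliar with the pattern being discussed, a generic way to tolerate a freshly created token not yet being visible to stale reads is to retry the read with backoff. The sketch below is an assumption about the general shape of such a workaround, not the code from the consul-ecs or consul-k8s pull requests; ACLNotFound, wait_for_token, and the fake reader are hypothetical names.

```python
import time
from typing import Callable, List


class ACLNotFound(Exception):
    """Stands in for Consul's "403 (ACL not found)" response (hypothetical)."""


def wait_for_token(read_token: Callable[[], dict],
                   attempts: int = 10,
                   base_delay: float = 0.1,
                   max_delay: float = 2.0,
                   sleep: Callable[[float], None] = time.sleep) -> dict:
    """Retry read_token with exponential backoff until the token is readable."""
    if attempts < 1:
        raise ValueError("attempts must be >= 1")
    delay = base_delay
    for attempt in range(1, attempts + 1):
        try:
            return read_token()
        except ACLNotFound:
            if attempt == attempts:
                raise  # give up and surface the last error
            sleep(delay)
            delay = min(delay * 2, max_delay)
    raise AssertionError("unreachable")


# Fake reader: the token becomes visible on the fourth read.
calls = {"n": 0}


def fake_read() -> dict:
    calls["n"] += 1
    if calls["n"] <= 3:
        raise ACLNotFound()
    return {"AccessorID": "example"}


delays: List[float] = []
token = wait_for_token(fake_read, sleep=delays.append)
print(token, delays)
```

Injecting the sleep function keeps the example fast to run; a real caller would leave the default and pass a function that performs the actual token read.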
@mikemorris
Given that my api-gateway-controller is running and I have deployed the Gateway resource, when I apply the ReferenceGrant and HTTPRoute in my secondary dc the routing does not seem to be working.
Is there a way to debug whether the routing has actually been registered? Unlike Gateways in the primary dc, the Consul UI does not show connections between the gateway and the target service.
With log-level=trace enabled I saw the following status:
"conditions": [
{
"type": "Ready",
"status": "True",
"observedGeneration": 1,
"lastTransitionTime": "2022-10-04T22:52:16Z",
"reason": "Ready",
"message": "Ready"
},
{
"type": "Scheduled",
"status": "True",
"observedGeneration": 1,
"lastTransitionTime": "2022-10-04T22:52:16Z",
"reason": "Scheduled",
"message": "Scheduled"
},
{
"type": "InSync",
"status": "False",
"observedGeneration": 1,
"lastTransitionTime": "2022-10-04T22:52:16Z",
"reason": "SyncError",
"message": "error adding ingress config entry: 1 error occurred:\n\t* Unexpected response code: 403 (rpc error making call: rpc error making call: Permission denied: token with AccessorID '0323cd06-e494-1d61-2cc9-3f8570954046' lacks permission 'mesh:write')\n\n"
}
],
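When scanning many conditions or log lines like the one above, it can help to pull out just the accessor ID and the missing permission. A minimal sketch, assuming the message keeps the wording seen in this thread; the function name is hypothetical.

```python
import re
from typing import Optional, Tuple

# Matches the "Permission denied" wording seen in the messages in this thread.
_PERMISSION_RE = re.compile(
    r"token with AccessorID '(?P<accessor>[^']+)' "
    r"lacks permission '(?P<permission>[^']+)'"
)


def missing_permission(message: str) -> Optional[Tuple[str, str]]:
    """Return (accessor_id, permission) if the message reports a denied permission."""
    match = _PERMISSION_RE.search(message)
    if match is None:
        return None
    return match.group("accessor"), match.group("permission")


# Message copied from the InSync condition above.
message = (
    "error adding ingress config entry: 1 error occurred:\n\t* Unexpected "
    "response code: 403 (rpc error making call: rpc error making call: "
    "Permission denied: token with AccessorID "
    "'0323cd06-e494-1d61-2cc9-3f8570954046' lacks permission 'mesh:write')\n\n"
)
result = missing_permission(message)
print(result)
```

The accessor ID can then be looked up in Consul to see which roles and policies are actually attached to the token.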
The HTTPRoute resource status seems to be ok, but routing is not working:
status:
  parents:
  - conditions:
    - lastTransitionTime: '2022-10-04T23:04:20Z'
      message: Route accepted.
      observedGeneration: 1
      reason: Accepted
      status: 'True'
      type: Accepted
    - lastTransitionTime: '2022-10-04T23:04:20Z'
      message: ResolvedRefs
      observedGeneration: 1
      reason: ResolvedRefs
      status: 'True'
      type: ResolvedRefs
Upstreams in secondary DC (0):
Upstreams in primary DC (1):
consul-k8s proxy read <gateway-pod-name> -context=dc2:
==> Clusters (3)
==> Endpoints (3)
==> Listeners (1)
==> Routes (1)
==> Secrets (2)
consul-k8s proxy read <gateway-pod-name> -context=dc1:
==> Clusters (6)
==> Endpoints (6)
==> Listeners (2)
==> Routes (1)
==> Secrets (2)
Hi @manobi, were you able to get this working? Just to clarify, your Gateway, HTTPRoute, ReferenceGrant and backend Service that the route is targeting are all in the secondary datacenter, correct?
Yes, they are all running in the secondary datacenter, but I have not been able to get this working. Still seeing the following in api-gateway-controller:
error adding ingress config entry: 1 error occurred:\n\t* Unexpected response code: 403 (rpc error making call: rpc error making call: rpc error making call: Permission denied: token with AccessorID '0323cd06-e494-1d61-2cc9-3f8570954046' lacks permission 'mesh:write')\n\n
How can I force this "mesh:write" permission?
The gateway deployment is running in the secondary datacenter, but there is no service-defaults or ingress-gateway registered. What policy should api-gateway-controller use to be able to register those configs?
@manobi I'd expect it to be using api-gateway-controller-policy-<datacenter>, which has the higher-level operator = "write" permission. You can see what I'm expecting in the screenshot a ways up: https://github.com/hashicorp/consul-api-gateway/issues/300#issuecomment-1265925913.
It makes sense that the config entries aren't registered because the controller isn't able to create them in your setup. I'm not yet sure why this is, and I haven't been able to reproduce it myself.
Just to be certain, to replicate your setup, I need consul-k8s v0.48.0 in my primary datacenter and consul-k8s v0.49.0 in my secondary datacenter. Is that accurate? Are you using consul-api-gateway v0.5-dev in both datacenters?
@nathancoleman The only way I've managed to make it work was by attaching the controller-policy to the api-gateway-controller token.
My current setup is the following one:
Primary datacenter:
Secondary datacenter:
@manobi here's a writeup of the whole process I went through to replicate the issue, but I'm still seeing everything work. I figure at least this will show what the Kubernetes Deployment and Consul roles+policies for the consul-api-gateway-controller should look like. Can you take a look and let me know if anything I'm doing doesn't match your setup or if you can identify the diff between my resulting config and yours? Feel free to comment right on the gist if you like.
https://gist.github.com/nathancoleman/076343780c3e0b4c03fb91f9d4f84616
@nathancoleman thank you, I'll try to reproduce your steps. The manual changes I made allowed me to test other things. Do you think something changed in 0.5 that would break URLRewrite?
The service router is not reading the filters with URLRewrite:
apiVersion: gateway.networking.k8s.io/v1beta1
kind: HTTPRoute
metadata:
name: my-service
namespace: consul
spec:
parentRefs:
- name: digital-api-qa
rules:
- matches:
- path:
type: PathPrefix
value: "/my-service/v1"
backendRefs:
- kind: Service
name: my-service
namespace: my-service
port: 80
weight: 100
filters:
- type: URLRewrite
urlRewrite:
path:
type: ReplacePrefixMatch
replacePrefixMatch: "/api/v1"
Becomes:
{
"Kind": "service-router",
"Name": "digital-api-qa-735653bb",
"Routes": [
{
"Match": {
"HTTP": {
"PathPrefix": "/my-service/v1"
}
},
"Destination": {
"Service": "my-service",
"RequestHeaders": {}
}
}
],
"Meta": {
"consul-api-gateway/k8s/Gateway.Name": "digital-api-qa",
"consul-api-gateway/k8s/Gateway.Namespace": "consul",
"external-source": "consul-api-gateway"
},
"CreateIndex": 242705,
"ModifyIndex": 242705
}
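One way to spot the symptom above in a generated config entry is to check each route's Destination for a rewrite. The sketch below assumes that a ReplacePrefixMatch filter is expected to surface as the service-router's Destination.PrefixRewrite field; that mapping is an assumption about the translation, and the function name is hypothetical. The sample data is trimmed from the entry shown above.

```python
from typing import List


def routes_without_prefix_rewrite(service_router: dict) -> List[str]:
    """Return the PathPrefix of every route whose Destination has no PrefixRewrite."""
    missing = []
    for route in service_router.get("Routes", []):
        prefix = route.get("Match", {}).get("HTTP", {}).get("PathPrefix", "")
        destination = route.get("Destination", {})
        if not destination.get("PrefixRewrite"):
            missing.append(prefix)
    return missing


# Trimmed from the service-router entry shown above.
entry = {
    "Kind": "service-router",
    "Name": "digital-api-qa-735653bb",
    "Routes": [
        {
            "Match": {"HTTP": {"PathPrefix": "/my-service/v1"}},
            "Destination": {"Service": "my-service", "RequestHeaders": {}},
        }
    ],
}
missing = routes_without_prefix_rewrite(entry)
print(missing)
```

A non-empty result for a route that declares a URLRewrite filter suggests the filter was dropped during translation.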
@manobi thanks for calling that out. Fixed in https://github.com/hashicorp/consul-api-gateway/pull/414
@manobi I'm asking around to see if anyone has encountered issues like the role bindings failing to apply at a scale of hundreds of roles/policies.
My understanding is that the missing role bindings are the only issue you're seeing at this point (given the fix in #414) and that everything works as expected when you manually apply those bindings. Is that accurate?
@nathancoleman Accurate. The ACL not found error is not restricted to API gateway; I can see it in other components that eventually reconcile.
It might be the problem mentioned by @mikemorris. If I have to bind the role manually, it's not a huge problem.
I was more worried while I had no idea what was going on. Thank you.
@nathancoleman Will the https://github.com/hashicorp/consul-api-gateway/pull/414 fix be automatically published to Docker Hub, or is it a manual action? I'm looking forward to getting my hands on it and maybe creating another issue in consul-k8s to investigate the race condition in ACLs, as it looks like there are no problems in consul-api-gateway itself.
Seems unfair to hold the v0.5 release if there are no other issues.
@manobi you'll see it published to Docker Hub in a few minutes after I merge https://github.com/hashicorp/consul-api-gateway/pull/416. The merge of #414 itself didn't publish because our tooling identified the CVE referenced in #416.
Edit: You can now see an updated set of tags out on https://hub.docker.com/r/hashicorppreview/consul-api-gateway/tags
Just to confirm that I've got URLRewrite working again with: hashicorppreview/consul-api-gateway:0.5-dev-55da4a56cda79d0e97a7f2d40f503923ff57ba62
Thank you @nathancoleman
@codex70 @manobi I believe this particular issue can be closed now but wanted to run it by you first. Thoughts?
The upcoming v0.5.0 release of Consul API Gateway will allow you to run the API gateway controller and create Gateways that route to services within the same datacenter whether that datacenter is a primary or secondary datacenter.
We should close it. Thanks
Just to confirm I have been able to test this and it is now working following on from the fix for: https://github.com/hashicorp/consul-k8s/issues/1344
Overview of the Issue
I don't seem to be able to set up API Gateway in such a way that I can either have access to all mesh services from a single API Gateway, or use an API Gateway per cluster.
Reproduction Steps
Logs
Error when trying to add mesh service from second cluster to API Gateway in first cluster
Error when trying to connect to a second API Gateway in the second datacenter cluster.
Expected behavior
There is a documented solution for setting up API Gateways across federated clusters.
Environment details
Additional Context
I suspect this is a simple case of me not seeing the specific documentation required to set this up correctly, but I'm having a lot of problems getting the API Gateway up and running across multiple clusters.