hashicorp / consul-api-gateway

The Consul API Gateway is a dedicated ingress solution for intelligently routing traffic to applications running on a Consul Service Mesh.
Mozilla Public License 2.0

There doesn't appear to be a way to create an API Gateway, or Gateway per cluster in a federated WAN #300

Closed codex70 closed 1 year ago

codex70 commented 2 years ago

Overview of the Issue

I don't seem to be able to set up the API Gateway in such a way that I can either have access to all mesh services from a single API Gateway, or use an API Gateway per cluster.

Reproduction Steps

  1. Set up an initial cluster using Helm charts and create an API Gateway (this all works as expected)
  2. Set up a second federated cluster following the instructions here: https://www.consul.io/docs/k8s/installation/multi-cluster/kubernetes
  3. Services in the second datacenter are not accessible to the API Gateway created in the first datacenter cluster.
  4. Using the federated setup, creating a new API Gateway to access services in the second datacenter fails with SSL connection issues.

Logs

Error when trying to add mesh service from second cluster to API Gateway in first cluster

k get httproute/test-service-route -n test -o jsonpath='{.status}' | jq
{
  "parents": [
    {
      "conditions": [
        {
          "lastTransitionTime": "2022-08-08T07:38:16Z",
          "message": "1 error occurred:\n\t* route is in an invalid state and cannot bind\n\n",
          "observedGeneration": 2,
          "reason": "BindError",
          "status": "False",
          "type": "Accepted"
        },
        {
          "lastTransitionTime": "2022-08-08T07:38:16Z",
          "message": "k8s: service test/test-service not found",
          "observedGeneration": 2,
          "reason": "ServiceNotFound",
          "status": "False",
          "type": "ResolvedRefs"
        }
      ],
      "controllerName": "hashicorp.com/consul-api-gateway-controller",
      "parentRef": {
        "group": "gateway.networking.k8s.io",
        "kind": "Gateway",
        "name": "api-gateway",
        "namespace": "consul"
      }
    }
  ]
}

Error when trying to connect to a second API Gateway in the second datacenter cluster.

curl -vvi -k --header "Host: test-service.api.gateway" "https://${API}:8443/"
* TCP_NODELAY set
* Connected to X.X.X.X (X.X.X.X) port 8443 (#0)
* ALPN, offering h2
* ALPN, offering http/1.1
* successfully set certificate verify locations:
*   CAfile: /etc/pki/tls/certs/ca-bundle.crt
  CApath: none
* TLSv1.3 (OUT), TLS handshake, Client hello (1):
* OpenSSL SSL_connect: SSL_ERROR_SYSCALL in connection to X.X.X.X:8445
* Closing connection 0
curl: (35) OpenSSL SSL_connect: SSL_ERROR_SYSCALL in connection to X.X.X.X:8445

Expected behavior

There is a documented solution for setting up API Gateways across federated clusters.

Environment details

Additional Context

I suspect this is a simple case of me not seeing the specific documentation required to set this up correctly, but I'm having a lot of problems getting the API Gateway up and running across multiple clusters.

mikemorris commented 2 years ago
  1. First, are you using the kind: MeshService backend? The ResolvedRefs status condition you're seeing seems to indicate a failure to resolve a Kubernetes Service named test-service in the test namespace, rather than a Consul service - the standard kind: Service route backend will only find Kubernetes Services in the same Kubernetes cluster, not Consul services outside the Kubernetes cluster to which Consul API Gateway is deployed. (There's a rough MeshService sketch after this list.)

  2. While this doesn't seem to be documented, I believe the functionality of forwarding traffic to Consul services in other datacenters is not yet supported. Consul service resolution from MeshService uses findCatalogService and doesn't specify a Datacenter parameter for api.QueryOptions, which I believe would limit results to Consul services registered in the same datacenter as the Consul agent serving the API request. If you're trying to reach a service from a different Kubernetes cluster registered in the same Consul datacenter though, this may work, but I haven't tested to confirm. https://github.com/hashicorp/consul-api-gateway/blob/145bcc9bf009a21b2170f7c27928bcbdca856c9a/internal/k8s/service/resolver.go#L382-L384

    If using Consul Enterprise, the Consul namespace will be inferred from the connectInject.consulNamespaces configuration; for Consul OSS deployments it will be the default namespace.

  3. I'm not quite sure what would be causing the TLS error when attempting to deploy an API Gateway in a secondary datacenter, but I believe that functionality is likewise not yet supported.
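
For reference, a MeshService that points at a Consul catalog service looks roughly like the following - an untested sketch with placeholder names, so adjust the apiVersion and fields to match the MeshService CRD you have installed:

# Sketch only: a MeshService referencing a service registered in the Consul catalog,
# so a Route backendRef can target it instead of a Kubernetes Service.
apiVersion: api-gateway.consul.hashicorp.com/v1alpha1
kind: MeshService
metadata:
  name: test-service        # name the Route's backendRef will point at
  namespace: test           # Kubernetes namespace the Route resolves against
spec:
  name: test-service        # service name as registered in the Consul catalog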

codex70 commented 2 years ago

Thanks for getting back to me about this, it definitely helps explain what's going on. I did try MeshService, but it complained about the type (I'll check the error message, but I suspect I need to apply the following: https://github.com/hashicorp/consul-api-gateway/blob/main/config/crd/bases/api-gateway.consul.hashicorp.com_meshservices.yaml)

I will investigate this in more detail tomorrow and let you know how I get on. I have two options: one is Single Consul Datacenter in Multiple Kubernetes Clusters (https://www.consul.io/docs/k8s/installation/deployment-configurations/single-dc-multi-k8s) and the other is Federation Between Kubernetes Clusters (https://www.consul.io/docs/k8s/installation/multi-cluster/kubernetes). I have managed to get each option working with varying degrees of success for cross-cluster and service mesh communication.

Anyway, I will do more testing and update the thread tomorrow.

mikemorris commented 2 years ago

A missing CRD would definitely explain not being able to use MeshService. Make sure you're installing the CRDs as described at https://www.consul.io/docs/api-gateway/consul-api-gateway-install#installation to get Consul API Gateway's custom CRDs (such as MeshService) in addition to the upstream Gateway API CRDs.
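
If it helps, the CRDs can be pulled in with kustomize - roughly a kustomization.yaml like this (untested sketch; the ref tag is just an example, pin it to the release you're installing):

# Sketch: remote kustomize target for the Consul API Gateway CRDs.
# The ?ref= tag is an assumption - use the tag matching your installed version.
resources:
  - github.com/hashicorp/consul-api-gateway/config/crd?ref=v0.4.0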

Definitely let us know about anything you manage to get working, and we'll consider proper support for federated services as a feature for our roadmap.

codex70 commented 2 years ago

@mikemorris , I was hoping to have a look at this, but realised that, whatever configuration changes I have made, the cross-cluster service mesh connection through the mesh gateway is now broken for Kafka. I was running Kafka inside the service mesh and it was working. I've tried to roll back my changes but can't get it working again. It's proving difficult for me to debug the issue. Is it worth mentioning it here, should I open another ticket, or is there a better place to seek support for the mesh gateway?

codex70 commented 2 years ago

By the way, I checked the CRDs I had installed, and they were for a previous version, so perhaps updating them will fix some of the issues. As for the Kafka problem, I've opened a separate issue as it's something very different: https://github.com/hashicorp/consul/issues/14125. I will get back to you about this as soon as the Kafka issue is fixed.

mikemorris commented 2 years ago

Looks like https://github.com/hashicorp/consul-k8s/issues/1344 is tracking the issue currently preventing creation of a Gateway in secondary datacenters in a WAN-federated Consul deployment.

codex70 commented 2 years ago

Thanks @mikemorris, as you can see I've added my comment there as well. I've also fixed the issue I had with implementing Kafka, which now frees me up to do some more testing on the API gateway.

codex70 commented 2 years ago

@mikemorris I've now been able to do some more testing. If I add in kind: MeshService, I get the following error when looking at the route's status:

  "parents": [
    {
      "conditions": [
        {
          "lastTransitionTime": "2022-08-17T10:33:01Z",
          "message": "1 error occurred:\n\t* route is in an invalid state and cannot bind\n\n",
          "observedGeneration": 2,
          "reason": "BindError",
          "status": "False",
          "type": "Accepted"
        },
        {
          "lastTransitionTime": "2022-08-17T10:33:01Z",
          "message": "unsupported reference type",
          "observedGeneration": 2,
          "reason": "Errors",
          "status": "False",
          "type": "ResolvedRefs"
        }
      ],
      "controllerName": "hashicorp.com/consul-api-gateway-controller",
      "parentRef": {
        "group": "gateway.networking.k8s.io",
        "kind": "Gateway",
        "name": "api-gateway",
        "namespace": "consul"
      }
    }

codex70 commented 2 years ago

More importantly though, is there a way of debugging an HttpRoute? I've currently only got one route that's working; the second route looks like everything is correct, but when I try to curl the endpoint, it returns a 404 error. I can't see anything in any of the logs to tell me where the error is.

mikemorris commented 2 years ago

More importantly though, is there a way of debugging an HttpRoute?

How you've been doing it so far is correct - first checking the route status field, then controller logs - if something isn't implemented correctly it may be helpful to dump the actual applied Envoy config, but this should be enough to debug most cases (and when it's not, we could likely benefit from contributions improving status messages, logs, or docs).

A route is only "applied/in effect" when its type: Accepted condition has status: True (hence the 404 for no match), and would only successfully route to a backend when type: ResolvedRefs also has status: True.

if I add in kind: MeshService I get the following error when looking at the route's status:

"message": "unsupported reference type",
"status": "False",
"type": "ResolvedRefs"

In addition to specifying kind: MeshService, it would also be necessary to set group: api-gateway.consul.hashicorp.com in that BackendRef, as Group will default to the core API group of kind: Service if unspecified (the mismatch is causing the unsupported reference type error message - it's looking for a MeshService kind in the core API group, where it doesn't exist - if the CRD was installed, it should exist in our implementation-specific group).

This is documented in the Routes configuration docs, but should probably be mentioned in MeshService too.
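
In other words, the backendRef would need to look something like this (a rough sketch - the service name and port are placeholders for your setup):

# Sketch: an HTTPRoute rule whose backendRef targets a MeshService
# in the implementation-specific API group rather than the core group.
rules:
  - backendRefs:
      - group: api-gateway.consul.hashicorp.com   # defaults to the core ("") group if omitted
        kind: MeshService
        name: test-service
        port: 80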

nathancoleman commented 1 year ago

@codex70 @manobi I recorded a demo yesterday pulling together the 3 related PRs that will be included across the upcoming consul-k8s v0.49.0 and consul-api-gateway v0.5.0 releases to support Gateway per cluster in a federated setup:

Note: This adds support for a Gateway in the secondary datacenter routing to services within the same datacenter. This does not add support for routing from a Gateway in one datacenter to services in another datacenter. This is now reflected in our docs, which will be updated again when the releases referenced above are completed.

https://user-images.githubusercontent.com/3476400/193070791-541d526e-2606-4560-84a4-1136f12c56f4.mp4

manobi commented 1 year ago

@nathancoleman I'll try this soon, thank you for sharing.

manobi commented 1 year ago

@nathancoleman I've tried with consul-k8s (0.49.0) and hashicorppreview/consul-api-gateway:0.5-dev but I still see:

2022-10-02T00:09:03.658Z [ERROR] consul/certmanager.go:257: consul-api-gateway-server.cert-manager: error grabbing leaf certificate: error="Unexpected response code: 403 (rpc error making call: rpc error making call: Permission denied: token with AccessorID 'REDACTED' lacks permission 'service:write' on \"consul-api-gateway-controller\")"

This is what it looks like in the Consul UI on "DC2" (AccessorIDs and the datacenter name have been redacted):

[screenshots of the Consul UI showing the token and role in DC2]

PS: my DC1 is still running consul-k8s v0.48.0 and has many federated datacenters connected (31), each on a different version.

nathancoleman commented 1 year ago

Hi @manobi :wave: I was able to get everything working w/ fresh clusters/datacenters using 0.48.0 for the primary dc and 0.49.0 for the secondary dc. I do notice though that the role for the controller in my case has a policy attached where yours does not. I'm looking into how this could have come to be in your case. Does an analogous policy (api-gateway-controller-policy-<dc_name>) exist in your UI and just isn't attached to the role, or does the policy not exist at all?

PS: any chance you could share your values.yaml files? Also curious if you did an upgrade with the Gateway already existing in your K8s cluster from when you had consul-k8s 0.48.0 installed, or did you recreate it after installing 0.49.0?

[screenshot of the controller role with its attached policy]

manobi commented 1 year ago

Hi @nathancoleman, the policy does exist, and when the secondary datacenter was created there was already a Gateway registered in the primary dc (v0.48.0).

[screenshot of the existing policy]

Here is the values.yaml for the secondary datacenter:

apiGateway:
  enabled: true
  image: hashicorppreview/consul-api-gateway:0.5-dev
  managedGatewayClass:
    copyAnnotations:
      service:
        annotations: |
          - service.beta.kubernetes.io/aws-load-balancer-backend-protocol
          - service.beta.kubernetes.io/aws-load-balancer-name
          - service.beta.kubernetes.io/aws-load-balancer-nlb-target-type
          - service.beta.kubernetes.io/aws-load-balancer-scheme
          - service.beta.kubernetes.io/aws-load-balancer-type
          - service.beta.kubernetes.io/aws-load-balancer-ssl-cert
client:
  extraConfig: |
    {
      "leave_on_terminate": true,
      "advertise_reconnect_timeout": "60s",
      "limits": {
        "http_max_conns_per_client": 65535
      }
    }
  priorityClassName: heaviest
  resources:
    limits:
      cpu: 100m
      memory: 350Mi
    requests:
      cpu: 20m
      memory: 200Mi
connectInject:
  default: false
  enabled: true
  metrics:
    defaultEnableMerging: false
    defaultEnabled: false
  resources:
    limits:
      cpu: 50m
      memory: 180Mi
    requests:
      cpu: 50m
      memory: 180Mi
  sidecarProxy:
    resources:
      limits:
        cpu: 100m
        memory: 100Mi
      requests:
        cpu: 13m
        memory: 81Mi
controller:
  enabled: true
  resources:
    limits:
      cpu: 100m
      memory: 50Mi
    requests:
      cpu: 100m
      memory: 50Mi
global:
  acls:
    createReplicationToken: false
    manageSystemACLs: true
    replicationToken:
      secretKey: replicationToken
      secretName: consul-consul-federation
  consulAPITimeout: 5m
  datacenter: qa-ecommerce
  enableGatewayMetrics: true
  federation:
    enabled: true
    k8sAuthMethodHost: <REDACTED>
    primaryDatacenter: dc1
  metrics:
    agentMetricsRetentionTime: 1m
    baseURL: http://prometheus-kube-prometheus-prometheus.monitoring.svc.cluster.local:9090
    enableGatewayMetrics: true
    enabled: true
  tls:
    caCert:
      secretKey: caCert
      secretName: consul-consul-federation
    caKey:
      secretKey: caKey
      secretName: consul-consul-federation
    enabled: true
ingressGateways:
  defaults:
    service:
      annotations: |
        "service.beta.kubernetes.io/aws-load-balancer-name": "qa-ecommerce-consul-ingress-gate"
        "service.beta.kubernetes.io/aws-load-balancer-nlb-target-type": "ip"
        "service.beta.kubernetes.io/aws-load-balancer-scheme": "internal"
        "service.beta.kubernetes.io/aws-load-balancer-ssl-cert": ""
        "service.beta.kubernetes.io/aws-load-balancer-type": "nlb-ip"
      ports:
      - nodePort: null
        port: 443
      type: LoadBalancer
  enabled: false
  gateways:
  - name: ingress-gateway
  resources:
    limits:
      cpu: 400m
      memory: 150Mi
    requests:
      cpu: 160m
      memory: 100Mi
meshGateway:
  enabled: true
  replicas: 1
  resources:
    limits:
      cpu: 300m
      memory: 100Mi
    requests:
      cpu: 100m
      memory: 100Mi
  service:
    annotations: |
      "service.beta.kubernetes.io/aws-load-balancer-backend-protocol": "ssl"
      "service.beta.kubernetes.io/aws-load-balancer-internal": "true"
      "service.beta.kubernetes.io/aws-load-balancer-name": "qa-ecommerce-consul-mesh-gateway"
      "service.beta.kubernetes.io/aws-load-balancer-nlb-target-type": "ip"
      "service.beta.kubernetes.io/aws-load-balancer-scheme": "internal"
      "service.beta.kubernetes.io/aws-load-balancer-type": "nlb-ip"
server:
  extraConfig: |
    {
      "ui_config": {
        "enabled": true,
        "metrics_provider": "prometheus",
        "metrics_proxy": {
          "base_url": "http://prometheus-kube-prometheus-prometheus.monitoring.svc.cluster.local:9090"
        },
        "dashboard_url_templates": {
          "service": "<redacted>"
        }
      }
    }
  extraVolumes:
  - items:
    - key: serverConfigJSON
      path: config.json
    load: true
    name: consul-consul-federation
    type: secret
  nodeSelector: ""
  priorityClassName: heavy
  resources:
    limits:
      cpu: 500m
      memory: 700Mi
    requests:
      cpu: 250m
      memory: 400Mi
ui:
  metrics:
    baseURL: http://prometheus-kube-prometheus-prometheus.monitoring.svc.cluster.local:9090
    enabled: true
    provider: prometheus

nathancoleman commented 1 year ago

@manobi if you apply that policy to the role analogous to the one I screenshotted, does everything work for you standing up a Gateway in the secondary dc?

manobi commented 1 year ago

@nathancoleman From the UI it's not working; the browser crashes while loading the policy options. Maybe there are too many roles/policies and the same error happens during token bootstrap?

consul acl policy list -token=<redacted> | grep ID | wc -l
252
consul acl role update -id=16382188-2b3f-a628-a434-af342bf2f97e -policy-id=d1acd2a4-bffc-7ddf-63b5-14af3f338417 -token=<redacted>

After that the consul-api-gateway-controller seems to be running, but how can I make sure it will work the next time I upgrade?

nathancoleman commented 1 year ago

@manobi I'm hoping to understand why it failed in this case. Any chance you have the logs from the consul-api-gateway-controller pod's api-gateway-controller-acl-init container when this failed? It seems like the logic to bind the policy to the role here failed.

manobi commented 1 year ago

Even after the manual attachment, the api-gateway-controller-acl-init container failed twice before it started running with the following logs:

2022-10-03T20:14:33.393Z [INFO] Consul login complete
2022-10-03T20:14:33.393Z [INFO] Checking that the ACL token exists when reading it in the stale consistency mode
2022-10-03T20:14:33.394Z [ERROR] Unable to read ACL token; retrying: err="Unexpected response code: 403 (ACL not found)"
2022-10-03T20:14:33.497Z [ERROR] Unable to read ACL token; retrying: err="Unexpected response code: 403 (ACL not found)"
2022-10-03T20:14:33.598Z [ERROR] Unable to read ACL token; retrying: err="Unexpected response code: 403 (ACL not found)"
2022-10-03T20:14:33.701Z [ERROR] Unable to read ACL token; retrying: err="Unexpected response code: 403 (ACL not found)"
2022-10-03T20:14:33.803Z [ERROR] Unable to read ACL token; retrying: err="Unexpected response code: 403 (ACL not found)"
2022-10-03T20:14:33.905Z [ERROR] Unable to read ACL token; retrying: err="Unexpected response code: 403 (ACL not found)"
2022-10-03T20:14:34.008Z [ERROR] Unable to read ACL token; retrying: err="Unexpected response code: 403 (ACL not found)"
2022-10-03T20:14:34.110Z [ERROR] Unable to read ACL token; retrying: err="Unexpected response code: 403 (ACL not found)"
2022-10-03T20:14:34.214Z [ERROR] Unable to read ACL token; retrying: err="Unexpected response code: 403 (ACL not found)"
2022-10-03T20:14:34.316Z [ERROR] Unable to read ACL token; retrying: err="Unexpected response code: 403 (ACL not found)"
2022-10-03T20:14:34.418Z [ERROR] Unable to read ACL token; retrying: err="Unexpected response code: 403 (ACL not found)"
2022-10-03T20:14:34.520Z [ERROR] Unable to read ACL token; retrying: err="Unexpected response code: 403 (ACL not found)"
2022-10-03T20:14:34.623Z [ERROR] Unable to read ACL token; retrying: err="Unexpected response code: 403 (ACL not found)"
2022-10-03T20:14:34.725Z [ERROR] Unable to read ACL token; retrying: err="Unexpected response code: 403 (ACL not found)"
2022-10-03T20:14:34.827Z [ERROR] Unable to read ACL token; retrying: err="Unexpected response code: 403 (ACL not found)"

I've noticed a similar behaviour with the mesh-gateway and controller components as well. After your direction and the UI crashing, I'm starting to believe it is somehow skipping the binding rules list when there are many items to process.

It might not be related to api-gateway but rather be a consul-k8s bug.

nathancoleman commented 1 year ago

@manobi that would make sense as the possible cause. That scale is the main difference between my temporary setups and your own. I'll be traveling most of this week but will see if I can find out anything once I'm back.

mikemorris commented 1 year ago

The 403 (ACL not found) errors look like they could be a manifestation of https://github.com/hashicorp/consul-k8s/pull/887

@nathancoleman could we maybe implement the same workaround as consul-ecs did in https://github.com/hashicorp/consul-ecs/pull/79 until Consul adds "read your writes" support for an improved consul login UX (without the performance overhead of switching to consistent reads)?

manobi commented 1 year ago

@mikemorris Given that my api-gateway-controller is running and I have deployed the Gateway resource, when I apply the ReferenceGrant and HTTPRoute in my secondary dc the routing does not seem to be working.

Is there a way to debug whether the routing has actually been registered? Unlike Gateways in the primary dc, the Consul UI does not show connections between the gateway and the target service.
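
For context, the ReferenceGrant I'm applying looks roughly like this (an illustrative sketch - names are placeholders and the apiVersion depends on the Gateway API CRDs installed):

# Roughly: allow HTTPRoutes in the consul namespace to reference
# Services in the backend service's namespace.
apiVersion: gateway.networking.k8s.io/v1alpha2   # or v1beta1 with newer Gateway API CRDs
kind: ReferenceGrant
metadata:
  name: allow-consul-httproutes   # placeholder name
  namespace: my-service           # namespace of the backend Service
spec:
  from:
    - group: gateway.networking.k8s.io
      kind: HTTPRoute
      namespace: consul           # namespace where the HTTPRoute lives
  to:
    - group: ""
      kind: Service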

With log-level=trace enabled I saw the following status:

"conditions": [
  |     {
  |       "type": "Ready",
  |       "status": "True",
  |       "observedGeneration": 1,
  |       "lastTransitionTime": "2022-10-04T22:52:16Z",
  |       "reason": "Ready",
  |       "message": "Ready"
  |     },
  |     {
  |       "type": "Scheduled",
  |       "status": "True",
  |       "observedGeneration": 1,
  |       "lastTransitionTime": "2022-10-04T22:52:16Z",
  |       "reason": "Scheduled",
  |       "message": "Scheduled"
  |     },
  |     {
  |       "type": "InSync",
  |       "status": "False",
  |       "observedGeneration": 1,
  |       "lastTransitionTime": "2022-10-04T22:52:16Z",
  |       "reason": "SyncError",
  |       "message": "error adding ingress config entry: 1 error occurred:\n\t* Unexpected response code: 403 (rpc error making call: rpc error making call: Permission denied: token with AccessorID '0323cd06-e494-1d61-2cc9-3f8570954046' lacks permission 'mesh:write')\n\n"
  |     }
  |   ],

The HTTPRoute resource status seems to be OK, but it's not working:

status:
  parents:
    - conditions:
        - lastTransitionTime: '2022-10-04T23:04:20Z'
          message: Route accepted.
          observedGeneration: 1
          reason: Accepted
          status: 'True'
          type: Accepted
        - lastTransitionTime: '2022-10-04T23:04:20Z'
          message: ResolvedRefs
          observedGeneration: 1
          reason: ResolvedRefs
          status: 'True'
          type: ResolvedRefs

Upstreams in secondary DC (0):

[screenshot]

Upstreams in primary DC (1):

[screenshot]

consul-k8s proxy read <gateway-pod-name> -context=dc2:

==> Clusters (3)
==> Endpoints (3)   
==> Listeners (1)
==> Routes (1)
==> Secrets (2)

consul-k8s proxy read <gateway-pod-name> -context=dc1:

==> Clusters (6)
==> Endpoints (6)
==> Listeners (2)
==> Routes (1)
==> Secrets (2)

nathancoleman commented 1 year ago

Hi @manobi , were you able to get this working? Just to clarify, your Gateway, HTTPRoute, ReferenceGrant and backend Service that the route is targeting are all in the secondary datacenter, correct?

manobi commented 1 year ago

Hi @manobi , were you able to get this working? Just to clarify, your Gateway, HTTPRoute, ReferenceGrant and backend Service that the route is targeting are all in the secondary datacenter, correct?

Yes, they are all running in the secondary datacenter, but I have not been able to get this working. I'm still seeing the following in the api-gateway-controller:

error adding ingress config entry: 1 error occurred:\n\t* Unexpected response code: 403 (rpc error making call: rpc error making call: rpc error making call: Permission denied: token with AccessorID '0323cd06-e494-1d61-2cc9-3f8570954046' lacks permission 'mesh:write')\n\n

How can I force this "mesh:write" permission?

manobi commented 1 year ago

https://github.com/hashicorp/consul-api-gateway/blob/8f9040100434a648713a55f30950c182e29f5c22/internal/adapters/consul/sync.go#L354

The gateway deployment is running in the secondary datacenter, but there is no service-defaults or ingress-gateway config entry registered. What policy should the api-gateway-controller use to be able to register those config entries?

nathancoleman commented 1 year ago

@manobi I'd expect it to be using api-gateway-controller-policy-<datacenter> which has the higher-level operator = "write" permission. You can see what I'm expecting in the screenshot a ways up https://github.com/hashicorp/consul-api-gateway/issues/300#issuecomment-1265925913.

It makes sense that the config entries aren't registered because the controller isn't able to create them in your setup. I'm not yet sure why this is, and I haven't been able to reproduce it myself.

Just to be certain, to replicate your setup, I need consul-k8s v0.48.0 in my primary datacenter and consul-k8s v0.49.0 in my secondary datacenter. Is that accurate? Are you using consul-api-gateway v0.5-dev in both datacenters?

manobi commented 1 year ago

@nathancoleman The only way I've managed to make it work was by attaching the controller-policy to the api-gateway-controller token.

My current setup is the following one:

Primary datacenter:

Secondary datacenter:

nathancoleman commented 1 year ago

@manobi here's a writeup of the whole process I went through to replicate the issue, but I'm still seeing everything work. I figure at least this will show what the Kubernetes Deployment and Consul roles+policies for the consul-api-gateway-controller should look like. Can you take a look and let me know if anything I'm doing doesn't match your setup or if you can identify the diff between my resulting config and yours? Feel free to comment right on the gist if you like.

https://gist.github.com/nathancoleman/076343780c3e0b4c03fb91f9d4f84616

manobi commented 1 year ago

@nathancoleman thank you, I'll try to reproduce your steps. The manual changes I have made allowed me to test other things. Do you think something changed in 0.5 that would break URLRewrite?

The service-router is not picking up the URLRewrite filter:

apiVersion: gateway.networking.k8s.io/v1beta1
kind: HTTPRoute
metadata:
  name: my-service
  namespace: consul
spec:
  parentRefs:
  - name: digital-api-qa
  rules:
    - matches:
      - path:
          type: PathPrefix
          value: "/my-service/v1"
      backendRefs:
        - kind: Service
          name: my-service
          namespace: my-service
          port: 80
          weight: 100
      filters:
      - type: URLRewrite
        urlRewrite:
          path:
            type: ReplacePrefixMatch
            replacePrefixMatch: "/api/v1"

Becomes:

{
    "Kind": "service-router",
    "Name": "digital-api-qa-735653bb",
    "Routes": [
        {
            "Match": {
                "HTTP": {
                    "PathPrefix": "/my-service/v1"
                }
            },
            "Destination": {
                "Service": "my-service",
                "RequestHeaders": {}
            }
        }
    ],
    "Meta": {
        "consul-api-gateway/k8s/Gateway.Name": "digital-api-qa",
        "consul-api-gateway/k8s/Gateway.Namespace": "consul",
        "external-source": "consul-api-gateway"
    },
    "CreateIndex": 242705,
    "ModifyIndex": 242705
}

nathancoleman commented 1 year ago

@manobi thanks for calling that out. Fixed in https://github.com/hashicorp/consul-api-gateway/pull/414

nathancoleman commented 1 year ago

@manobi I'm asking around to see if anyone has encountered issues like the role bindings failing to apply at a scale of hundreds of roles/policies.

My understanding is that the missing role bindings are the only issue you're seeing at this point (given the fix in #414) and that everything works as expected when you manually apply those bindings. Is that accurate?

manobi commented 1 year ago

@nathancoleman Accurate. The ACL not found error is not restricted to the API gateway; I can see it in other components that eventually reconcile.

It might be the problem mentioned by @mikemorris; if I have to bind the role manually, it's not a huge problem.

I was more worried when I had no idea what was going on. Thank you.

manobi commented 1 year ago

@nathancoleman Will the https://github.com/hashicorp/consul-api-gateway/pull/414 fix be automatically published to Docker Hub, or is it a manual action? I'm looking forward to getting my hands on it and may create another issue in consul-k8s to investigate the ACL race condition, as it looks like there are no problems in consul-api-gateway itself.

Seems unfair to hold the v0.5 release if there are no other issues.

nathancoleman commented 1 year ago

@manobi you'll see it published to Docker Hub in a few minutes after I merge https://github.com/hashicorp/consul-api-gateway/pull/416. The merge of #414 itself didn't publish because our tooling identified the CVE referenced in #416.

Edit: You can now see an updated set of tags out on https://hub.docker.com/r/hashicorppreview/consul-api-gateway/tags

manobi commented 1 year ago

Just to confirm that I've got URLRewrite working again with hashicorppreview/consul-api-gateway:0.5-dev-55da4a56cda79d0e97a7f2d40f503923ff57ba62.

Thank you @nathancoleman

nathancoleman commented 1 year ago

@codex70 @manobi I believe this particular issue can be closed now but wanted to run it by you first. Thoughts?

The upcoming v0.5.0 release of Consul API Gateway will allow you to run the API gateway controller and create Gateways that route to services within the same datacenter whether that datacenter is a primary or secondary datacenter.

manobi commented 1 year ago

We should close it. Thanks

codex70 commented 1 year ago

Just to confirm, I have been able to test this and it is now working, following on from the fix for https://github.com/hashicorp/consul-k8s/issues/1344.