hashicorp / consul

Consul is a distributed, highly available, and data center aware solution to connect and configure applications across dynamic, distributed infrastructure.
https://www.consul.io

Config Entry replication of ingress-gateway entries fails validation in secondary datacenter #9319

crhino closed this issue 2 years ago

crhino commented 3 years ago

Overview of the Issue

Config Entry replication fails to apply properly in the secondary datacenter, blocking replication from finishing. This is caused by ingress-gateway config validation depending on an existing proxy-defaults entry.

I imagine this is an issue for any config entry that depends on another entry to set properties such as a service's protocol.

Reproduction Steps

Steps to reproduce this issue:

  1. Create 2 datacenters
  2. Create an ingress-gateway config entry with an HTTP listener for a defined service, and a proxy-defaults entry that sets every service's protocol to http
  3. Watch the secondary DC to see the replication of config entries

This does not reproduce all of the time; I think that's because in my setup the secondary DC sometimes replicates the proxy-defaults entry before the ingress-gateway entry is added.
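For illustration, roughly the commands involved, assuming two federated datacenters named dc1 and dc2 and the ingress entry below saved to a file (the filename here is made up):

    # In the primary datacenter (dc1): write the ingress-gateway entry.
    consul config write ingress-gateway.json

    # In the secondary datacenter (dc2): check what has replicated. When
    # replication is stuck, these reads keep failing and the dc2 leader
    # logs the validation error shown in the log fragments below.
    consul config read -kind ingress-gateway -name ingress1 -datacenter dc2
    consul config read -kind proxy-defaults -name global -datacenter dc2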

My Config Entries:

Created via the consul config write CLI command:

    {
      "kind": "ingress-gateway",
      "name": "ingress1",
      "listeners": [
        {
          "protocol": "http",
          "port": 443,
          "services": [
            {
              "name": "*"
            }
          ]
        },
        {
          "protocol": "http",
          "port": 444,
          "services": [
            {
              "name": "virtual"
            }
          ]
        }
      ]
    }

Set in the config of the primary servers:

  "config_entries": {
    "bootstrap": [
      {
        "kind": "proxy-defaults",
        "name": "global",
        "config": {
          "protocol": "http"
        }
      },
      {
        "kind": "service-router",
        "name": "counting",
        "routes": [
          {
            "destination": {
              "NumRetries": 3,
              "RetryOnConnectFailure": true
            }
          }
        ]
      }
    ]
  },

Log Fragments

    2020-12-03T16:33:52.006Z [DEBUG] agent.server.replication.config_entry: finished fetching config entries: amount=3
    2020-12-03T16:33:52.007Z [DEBUG] agent.server.replication.config_entry: Config Entry replication: local=0 remote=3
    2020-12-03T16:33:52.007Z [DEBUG] agent.server.replication.config_entry: Config Entry replication: deletions=0 updates=3
    2020-12-03T16:33:52.007Z [DEBUG] agent.server.replication.config_entry: Updating local config entries: updates=3
    2020-12-03T16:33:52.008Z [WARN]  agent.server.replication.config_entry: replication error (will retry if still leader): error="failed to update local config entries: Failed to apply config upsert: service "virtual" has protocol "tcp", which does not match defined listener protocol "http""
crhino commented 3 years ago

Note that I needed to patch 1.9.0 with https://github.com/hashicorp/consul/pull/9320 in order to actually see the error.

mikemorris commented 3 years ago

This sounds suspiciously similar to the replication logic issue in https://github.com/hashicorp/consul/issues/9271#issuecomment-735971376

woz5999 commented 3 years ago

I'm experiencing this same issue in 1.9.1 despite #9271 being closed.

crhino commented 3 years ago

Unfortunately #9271 does not address this specific issue, although they are similar.

woz5999 commented 3 years ago

This state also occurs when setting the protocol via service-defaults instead: same race condition, same failure.
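For example, a minimal service-defaults entry like this (using the "virtual" service from the logs above) triggers it just the same:

    {
      "kind": "service-defaults",
      "name": "virtual",
      "protocol": "http"
    }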

woz5999 commented 3 years ago

FWIW, a workaround is to delete the affected ingress-gateway configs from the primary datacenter, allow the other required configs to replicate, and then recreate the deleted ingress-gateway configs.

It's lame, disruptive, and fragile, but it'll at least unblock replication. Otherwise, using non-tcp ingress-gateway listener protocols with federated clusters is a gamble at best and definitely not suitable for production use until this is fixed.
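Roughly, the sequence looks like this, using the entry names from the issue description (the filename is made up):

    # 1. Delete the blocking entry in the primary datacenter.
    consul config delete -kind ingress-gateway -name ingress1

    # 2. Wait until the remaining entries (e.g. proxy-defaults) have
    #    replicated to the secondary datacenter.
    consul config read -kind proxy-defaults -name global -datacenter dc2

    # 3. Recreate the deleted entry in the primary datacenter.
    consul config write ingress-gateway.json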

woz5999 commented 3 years ago

Seems like this might be the same issue as #9196.

dsolsona commented 3 years ago

We are also suffering from this issue in our federated Consul clusters, and I can confirm @woz5999's workaround works, but it is definitely something you don't want to do in production.

rrijkse commented 3 years ago

Just wanted to drop a note here: the workaround specified above is quite hard to implement when it affects a lot of other services. An alternative workaround is to temporarily create a config entry of kind service-defaults for the virtual service, with the protocol set to whichever protocol the listener expects. This caused replication to resume for me and the proxy-defaults entry to take effect.
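Concretely, something like this, reusing the service-defaults entry sketched a few comments up (the filename is made up):

    # Temporarily pin the protocol of the service the listener points at.
    consul config write virtual-service-defaults.json

    # Once replication has resumed and proxy-defaults has taken effect in
    # the secondary DC, the temporary entry can be deleted again.
    consul config delete -kind service-defaults -name virtual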

chrisboulton commented 2 years ago

If you're still experiencing this like we are, it's due to the sort algorithm used when applying config entries during replication. The current implementation essentially does an alphabetical sort to determine the order, and because proxy-defaults sorts after ingress-gateway, the order comes out wrong: we want proxy-defaults applied before ingress-gateway entries (and probably before any other kind of config entry, too). It happens to work for service-defaults ahead of service-router/service-resolver because, well... the alphabet.
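To illustrate, here's a sketch of a dependency-aware ordering; this is not the actual Consul code, and the priority values are my own:

    // Sketch only: apply config entry kinds that others depend on
    // (proxy-defaults, service-defaults) before everything else.
    package main

    import (
        "fmt"
        "sort"
    )

    type entry struct{ Kind, Name string }

    // kindPriority: lower values are applied first. The concrete values
    // here are assumptions for illustration, not Consul's implementation.
    func kindPriority(kind string) int {
        switch kind {
        case "proxy-defaults":
            return 0
        case "service-defaults":
            return 1
        default:
            return 2
        }
    }

    func main() {
        entries := []entry{
            {"ingress-gateway", "ingress1"},
            {"service-router", "counting"},
            {"proxy-defaults", "global"},
        }
        // Sort by dependency priority first, then fall back to the old
        // alphabetical order for kinds with equal priority.
        sort.Slice(entries, func(i, j int) bool {
            pi, pj := kindPriority(entries[i].Kind), kindPriority(entries[j].Kind)
            if pi != pj {
                return pi < pj
            }
            if entries[i].Kind != entries[j].Kind {
                return entries[i].Kind < entries[j].Kind
            }
            return entries[i].Name < entries[j].Name
        })
        fmt.Println(entries) // proxy-defaults first, then alphabetical
    }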

A quick patch which sorts proxy-defaults first is here: https://github.com/bigcommerce/consul/commit/85b4fcee4b72df36d75ce32cff019612fd4ff224. This works for us: once it's installed on the leader in a secondary DC, you should be good to go.

A better fix would be to order configuration entries based on their dependencies, or maybe to relax the validation when replicated entries are being applied.

rboyer commented 2 years ago

A partial fix for most scenarios should go out in the next patch releases of Consul 1.11, 1.10, and 1.9 due to https://github.com/hashicorp/consul/pull/12307