Closed crhino closed 2 years ago
Note that I needed to patch 1.9.0 with https://github.com/hashicorp/consul/pull/9320 in order to actually see the error.
This sounds like a suspiciously similar replication logic issue to https://github.com/hashicorp/consul/issues/9271#issuecomment-735971376
i'm experiencing this same issue in 1.9.1 despite #9271 being closed.
Unfortunately #9271 does not address this specific issue, although they are similar.
this state also occurs if setting the protocol via service-defaults. same race condition and failure.
fwiw a workaround is to delete the affected ingress-gateway configs from the primary datacenter, allow the other required configs to replicate, and then recreate the deleted ingress-gateway config.
it's lame, disruptive, and fragile, but it'll at least unblock replication. otherwise, using non-tcp protocol ingress-gateway listeners with federated clusters is a gamble at best and definitely not suitable for production use until this is fixed.
seems like this might be the same issue as #9196
We are also suffering from this issue in our Consul federated clusters and I can confirm @woz5999's workaround works, but this is definitely something you don't want to do in production.
Just wanted to drop a note here: the workaround specified above is quite hard to implement when it affects a lot of other services. An alternative workaround is to temporarily create a config entry of kind service-defaults for the virtual service with the protocol set to whichever it is expecting. This caused replication to resume for me and the proxy-defaults to take effect.
If you're still experiencing this like we are, it's due to the sort algorithm used when applying config entries during replication. The current implementation pretty much does an alpha sort to determine the order, and because proxy-defaults > ingress-gateway, the sort order is out: we want proxy-defaults before ingress-gateways (and probably any other type of config entry too). This works for service-defaults and service-router/service-resolver because well.. the alphabet.
A quick patch which sorts proxy-defaults first is here: https://github.com/bigcommerce/consul/commit/85b4fcee4b72df36d75ce32cff019612fd4ff224. This works for us - once it's installed on a leader in a secondary DC you should be good to go.
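To illustrate the ordering idea (this is a minimal sketch, not the code from the linked commit or from Consul itself), the snippet below sorts a list of entries so that proxy-defaults always comes first, and then falls back to the alphabetical kind/name order. The `entry` type and the `kindRank` and `sortForApply` functions are hypothetical names made up for this example.

```go
package main

import (
	"fmt"
	"sort"
)

// entry is a hypothetical stand-in for a replicated config entry;
// it is not Consul's own type.
type entry struct {
	Kind string
	Name string
}

// kindRank gives proxy-defaults the lowest rank so it sorts first.
// Every other kind shares one rank and falls back to alphabetical order.
func kindRank(kind string) int {
	if kind == "proxy-defaults" {
		return 0
	}
	return 1
}

// sortForApply orders entries with proxy-defaults first, then by kind,
// then by name.
func sortForApply(entries []entry) {
	sort.SliceStable(entries, func(i, j int) bool {
		a, b := entries[i], entries[j]
		if ra, rb := kindRank(a.Kind), kindRank(b.Kind); ra != rb {
			return ra < rb
		}
		if a.Kind != b.Kind {
			return a.Kind < b.Kind
		}
		return a.Name < b.Name
	})
}

func main() {
	entries := []entry{
		{Kind: "ingress-gateway", Name: "ingress"},
		{Kind: "service-defaults", Name: "web"},
		{Kind: "proxy-defaults", Name: "global"},
	}
	sortForApply(entries)
	for _, e := range entries {
		fmt.Println(e.Kind, e.Name)
	}
}
```

With the three sample entries in `main`, proxy-defaults is printed first, followed by ingress-gateway and then service-defaults, whereas a plain alphabetical sort would put ingress-gateway ahead of proxy-defaults.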
A better fix would be to sort configuration entries properly based on their dependencies, or maybe relax the validation when replicated entries are being applied.
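As a rough sketch of what dependency-based ordering could look like (an illustration only, not how Consul does it), the snippet below runs a topological sort over config entry kinds. The `kindDeps` map is an assumption based on this thread: ingress-gateway validation needs the protocol that proxy-defaults or service-defaults would set. The function name `applyOrder` is hypothetical too.

```go
package main

import (
	"fmt"
	"sort"
)

// kindDeps is an assumed dependency map for illustration only: each kind
// lists the kinds that should be applied before it.
var kindDeps = map[string][]string{
	"proxy-defaults":   nil,
	"service-defaults": nil,
	"ingress-gateway":  {"proxy-defaults", "service-defaults"},
}

// applyOrder returns the kinds in an order where every kind comes after its
// dependencies (Kahn's algorithm). Ties are broken alphabetically so the
// result is deterministic.
func applyOrder(deps map[string][]string) ([]string, error) {
	indegree := make(map[string]int, len(deps))
	dependents := make(map[string][]string)
	for kind, ds := range deps {
		indegree[kind] += len(ds)
		for _, d := range ds {
			indegree[d] += 0 // track d even if it has no entry of its own
			dependents[d] = append(dependents[d], kind)
		}
	}

	var ready []string
	for kind, n := range indegree {
		if n == 0 {
			ready = append(ready, kind)
		}
	}

	var order []string
	for len(ready) > 0 {
		sort.Strings(ready)
		kind := ready[0]
		ready = ready[1:]
		order = append(order, kind)
		for _, next := range dependents[kind] {
			indegree[next]--
			if indegree[next] == 0 {
				ready = append(ready, next)
			}
		}
	}

	if len(order) != len(indegree) {
		return nil, fmt.Errorf("dependency cycle among config entry kinds")
	}
	return order, nil
}

func main() {
	order, err := applyOrder(kindDeps)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(order)
}
```

Compared to hard-coding one kind to sort first, a dependency map like this would cover other dependent kinds as well, at the cost of having to keep the map in sync with whatever the validation actually checks.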
A partial fix for most scenarios should go out in the next patch release of consul 1.11, 1.10, and 1.9 due to: https://github.com/hashicorp/consul/pull/12307
Overview of the Issue
Config Entry replication fails to apply properly in the secondary datacenter, blocking replication from finishing. This is caused by ingress-gateway config validation being dependent on an existing proxy-defaults entry.
I could imagine that this is an issue with any config entry that is dependent on another entry for setting properties like the protocol of the service.
Reproduction Steps
Steps to reproduce this issue, eg:
http listener for a service defined, and a proxy-defaults entry that sets everything to http protocol
This does not reproduce all of the time, I think because sometimes the secondary DC will replicate the proxy-defaults entry before the ingress-gateway entry is added in my setup.
My Config Entries:
Created via consul config write CLI command:
Set in the config of the primary servers:
Log Fragments