hashicorp / consul

Consul is a distributed, highly available, and data center aware solution to connect and configure applications across dynamic, distributed infrastructure.
https://www.consul.io

Consul doesn't perform Health Checks regularly for an external service #7877

Open notBscalE opened 4 years ago

notBscalE commented 4 years ago

Overview of the Issue

As an extra security measure (beyond the documented best practice), we decided that Vault would reside on a separate Consul cluster, as opposed to deploying it on the same cluster or merely separating it into a different Consul DC. We configured access to Vault through our production (non-Vault) cluster as an External Service, using HTTP health checks against the Vault cluster. Unfortunately, it doesn't work as expected: we're experiencing delayed and inconsistent health check information on our servers. To emphasize: when we query Vault's health API manually, all Vault nodes respond as expected, yet in Consul the health checks claim that one of our servers is down. When the Raft leader changes, the supposed timeout error moves to another Vault node. In both cases, the results Consul shows come from health checks performed months ago.
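
For concreteness, this is roughly how we compare the two views (a sketch only; the per-node hostname and the `vault` service name are placeholders for our actual setup):

```sh
# Ask a Vault node directly: an active, unsealed node returns HTTP 200,
# while standbys and sealed nodes return non-200 status codes.
curl -sk https://vault-node-1.mydomain/v1/sys/health

# Ask Consul what it currently believes about the same service's checks
# (the Output and ModifyIndex fields show the last recorded result).
curl -s http://consul-server.mydomain:8500/v1/health/checks/vault
```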

As we are pretty stuck in the current situation and have found no information about it, we would like to know whether this issue is fixable, whether it is considered a bug and still relevant, or whether it has already been fixed.

Reproduction Steps

Steps to reproduce this issue:

  1. Create a cluster with 7 server nodes spread across 3 physical data centers, all operating as a single Consul DC. One of the data centers acts as a witness, with ports open only to the Consul servers.
  2. Register the Vault nodes as external nodes, with an HTTP GET health check over HTTPS against /v1/sys/health (see the registration sketch after this list).
  3. Configure an interval of 5 seconds and a timeout of 1 second.
  4. Keep it running until the nodes show inconsistencies.
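
For step 2, the registration looks roughly like the following (a sketch, not the exact payload; the node name and addresses are placeholders, and the `external-node`/`external-probe` node meta entries are the conventional way to mark a node as external):

```sh
# Register an external Vault node, its service, and the HTTP check in one
# catalog call. Node name and addresses are placeholders; the check
# Definition mirrors the interval/timeout from steps 2-3.
curl -s -X PUT http://127.0.0.1:8500/v1/catalog/register -d '{
  "Node": "vault-node-1",
  "Address": "vault-node-1.mydomain",
  "NodeMeta": {
    "external-node": "true",
    "external-probe": "true"
  },
  "Service": {
    "ID": "vault",
    "Service": "vault"
  },
  "Check": {
    "Node": "vault-node-1",
    "CheckID": "vault-health",
    "Name": "Vault HTTP health",
    "ServiceID": "vault",
    "Status": "passing",
    "Definition": {
      "HTTP": "https://vaultcluster.mydomain/v1/sys/health",
      "Interval": "5s",
      "Timeout": "1s"
    }
  }
}'
```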

Consul info

Server info

```
build:
  version = 1.6.1
```

Operating system

RHEL 7.4, Linux 3.10.0-693.11.6.el7.x86_64

Log Fragments

Unfortunately we didn't find any relevant log segment for the situation here.

pierresouchay commented 4 years ago

We did something a bit similar, with a small change:

* We have a primary Consul cluster for discovery/KV.
* Vault has its own Consul cluster.
* On Vault's machines, there is an agent joined to the primary Consul cluster, and we register the Vault checks locally on that agent (a rough sketch follows).
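
The local registration on the Vault machine looks roughly like this (a sketch; the service name, port, and check target are illustrative, not our exact values):

```sh
# On each Vault machine: register the service and its check with the local
# agent of the primary cluster, so that agent runs the check itself and
# syncs the result to the servers via anti-entropy.
curl -s -X PUT http://127.0.0.1:8500/v1/agent/service/register -d '{
  "Name": "vault",
  "Port": 8200,
  "Check": {
    "HTTP": "https://127.0.0.1:8200/v1/sys/health",
    "Interval": "5s",
    "Timeout": "1s"
  }
}'
```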

Your own issue might be:

If you are in that case, performing a setup similar to ours would solve your issue.

notBscalE commented 4 years ago

Hello, sorry for the late answer, I had other stuff to close.

We still want to keep it as an External Node and not connect the Vault servers themselves to our discovery cluster, in order to keep them as separated as possible. Since it's an external service, I don't believe this is an ACL-related issue, as Consul itself is performing the checks.

The check definitions from the catalog are as follows:

```json
"Definitions": {
  "Interval": "5s",
  "Timeout": "1s",
  "HTTP": "https://vaultcluster.mydomain/v1/sys/health"
}
```

We have deployed ESM on the Consul servers to perform the remote checks, and we marked the nodes as external when we registered them in the catalog.
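
One thing we still plan to verify on our side (a debugging sketch, assuming ESM was left at its default `consul-esm` service name):

```sh
# If no healthy consul-esm instance is registered, externally registered
# checks are never executed and their status simply goes stale.
curl -s 'http://127.0.0.1:8500/v1/health/service/consul-esm?passing'
```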

If all else fails, I will probably deploy a configuration similar to yours, although I hope it will be safe enough security-wise for our requirements.