Open vector623 opened 4 years ago
Hi @vector623, thanks for opening this issue. It seems that the health-check may not be the problem, as the issue still appears when I try without it.
I will investigate and let you know what I find.
Hi @vector623, we've looked into this, and I think you are trying to register a service on a node where a Consul agent is running (an internal service). The consul_service resource was created to register external services: it adds the service to the Consul catalog but not to the local catalog of the agent. When the agent performs its anti-entropy sync, it finds a service in the catalog that it knows nothing about and removes it:
Mar 25 18:57:23 srv-pro-schd-05 consul[32576]: 2020-03-25T18:57:23.181Z [DEBUG] agent: Node info in sync
Mar 25 18:57:23 srv-pro-schd-05 consul[32576]: 2020-03-25T18:57:23.183Z [INFO] agent: Deregistered service: service=redis
Mar 25 18:57:23 srv-pro-schd-05 consul[32576]: 2020-03-25T18:57:23.184Z [INFO] agent: Deregistered check: check=service:redis1
The documentation of the provider (https://www.terraform.io/docs/providers/consul/r/service.html) mentions this briefly:
If the Consul agent is running on the node where this service is registered, it is not recommended to use this resource.
This is not related to the health-check and you should see the same behaviour when registering the service without the health-checks.
You mentioned that the same service created using cURL works. I think you are creating it using the /v1/agent/service/register endpoint and not the /v1/catalog/register endpoint that consul_service is using. Could you confirm that?
The consul_agent_service resource can be used to create an internal service, but it was marked as deprecated and does not support health-checks at the moment. I'm wondering if we should roll back this deprecation.
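For context, a minimal sketch of what registering an internal service with the deprecated consul_agent_service resource looks like. It assumes the provider is pointed at the local agent of the node running the service; the resource label, address, port, and tag below are illustrative values, not taken from the original configuration. There is no check block, since the resource does not support health-checks.

```hcl
# Illustrative only: registers "redis" with the local agent the provider talks to.
# consul_agent_service is deprecated and has no check block.
resource "consul_agent_service" "redis" {
  name    = "redis"
  address = "127.0.0.1" # assumed address of the local service
  port    = 6379        # assumed port
  tags    = ["cache"]   # hypothetical tag
}
```

Because the registration goes through the agent, the service ends up in the agent's local state and is not removed by the anti-entropy sync.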
@remilapeyre wouldn't it be possible to combine the two and abstract that complexity away from users? Healthchecks ftw!!
I still cannot get TCP health checks working, let alone HTTP health checks. Let's take two services as an example: Prometheus, which has to be checked with a TCP check on port 9090, and Grafana, which can be checked with a GET /api/health request on port 3000.
Tested on Consul v1.15.3
I have Prometheus running on IP 192.168.55.120:
$ curl -i 192.168.55.120:9090
HTTP/1.1 302 Found
Content-Type: text/html; charset=utf-8
Location: /graph
Date: Tue, 11 Jul 2023 18:15:55 GMT
Content-Length: 29
<a href="/graph">Found</a>.
Since Prometheus answers HTTP requests, it is clearly accepting TCP connections on port 9090.
I have Grafana running on IP 192.168.55.121:
$ curl -i 192.168.55.121:3000/api/health
HTTP/1.1 200 OK
Cache-Control: no-store
Content-Type: application/json; charset=UTF-8
X-Content-Type-Options: nosniff
X-Frame-Options: deny
X-Xss-Protection: 1; mode=block
Date: Tue, 11 Jul 2023 18:17:30 GMT
Content-Length: 71
{
  "commit": "5a30620b85",
  "database": "ok",
  "version": "10.0.1"
}
Grafana is working as well.
Now let's create the necessary service health check resources.
Configuring Health checks for Prometheus
resource "consul_node" "node" {
  count      = 1
  datacenter = "dc1"
  address    = "192.168.55.120"
  name       = "prometheus01"
}

resource "consul_service" "svc" {
  count      = 1
  name       = "prometheus01"
  node       = "prometheus01"
  address    = "192.168.55.120"
  datacenter = "dc1"
  port       = 9090

  check {
    check_id                          = "service:prometheus01"
    name                              = "Prometheus Health Check"
    notes                             = "Checks for a TCP connection on port 9090"
    tcp                               = "192.168.55.120:9090"
    interval                          = "10s"
    timeout                           = "2s"
    deregister_critical_service_after = "60s"
  }
}
Configuring Health checks for Grafana
resource "consul_node" "node" {
  datacenter = "dc1"
  address    = "192.168.55.121"
  name       = "grafana01"
}

resource "consul_service" "svc" {
  name       = "grafana01"
  node       = "grafana01"
  address    = "192.168.55.121"
  datacenter = "dc1"
  port       = 3000

  check {
    check_id                          = "service:grafana01"
    name                              = "Grafana Health Check"
    http                              = "/api/health"
    notes                             = "Checks for a GET /api/health request on port 3000"
    tls_skip_verify                   = true
    method                            = "GET"
    interval                          = "10s"
    timeout                           = "2s"
    deregister_critical_service_after = "30s"

    header {
      name  = "Accept"
      value = ["application/json"]
    }
  }
}
With what has been demonstrated above, I have three questions:
Relevant issues: #124
Hi @mbrav. Not sure if I'm doing archaeology here, but I just struggled through this myself. This looks like a non-issue to me, although it didn't at first. It's a non-issue because, although the service and its health check are declared, there is no external service monitor to actually perform the health checks.
I run consul_esm on my Nomad cluster to perform the health checks.
So, registered services start off critical, but are updated to healthy as they are discovered by consul-esm and their health checks are performed.
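A minimal sketch of how the Grafana registration above could be adapted so that consul-esm picks it up, assuming consul-esm is already running against the cluster. As far as I understand, consul-esm only watches nodes whose metadata contains external-node = "true", and external-probe = "true" additionally enables a node-level ping. The resource labels are illustrative, and the full URL in http is an assumption about what the monitor needs in order to reach the service.

```hcl
# External node: the meta keys mark it for monitoring by consul-esm.
resource "consul_node" "grafana" {
  name    = "grafana01"
  address = "192.168.55.121"

  meta = {
    "external-node"  = "true"
    "external-probe" = "true"
  }
}

# External service attached to that node. The check is only stored in the
# catalog; consul-esm is the component that actually executes it.
resource "consul_service" "grafana" {
  name    = "grafana01"
  node    = consul_node.grafana.name
  address = consul_node.grafana.address
  port    = 3000

  check {
    check_id = "service:grafana01"
    name     = "Grafana Health Check"
    http     = "http://192.168.55.121:3000/api/health" # assumed full URL
    method   = "GET"
    interval = "10s"
    timeout  = "2s"
  }
}
```

Referencing consul_node.grafana.name instead of repeating the literal node name lets Terraform create the node before the service.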
Terraform Version
Affected Resource(s)
Terraform Configuration Files
Debug Output
https://gist.github.com/vector623/d193f3292790bf7f1119c57bafd4e561
Expected Behavior
Health check should execute successfully. If it fails, it should not deregister for 90 minutes.
Actual Behavior
Health check fails and deregisters within a minute.
Steps to Reproduce
Please list the steps required to reproduce the issue, for example:
terraform init
terraform apply -auto-approve
Important Factoids
References