hashicorp / consul

Consul is a distributed, highly available, and data center aware solution to connect and configure applications across dynamic, distributed infrastructure.
https://www.consul.io
Other
28.3k stars 4.42k forks source link

Allow Prepared Queries to return Critical Nodes #4983

Open jjathman opened 5 years ago

jjathman commented 5 years ago

Feature Description

Allow for a new parameter of prepared query which will allow the query to return failing nodes.

Use Case(s)

We are looking at changing our usage of Consul to move from the /health/service endpoint to using prepared queries. Occasionally, we would like to use a service in a critical status as the final fallback if there are no available services in a passing or warning state. Right now the service endpoint will return those nodes, we would like to be able to replicate that behavior with a prepared query.

Ideally, the failing nodes would only be returned as the last ones.

banks commented 5 years ago

Hey @jjathman, thanks for the details.

Can you describe whether you would consume this via DNS or HTTP API (or both). If we add it we'd need to make it work for both which leads to fun dealing with DNS response truncation correctly when you have "order" to take into account. For example, we con't control which "order" DNS clients choose to consume results in. SRV records can include a weight so we can set that lower than other nodes but just ensuring they are last in the response doesn't actually guarantee that clients are less likely to use them in DNS A responses in general.

jjathman commented 5 years ago

Hi @banks,

We would use this via the HTTP API only. Would it be possible to make this an HTTP only option? I can appreciate the difficulties in having this option via DNS. If no one has asked for the feature with DNS maybe it isn't necessary?

We have internally debated using services in a Error status as a final option, but it seems like it is the best thing to try if there are no other options available.

banks commented 5 years ago

@jjathman if it's only for working around unexpected incidents then there is already a way to do something similar with prepared queries, with manual intervention.

We added field to prepared queries for ignoring certain checks . For example some people have found that if they have some bug or incident where one of their service health checks is returning an error, they can just update the prepared query to ignore that check temporarily so instances are still served until the check is fixed.

Would that be sufficient for your case? It's different than needing all services via prepared query for some other reason like trying to sync the whole set with a loadbalancer that does it's own health probes granted, but interested to know if you have specific needs that fall outside of that?

banks commented 5 years ago

Your request is actually more-or-less the same thing as IgnoreCheckIDs: ["*"] (if that worked). We may want to make it more explicit like IncludeCritical: true though given that we have to add something to make it work.

I guess the real question is: do you really want everything all the time, or do you only want critical if there are no (or fewer than some threshold) healthy instances?

I could imagine for example adding a "min_healthy_threshold" which you might set to say 40% and only if there are fewer than 40% f the registered instances in a healthy state would it start returning critical. This is similar to how other load balances implement safety valves on circuit breakers.

jjathman commented 5 years ago

Regarding your first question, yes I think your idea of using the IgnoreCheckIDs would work for us. I saw that field but it didn't occur to me to use that for this purpose. Thank you for the tip.

Trying to make this work I expected making a prepared query like this would do what I expected, but it doesn't seem to be returning services in a critical status.

{
  "Name": "",
  "Template": {
    "Type": "name_prefix_match"
  },
  "Service": {
    "Service": "${name.full}",
    "IgnoreCheckIDs": ["service:${name.full}"]
  }
}

I thought this would ignore the health check of the application. I must be doing something wrong.

For your second question, in this specific use case I think we would only want/use critical services if they were the only option. However, I think there is a more general use case where I would want to see everything and deal with the results at the client level. When I originally started looking at prepared queries I assumed it would be possible to retrieve any service instances that I can see via the /health endpoint, but maybe that wasn't the intention of using a prepared query.

I would be in favor of your IncludeCritical: true idea. That seems the most similar to the OnlyPassing option that currently exists.

jjathman commented 5 years ago

Nevermind my question about why it wasn't working for me, we are on too old of a version for the ignore checks. I'll upgrade us now and test it again.

stale[bot] commented 4 years ago

Hey there, We wanted to check in on this request since it has been inactive for at least 60 days. If you think this is still an important issue in the latest version of Consul or its documentation please reply with a comment here which will cause it to stay open for investigation. If there is still no activity on this issue for 30 more days, we will go ahead and close it.

Feel free to check out the community forum as well! Thank you!

clkamp commented 4 years ago

This feature would be extremely nice to have, because vault creates checks with unique names on each node, so IgnoreCheckIDs does not work for me, because it only matches exactly

cbroglie commented 4 years ago

I could imagine for example adding a "min_healthy_threshold" which you might set to say 40% and only if there are fewer than 40% f the registered instances in a healthy state would it start returning critical. This is similar to how other load balances implement safety valves on circuit breakers.

This is exactly what I'm looking for, essentially an automated way to set IgnoreCheckIDs: ["*"] (again, pretending that works) in case of widespread failure.

urusha commented 1 year ago

It seems that IgnoreCheckIDs: ["*"] currently doesn't work, and we have no variable to dynamically populate checkids so there is no way to return all services including critical in DNS. Actually I'd like to suggest another parameter for Service: States (list with default value: ["normal","warning"] which is the same as OnlyPassing: false). Available list values are normal, warning, critical. This will allow to create prepared queries like this:

{
    "Name": "crit-",
    "Template": {
        "Type": "name_prefix_match",
        "Regexp": "^crit-(.+)$"
    },
    "Service": {
        "Service": "${match(1)}",
        "States": ["critical"]
    }
}

To match critical services only.