hashicorp / consul

Consul is a distributed, highly available, and data center aware solution to connect and configure applications across dynamic, distributed infrastructure.
https://www.consul.io

Service registration data synchronization #11438

Open · Gerrylinux opened 2 years ago

Gerrylinux commented 2 years ago

Service registration: if registration data were synchronized with the other nodes in the cluster, the remaining nodes could continue to serve the registration information after an outage.

Gerrylinux commented 2 years ago

High availability of service registry information

Gerrylinux commented 2 years ago

In the current situation, when a Consul node in the cluster goes down, the registration information held on that node can no longer be obtained.

Amier3 commented 2 years ago

Hello!

I see this is your first issue opened so welcome to the consul community! Just to confirm, are you asking for a feature enhancement where registration information is replicated across consul nodes for HA purposes?

Gerrylinux commented 2 years ago

Thank you for your reply. Yes, we currently use Consul for automatic discovery of Prometheus monitoring targets, but the registration information is not highly available. When a Consul node is down or under maintenance, its registration data is lost. We therefore need a way to replicate registration information between nodes.

blake commented 2 years ago

Hi @Gerrylinux,

When a Consul node is down or under maintenance, its registration data is lost. We therefore need a way to replicate registration information between nodes.

This is by design. From the Basic Architecture of Consul:

Every node that provides services to Consul runs a Consul agent…The agent is responsible for health checking the services on the node as well as the node itself…The servers maintain a catalog, which is formed by aggregating information submitted by the agents. The catalog maintains the high-level view of the cluster, including which services are available, which nodes run those services, health information, and more.

The key point is that the catalog aggregates service registration information that is submitted by agents. Service registrations are not replicated between agents. If a service is deployed across 3 different nodes, it must be registered with the individual Consul agents running on each node.
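
To make this concrete, here is a minimal sketch of registering a service with its local agent using the official Go API client (`github.com/hashicorp/consul/api`); the service name, port, and health check endpoint are hypothetical examples, and this registration would be repeated on each node that runs an instance of the service:

```go
package main

import (
	"log"

	"github.com/hashicorp/consul/api"
)

func main() {
	// Connect to the local Consul agent (127.0.0.1:8500 by default).
	client, err := api.NewClient(api.DefaultConfig())
	if err != nil {
		log.Fatal(err)
	}

	// Register this node's instance of the service with its local agent.
	// The name, port, and health check endpoint are hypothetical.
	reg := &api.AgentServiceRegistration{
		Name: "service-a",
		Port: 8080,
		Check: &api.AgentServiceCheck{
			HTTP:     "http://localhost:8080/health",
			Interval: "10s",
		},
	}
	if err := client.Agent().ServiceRegister(reg); err != nil {
		log.Fatal(err)
	}
}
```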

If node A goes down, Consul will remove the service entries associated with node A from the service catalog. The catalog will then only contain service information for the remaining, available service instances running on nodes B and C.

lycclsltt commented 2 years ago

Why is this? I don't understand. I would expect that when a node goes down, its data could still be found on the other nodes instead of being lost with that node @blake @Gerrylinux

Gerrylinux commented 2 years ago

My current workaround is to have each service submit its registration request to every node in the cluster, ensuring that each cluster node holds the registration information for the client service.
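
For reference, a rough sketch of that workaround with the Go API client, assuming a hypothetical list of agent addresses (note that, per the earlier reply, this is not how Consul is designed to be used):

```go
package main

import (
	"log"

	"github.com/hashicorp/consul/api"
)

func main() {
	// Hypothetical agent addresses for the cluster nodes.
	agents := []string{"10.0.0.1:8500", "10.0.0.2:8500", "10.0.0.3:8500"}

	// One registration, submitted to every agent so that each node
	// holds a copy of the service's registration information.
	reg := &api.AgentServiceRegistration{
		Name: "service-a",
		Port: 8080,
	}

	for _, addr := range agents {
		cfg := api.DefaultConfig()
		cfg.Address = addr
		client, err := api.NewClient(cfg)
		if err != nil {
			log.Printf("connect %s: %v", addr, err)
			continue
		}
		if err := client.Agent().ServiceRegister(reg); err != nil {
			log.Printf("register on %s: %v", addr, err)
		}
	}
}
```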

zffocussss commented 2 years ago

Is that working well for you?

zffocussss commented 2 years ago

Isn't Consul eventually consistent?

blake commented 2 years ago

Why is this? I don't understand. I would expect that when a node goes down, its data could still be found on the other nodes instead of being lost with that node

@lycclsltt It is because agents advertise to the rest of the cluster the set of services which are available on that agent.

For example, say you have nodes 1 and 2 in your environment, and two services: service A and service B. Service A is deployed on both nodes 1 and 2, and separately registered with each node.

Service B issues a service discovery query to Consul for service A, and Consul returns the IP/port info of the service instances running on both nodes 1 and 2.

If node 1 goes down, the health check for service instance A on that machine will be marked as failed. As such, Consul's catalog will no longer return information for the instance of service A running on node 1. It will only return information about the remaining healthy instance that is registered and still running on node 2.
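
As a sketch of what that discovery query looks like with the Go API client (the service name is hypothetical), asking only for instances whose health checks are passing:

```go
package main

import (
	"fmt"
	"log"

	"github.com/hashicorp/consul/api"
)

func main() {
	client, err := api.NewClient(api.DefaultConfig())
	if err != nil {
		log.Fatal(err)
	}

	// passingOnly=true filters out instances with failing health checks,
	// e.g. the instance of service A on the downed node 1.
	entries, _, err := client.Health().Service("service-a", "", true, nil)
	if err != nil {
		log.Fatal(err)
	}
	for _, e := range entries {
		fmt.Printf("%s:%d on node %s\n",
			e.Service.Address, e.Service.Port, e.Node.Node)
	}
}
```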

In order to have a service be highly available, you need to run the service across multiple nodes, and ensure the service is properly registered with its local node so that the particular service instance is discoverable in the service catalog.

See Introduction to HashiCorp Consul for a visual explanation of this, starting at the 5-minute mark and continuing through 6:25.

zffocussss commented 2 years ago

@blake Can application code connect to the Consul servers directly if the local agent is not working? I worry about the case where application services are running but the local Consul agent is unhealthy.

blake commented 2 years ago

@blake Can application code connect to the Consul servers directly if the local agent is not working?

Applications can connect directly to the Consul servers, but that is not recommended. The application should communicate with the local Consul agent in order to take advantage of agent caching and to offload service discovery requests from the servers.
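
For what it's worth, the Go API client already defaults to the local agent, and agent caching can be requested per query; a minimal sketch, reusing the hypothetical service-a name from the earlier examples:

```go
package main

import (
	"fmt"
	"log"

	"github.com/hashicorp/consul/api"
)

func main() {
	// DefaultConfig targets the local agent (127.0.0.1:8500) unless
	// overridden, e.g. by the CONSUL_HTTP_ADDR environment variable.
	client, err := api.NewClient(api.DefaultConfig())
	if err != nil {
		log.Fatal(err)
	}

	// UseCache asks the local agent to answer from its cache where
	// possible, offloading repeated discovery queries from the servers.
	opts := &api.QueryOptions{UseCache: true}
	entries, meta, err := client.Health().Service("service-a", "", true, opts)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Printf("%d instances (cache hit: %v)\n", len(entries), meta.CacheHit)
}
```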

I worry about the case where application services are running but the local Consul agent is unhealthy.

What particular failure scenarios are you trying to account for?

If applications are healthy but the local agent is not, this will affect the ability of services running on other machines to make new connections to services co-located on the host with the failed agent. Existing connections that were already established will continue to operate. The same is true for outgoing connections from services on that host.

Since the agent is down, the health checks associated with applications running on that host will be marked as unhealthy. If you continue to route traffic to those apps, you risk sending traffic to applications which are truly unhealthy (for example, an OS issue that has caused both the applications and the Consul agent to fail).

A better architecture is to run multiple copies of your application across different servers. If a Consul agent on a host is unhealthy, traffic is diverted to healthy instances on other hosts in the environment until the agent failure can be investigated and resolved.