hashicorp / consul

Consul is a distributed, highly available, and data center aware solution to connect and configure applications across dynamic, distributed infrastructure.
https://www.consul.io

Unable to deregister a service #1188

Closed drsnyder closed 6 years ago

drsnyder commented 9 years ago

I brought this to the attention of the mailing list here. @slackpad asked me to go ahead and file a bug. Below is a summary of the issue from the discussion thread.

We have services that are being orphaned and we cannot deregister them. The orphans show up under one or more of the master nodes. In our configuration the master nodes are dev-consul, dev-consul-s1, and dev-broker.

The health check of the orphaned node looks something like the following:

{
    "Node": "dev-consul",
    "CheckID": "service:discussion_8080",
    "Name": "Service 'discussion' check",
    "ServiceName": "discussion",
    "Notes": "",
    "Status": "critical",
    "ServiceID": "discussion_8080",
    "Output": ""
}

I attempted to deregister via:

user@dev-consul $ curl -X PUT -d '{"CheckID": "service:discussion_8080", "ServiceID": "discussion_8080", "Node": "dev-consul", "Datacenter": "dev"}' http://localhost:8500/v1/catalog/deregister

The node was removed but then reappears within 30-60s. As @slackpad recommended, I tried deregistering with:

user@dev-consul $ curl -v http://localhost:8500/v1/agent/service/deregister/discussion_8080
user@dev-consul $ curl -v -X PUT -d'{"CheckID": "service:discussion_8080", "ServiceID": "discussion_8080", "Node": "dev-consul", "Datacenter": "dev"}' http://localhost:8500/v1/catalog/deregister

Both commands returned status 200 OK, but the service still reappeared. You can see the output in this gist, along with the debug logs from consul.

From the debug logs in consul we see:

Aug 20 16:57:45 dev-broker consul[2221]: agent: Deregistered service 'discussion_8080'
Aug 20 16:57:45 dev-broker consul[2221]: agent: Check 'service:discussion_8080' in sync
Aug 20 16:57:45 dev-broker consul[2221]: agent: Deregistered check 'service:discussion_8080'
Aug 20 16:57:45 dev-broker consul[2221]: http: Request /v1/agent/service/deregister/discussion_8080 (19.73968ms)
Aug 20 16:57:46 dev-broker consul[2221]: http: Request /v1/agent/check/pass/service:discussion_8080, error: CheckID does not have associated TTL
Aug 20 16:57:46 dev-broker consul[2221]: http: Request /v1/agent/check/pass/service:discussion_8080 (246.298µs)
Aug 20 16:57:47 dev-broker consul[2221]: agent: Synced service 'discussion_8080' <--- SHADY!

The annotation is from @slackpad.

It's also noteworthy that the orphans are always associated with one of the master nodes (e.g. dev-consul), not with the node (dev-mesos) that's running the registered service. I should also mention (it could be a coincidence) that the service (discussion) is also flapping, though from what I can tell from the consul debug logs on dev-mesos everything is fine.

Our consul version:

$ consul version
Consul v0.5.2
Consul Protocol: 2 (Understands back to: 1)

Thanks!

drsnyder commented 9 years ago

I'm not sure if this helps with the solution to the problem, but the services can be deregistered if you deregister them on all of the servers in the cluster, using each server's local agent, at more or less the same time. See this tool for what we used to force the deregistration.

So in our case, I ran the linked tool above on the three servers in the cluster. It removed about 250 orphaned services that couldn't otherwise be deregistered.
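The brute-force cleanup described above can be sketched roughly like this (the server names and helper functions are illustrative; the actual linked tool may differ):

```python
import urllib.request

def deregister_urls(servers, service_id, port=8500):
    """Build one agent-level deregister URL per Consul server."""
    return [f"http://{host}:{port}/v1/agent/service/deregister/{service_id}"
            for host in servers]

def deregister_everywhere(servers, service_id):
    """PUT the deregister call to every server's local agent, back to back,
    so the orphan cannot be re-synced from a server that still holds it."""
    for url in deregister_urls(servers, service_id):
        req = urllib.request.Request(url, method="PUT")
        urllib.request.urlopen(req)  # raises on a non-2xx response

# e.g. deregister_everywhere(["dev-consul", "dev-consul-s1", "dev-broker"],
#                            "discussion_8080")
```

Hitting all servers in quick succession is what keeps the orphan from being re-synced back before the last copy is removed.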

milosgajdos commented 9 years ago

We are seeing something equally obscure in Consul. I'm completely at a loss to understand what is going on, but it seems similar to the issue described above.

Consul version:

# consul version
Consul v0.5.2
Consul Protocol: 2 (Understands back to: 1)

One of the services registered with consul dies, but consul never removes the registered entry, although it briefly seems to. We figured we would use Consul's HTTP API to deregister the service. A pointless exercise, as we learned later: even though consul seems to think, for a short time, that the record has been removed, the data then reappears out of the blue, and we are totally clueless as to why.

Here's the actual description:

We can curl the registered service at the beginning, as expected:

$ curl node1:8500/v1/catalog/service/my_service | python -mjson.tool
[
    {
        "Address": “1.2.3.4”,
        "Node": “my_service”,
        "ServiceAddress": "0.0.0.0",
        "ServiceID": "my_service:9042",
        "ServiceName": "my_service",
        "ServicePort": 9042,
        "ServiceTags": null
    }
]

We can query consul over DNS and receive the expected reply on every node (ignore the actual IP):

$ dig -p 8500 @consul_node1 my_service.service.dc1.consul +short
1.2.3.4
$

$ dig -p 8500 @consul_node2 my_service.service.dc1.consul +short
1.2.3.4
$

$ dig -p 8500 @node3 my_service.service.dc1.consul +short
1.2.3.4
$

Now we try to deregister the service. This is the JSON payload:

$ cat my_service.json
{
  "Datacenter": "dc1",
  "Node": "my_service",
  "ServiceID": "my_service:9042"
}

We PUT it to the leader of the cluster (which is node1). This appears to work; the service is gone when queried on every node in the cluster:

$ curl -X PUT -d @my_service.json node1:8500/v1/catalog/deregister
true
$
$ curl node1:8500/v1/catalog/service/my_service
[]
$
$ curl node2:8500/v1/catalog/service/my_service
[]
$
$ curl node3:8500/v1/catalog/service/my_service
[]
$

Then in about a minute or so, this happens:

$ tail logs (on node1)
…
…
2015/09/30 19:21:52 [INFO] agent: Synced service 'my_service:9042'
$

Curling the service catalog indeed returns the entry:

$ curl node1:8500/v1/catalog/service/my_service | python -mjson.tool
[
    {
        "Address": “1.2.3.4”,
        "Node": “my_service”,
        "ServiceAddress": "0.0.0.0",
        "ServiceID": "my_service:9042",
        "ServiceName": "my_service",
        "ServicePort": 9042,
        "ServiceTags": null
    }
]

Now, can someone tell me what is going on here?

drsnyder commented 9 years ago

I don't know the specifics of what's going on, but what we have learned is that when this happens you have to deregister the service from all of the consul servers. So if you have three, you need to deregister the service from all three.

We have been using this tool to clean them up. We have plans to productize it as an orphaned-service reaper, but we aren't there yet.

milosgajdos commented 9 years ago

Thanks, I'll check it out. Nevertheless, this is something I'd love to understand, as random data reappearance does not fill me with confidence, if I'm entirely honest.

Bugs happen in every piece of software, but I'd love to understand the actual underlying problem so it does not surprise me at 3 AM, as Murphy's law says it will.

volkantufekci commented 9 years ago

Hi, I'm running a single-node consul (v0.5.2) and have a similar issue here. I deregister a service via diplomat (a Ruby client) and consul says on its stdout:

2015/11/03 11:33:08 [INFO] agent: Deregistered service 'vcs4'

But "vcs4" can still be observed in http api and web ui.

volkantufekci commented 9 years ago

My issue is solved. The problem was the message in consul's output: it says a service is "deregistered" even if it doesn't exist. For example, I don't have a service registered with ServiceID "THIS_DOES_NOT_EXIST", but when I call

curl  http://CONSUL_AGENT_URL:8500/v1/agent/service/deregister/THIS_DOES_NOT_EXIST

Consul logs as:

2015/11/03 15:11:56 [INFO] agent: Deregistered service 'THIS_DOES_NOT_EXIST'

So, in my case I was trying to deregister with a wrong ServiceID, and consul's output was misleading me: it said the service was deregistered instead of warning me that there is no service with that ID...

thpham commented 8 years ago

Hello,

I had a similar problem trying to deregister services created by registrator for a docker container. It took me half a day to notice that the ServiceID was generated with special characters, so I had to call the API endpoint with a URL-encoded string! @milosgajdos83, taking your previous example, you should call the API like this:

curl -v -X PUT http://CONSUL_AGENT_URL:8500/v1/agent/service/deregister/my_service%3A9042

CONSUL_AGENT_URL should be the node hostname/ip where the agent registered the service.
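The URL-encoding step can be done with the standard library instead of by hand; a small Python sketch (the helper name is mine):

```python
from urllib.parse import quote

def deregister_path(service_id):
    """Percent-encode the service ID so characters like ':' survive the URL."""
    # safe="" forces every reserved character in the ID to be encoded.
    return "/v1/agent/service/deregister/" + quote(service_id, safe="")

print(deregister_path("my_service:9042"))
# /v1/agent/service/deregister/my_service%3A9042
```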

Hope it will help some people :-)

codelotus commented 8 years ago

It would appear that the error checking in the consul HTTP API is not complete (I have not looked at the code to verify this); hence the successful response from a failed deregistration. @milosgajdos83 I was able to successfully register and deregister your service by changing the format of the JSON and by using the /v1/catalog/ endpoint.

To register a service

curl -XPUT -d @consulServiceRegister.json http://localhost:8500/v1/catalog/register

where consulServiceRegister.json is:

{
  "Datacenter": "test-dc",
  "Node": "test-node",
  "Address": "1.2.3.4",
  "Service": {
    "ID": "my_service:9042",
    "Service": "my_service",
    "Address": "0.0.0.0",
    "Port": 9042
  }
}

To deregister a service (note the Address is required):

curl -XPUT -d @consulServiceDeRegister.json http://localhost:8500/v1/catalog/deregister

where consulServiceDeRegister.json is:

{ 
  "Datacenter": "test-dc",
  "Node": "test-node",
  "ServiceID": "my-service:9042",
  "Address": "1.2.3.4"
}  

At this point the registered service has been successfully deregistered and after 15 minutes the service has not returned:

curl http://localhost:8500/v1/catalog/services                                                   
{"consul":[]} 
javaxplorer commented 8 years ago

The example from @codelotus works as long as you register with the catalog and not with the agent. If you do the following call:

curl -XPUT -d @consulServiceRegisterAgent.json http://10.98.204.21:8500/v1/agent/service/register

Where consulServiceRegisterAgent.json is:

{
  "ID": "my_service:9042",
  "Name": "my_service",
  "Address": "1.2.3.4",
  "Port": 9042
}

And then do a deregister:

curl -XPUT -d @consulServiceDeRegister.json http://localhost:8500/v1/catalog/deregister

where consulServiceDeRegister.json is:

{ 
  "Datacenter": "test-dc",
  "Node": "test-node",
  "ServiceID": "my-service:9042",
  "Address": "1.2.3.4"
}  

The service will respawn in a minute or so :(

cabrinoob commented 8 years ago

Same problem here. I have zombie services which come back to life no matter what deregistration technique I use.

peterklipfel commented 8 years ago

I'm load balancing with consul-template, and this is causing me some major headaches. Round robin load balancing to services that may or may not exist creates ridiculous, cascading networking bugs.

What I found was that the master said that one of my members had failed, but that member thought that it was still alive. I made the member leave, and then rejoin. This fixed the issue.

slackpad commented 8 years ago

Wanted to clarify - I think there are a few things going on here in this issue:

  1. The original problem posted by @drsnyder looks to be an issue with services registered on the Consul servers - that is an outstanding thing we need to track down.
  2. The error checking problem pointed out by @volkantufekci needs to be fixed because that adds to confusion by returning bogus success responses.
  3. The problems encountered by @milosgajdos83 and @cjhkramer look like a common source of confusion around using Consul. We need to beef up the docs on this - an explanation of this follows.

In Consul it's extremely rare to use the Catalog API directly. The Agent API (https://www.consul.io/docs/agent/http/agent.html) should almost always be used. For services running on Consul agents, the agent is the source of truth, not the catalog maintained by the servers. Periodically, the agents perform an anti-entropy sync and use the Catalog API internally to update the servers to have the correct state. This means that if you use the catalog API to deregister a service, it will disappear for a little while then the agent will put that back on the next sync. If you use the Agent API it will take care of removing the service from the catalog for you.

The call to https://www.consul.io/docs/agent/http/agent.html#agent_service_deregister should be made on the agent where the service is registered.
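The anti-entropy behavior described above can be illustrated with a toy model (a deliberate simplification, not Consul's actual code): on every sync pass, the agent's local state wins over the catalog.

```python
# Toy model of Consul's anti-entropy sync: the agent is the source of truth,
# and each sync pass re-asserts the agent's services into the catalog.

agent_services = {"discussion_8080"}   # what the local agent believes
catalog = {"discussion_8080"}          # what the servers' catalog holds

def catalog_deregister(service_id):
    """Mimics PUT /v1/catalog/deregister: edits the catalog only."""
    catalog.discard(service_id)

def agent_deregister(service_id):
    """Mimics PUT /v1/agent/service/deregister: edits the agent's state."""
    agent_services.discard(service_id)

def anti_entropy_sync():
    """Periodic sync: the catalog is overwritten with the agent's view."""
    catalog.clear()
    catalog.update(agent_services)

# Deleting from the catalog alone doesn't stick:
catalog_deregister("discussion_8080")
anti_entropy_sync()
print("discussion_8080" in catalog)   # True - the agent put it back

# Deleting via the agent does stick:
agent_deregister("discussion_8080")
anti_entropy_sync()
print("discussion_8080" in catalog)   # False
```

This is exactly the "disappears for a little while, then comes back on the next sync" behavior reported throughout this thread.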

ch3lo commented 8 years ago

I had zombie services in /v1/catalog/service/... but not in /v1/agent/services. I did a "consul rejoin" on the agent related to the zombie and they disappeared. I think something odd is going on with the anti-entropy sync from the agents to the servers.

doublerebel commented 8 years ago

Am being bitten by this today. Attempting to set maintenance mode on a nonexistent service correctly returns a 404. But, I can send any ID to a deregister endpoint and get a 200 OK, whether the service exists or not. I would expect any endpoint that takes an ID to return a 404 if that ID does not exist.

(I also can't seem to deregister a service with a . in the ID, despite that being legal in a URL and not needing URL encoding. EDIT: this might not be the case.) Issues #1333, #1138, #1096 are related, in case anyone there needs this thread.

I did notice that a successful service deregister also prints Deregistered check... to the logs. A nonexistent service has no checks. (I did make sure to do all this with the agent and not the catalog.)

Now I'm also wishing for a "deregister service" button in the UI, to solve this for me. Thanks all for your suggestions and helper examples.

josegonzalez commented 8 years ago

Seems like the file for that service actually still exists on a box, even when issuing a deregistration to that box (testing with a single consul instance).

Removing the service file on the box and deregistering didn't appear to fix it. Neither did removing the local.snapshot on it. Removing both the local and remote snapshot did have an effect though.

ghost commented 8 years ago

Hello. Is there any progress on this? I am having the issues described here and a lot of trouble.

kbroughton commented 8 years ago

same. Pretty major flaw. Consul-template picks up the old service.

lowzj commented 8 years ago

Hello, same problem here. I try to deregister some critical services from a consul server that has been stopped, but they do not deregister correctly. Is there any progress?

babbottscott commented 8 years ago

For configuring a client consuming a service, would the service health endpoint rather than the service catalog be a more appropriate option? I may be underestimating the bug here, but ISTM the service catalog is prone to extraneous data (either from new services not yet ready for consumption, or from decommissioned services) by design.

alexykot commented 8 years ago

I can confirm that on version 0.6.4 I cannot reproduce this issue any more on a test setup.

I've built a small test setup with three consul agents sitting in containers on the same node, talking to each other.

Then I created a test service through the PUT /v1/agent/service/register endpoint on one node, and confirmed it propagated within seconds to the other two agents and is available through GET /v1/catalog/services on each agent.

And when I deregistered the service on the same agent it was created on, with DELETE /v1/agent/service/deregister/test-service1, it was gone instantly from the catalogs on all three nodes.

n8gard commented 8 years ago

I just stood up the Consul UI in our environment, then killed some EC2 instances, which means they didn't gracefully leave the system. I see them as failed nodes in the UI; so far so good. But when I click the Deregister button, they do go away; however, upon reloading the UI, they are there again. I have done this many times. It could well be something wrong on my side, as this is a very new environment and I'm doing this for the first time, but it sounds exactly like this issue.

I'm on v0.6.4 on Ubuntu 14.04 LTS.

skyrocknroll commented 8 years ago

@alexykot Actually the checks are registered by nomad in my case

ghost commented 8 years ago

@alexykot

And when I deregistered service on the same agent it was created on with DELETE /v1/agent/service/deregister/test-service1 - it has gone away instantly from catalogs on all three nodes.

Have you tried to deregister from nodes other than the one you used to register? That's when they come back. I'm not sure if it is supposed to work this way, though.

alexykot commented 8 years ago

@webertlima You cannot deregister a service from the agent on a different node; the service only exists on the agent you registered it with. It also exists in the catalog on all nodes, but that is not related to the agent itself. And to be honest, I don't understand why there is a catalog/deregister endpoint at all; in my opinion the catalog should be a read-only service list.

ghost commented 8 years ago

@alexykot thanks for clearing that up.

flypenguin commented 8 years ago

I'm going crazy right now. I use consul as a service registry (which it apparently is), but I am completely unable to deregister services. I am trying to use the deregister endpoint, and I am seeing the exact same behavior, and it seriously f*cks with my network setup.

I use consul-template to configure haproxy for services (which appear and vanish). Because of my system setup I use only one central agent to register services with, and it seems I will be forever unable to deregister them.

This is a superbly bad situation, and I really do not understand the point of the /deregister endpoint if it can't be used; even with a read-only catalog I would assume I could remove services at some point. (What's the point of a distributed system if you have some weird logic about which nodes to use for some operations anyway?)

Update: I also tried de-registering on the node where the service runs, and still it's coming back.

I just. Don't. Get. It.

flypenguin commented 8 years ago

I have now managed to get rid of those services by stopping all consul instances, deleting the data directory, and restarting them. This is not the way to go, IMHO. For a single test case the deregistration now seems to work fine, for whatever reason. I am thinking of moving away from consul as fast as possible now, because this kind of nondeterministic behavior makes it impossible to rely on it as a central piece of infrastructure, and consul is currently the backbone of my service management.

I really like consul though and would be super happy if there could be a solution for this.

slackpad commented 8 years ago

Hi @flypenguin sorry you are having trouble. There are some issues called out in https://github.com/hashicorp/consul/issues/1188#issuecomment-185977469, but Consul's behavior is definitely deterministic. I think you are running into problems because of this:

I use only one central agent to register services with

Consul's really not designed to run all registrations through a central set of agents. In Consul, the agent holds the information about which services are registered, and then takes responsibility for syncing that information up to the catalog maintained by the servers. If you delete a service from the catalog, the agent will put it back (which I agree is confusing and we need to document that more clearly). To remove a service you always need to remove it using the Agent API and it will remove it from the catalog for you.

If you run an agent on each node and always register/deregister using that agent for the services on that node then things should work properly (and if that node dies all of its services will eventually be reaped automatically). If you are running a small number of agents and registering everything through those, setting the addresses manually, it is easy to lose track of where a service was registered, making it hard to remove it. I'd strongly recommend against running Consul like this - it also prevents reaping as described above, and things like sessions from working properly. Consul is designed to have the agent running on each node in the cluster.

The other issue on here where you get a 200 when deleting even if a service doesn't exist adds to the confusion; we will also fix that. Sorry for the trouble - hopefully you can get things working well in your setup!

flypenguin commented 8 years ago

Okay, I understand. Three things come to my mind on this:

First, I actually did use the same instance for registering and deregistering. I also tried deregistration on ANY instance I have. It didn't work, and I assume it was because of the agent-vs-non-agent API situation. This is, if at all, very poorly documented, and it seems to not work at all. So the very existence of this endpoint seems completely pointless (if not harmful) to me.

Second, I find the behavior very non-intuitive. I would expect a Consul cluster to propagate those events. IMO the nodes could still manage "their own" services even if not registered via them - the registered service should just "travel" to the node in question and stay there. Same, actually, with the deregistration. (or why bother including the service address if it really should only ever be "the" localhost?)

Third. Since "second" does not behave like I thought it would, I would expect very clear, unmistakable, prominent documentation about how this works, which the current docs are not at the moment.

Some background maybe on my setup, so you understand why using a central instance seemed like a natural choice for me:

Anyway, thanks a lot for your feedback!

ghost commented 8 years ago

@flypenguin Hi, I see you are (or were) making some of the same mistakes I was making using the Consul agent. Picture this: if you have ONLY ONE Consul server (the master), you use that instance to register and deregister your services through the HTTP API, no matter how many services and nodes you have. This MUST work, unless you are doing something wrong with the request.

Now if you have MORE THAN ONE Consul instance (like 1, 2 or 3 Consul masters plus other Consul clients), and you use instance 2 to REGISTER, you HAVE to use the same instance 2 to DEREGISTER. It doesn't matter which instance of Consul you use to register a service, as long as you use the same instance to deregister that same service (same ID).

Good practice: run the Consul client on every machine that runs services and use 127.0.0.1 to register/deregister. It'll work. If the service is in a Docker container, use its default route.
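Following that advice programmatically means remembering, at registration time, which agent address was used, so the deregister call targets the same place. A sketch (the in-memory dict stands in for whatever durable store you'd really use):

```python
from urllib.parse import quote

# Hypothetical bookkeeping: service ID -> the agent address it was
# registered through.
_registered_via = {}

def record_registration(service_id, agent_addr="127.0.0.1:8500"):
    """Remember which agent handled the register call for this service,
    and return the register endpoint to PUT the payload to."""
    _registered_via[service_id] = agent_addr
    return f"http://{agent_addr}/v1/agent/service/register"

def deregister_url(service_id):
    """Target the same agent that registered the service, as advised above."""
    agent_addr = _registered_via[service_id]  # KeyError if never registered
    return (f"http://{agent_addr}/v1/agent/service/deregister/"
            + quote(service_id, safe=""))
```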

Keets2016 commented 8 years ago

I still can't remove zombie service instances via the Agent API. Those instances still appear in the UI.

flypenguin commented 8 years ago

okay, I have now tried to work with this for a while, and I still very much think the approach has "room for improvement".

I really propose that you should be able to use any agent to register and deregister services; this information should just be forwarded to the agent in question, which is then still responsible for doing it. In the not-so-unlikely case where there is no consul agent on the node running the service in question, a hashing algorithm could be used to explicitly determine the "responsible" node. In any case it should not be the node where the API request comes in, because that node is arbitrary, and arbitrary is evil.

simple example: I use a single URL to register/deregister services, and behind this URL is a load balancer which balances across several consul instances for HA. A very simple approach, easy to do, and ... not working with the current setup.

my main three lines of thinking are:

I really think there are good reasons to switch to a more sophisticated service registration management.

lowzj commented 8 years ago

@Keets2016 you should use Agent Deregister API to deregister the service instance on the same consul agent that your service instance registered.

flypenguin commented 8 years ago

@lowzj yes and that's exactly what should change. Consul is a distributed system but can't be used in a distributed way.

mdirkse commented 8 years ago

Agreed, unless I'm missing something it's kinda crazy that it works like this.

lowzj commented 8 years ago

@flypenguin yeah, at first I was totally confused by the mechanism consul uses, and spent a lot of time on this. IMO, if a service can be registered using the catalog API, it should be possible to deregister it the same way.

But consul is much different from other systems I've used before, for example zookeeper. It's more like SmartStack, but simpler. The consul servers are like zookeeper in SmartStack: a distributed key/value store. And the consul client is like Nerve in SmartStack: it manages local services, performs health checks, and reports the results to the consul servers/zookeeper. So why can't I deregister a service from consul using the catalog API? It is just like directly deleting a service (represented by an ephemeral znode) from zk: Nerve will recreate the ephemeral znode because the service's health check is still succeeding.

Why does consul work in this different way? From my experience, I think the biggest reason may be that consul is much easier and more convenient for services to use. From a service's point of view, the local consul client is God: everything can be done through it, and the service need not care about anything else. But it's a little difficult to maintain the whole consul cluster (one agent per node); we must write tools to deploy/monitor/auto-restart consul agents. Yes, I also wrote a shell script to deregister critical zombie services. -_-!!

Sorry for my poor English, please don't mind.

flypenguin commented 8 years ago

ah you misunderstood me - I don't actually care about which API to use. that is ok for me.

I care about the fact that you have to use the SAME HOST for registering and deregistering; otherwise it won't work as expected, it seems.

This should be changed.

ghost commented 8 years ago

I agree with @flypenguin's point of view about infrastructure. I had two system outages because of kernel panics: the services were not deregistered on the machine that died, and I couldn't deregister them until I brought that machine back online.

vtahlani commented 7 years ago

If I try to register an external service (https://www.consul.io/docs/guides/external.html) with the Agent API, I get the error invalid tag DataCenter. Can someone please provide me with an example of registering external services with the Agent API?

pocesar commented 7 years ago

This just happened to me. After deregistering on the local agent (on 127.0.0.1), I'm seeing that the service is still up and running; the health checks even return critical 1s after the deregister call. That's really counterintuitive, especially on a local agent that should reflect changes immediately.

FRosner commented 7 years ago

tl;dr: When you encounter this problem and none of the above solutions help, check which process is re-registering the service that you deregistered. You can check this by enabling the DEBUG log on the consul client where it gets registered. Try killing that process and see if the problem persists.

I just wanted to say that we had the same issue today, where a service kept coming back, and for us it was actually caused by a Nomad client. After migrating the allocation folders we seem to have missed one allocation, which could have caused this inconsistency.

This was the setup:

AjitDas commented 7 years ago

Only being able to register and deregister from the same host where the agent is running is a horrible solution from the consul experts; it defeats the whole purpose of a highly available and resilient architecture. I have the same issue: I run 5 consul instances behind an AWS ALB/ELB, like any other service, with Docker on AWS ECS, so that it's scalable and highly available across a number of AWS tasks, and I don't want a consul client or agent running on every EC2 instance where my applications run. I have thousands of servers running in AWS ECS clusters for many applications, and this becomes unmanageable if I have to run a consul client on each of those EC2 instances. I don't want to hardcode the consul IP and port either. When I use the Agent API via the ELB/ALB URL, it lands on one of the 5 consul instances and works fine; but when I deregister through the load balancer there is only a 20% chance of success, since the request can go to any of the 5 consul nodes. With frequent updates to my service deployments, too many service IDs dangle under a service name, and it creates a big headache.

Being able to register/deregister from any node is a must. I am surprised this was not thought through, and I would love to hear solutions from the big shops that run thousands of EC2 instances for their applications.

slackpad commented 7 years ago

Hi @AjitDas, Consul provides solutions to the problems you mention if you run the agent on each node in your cluster. Consul's not designed to run just as a set of servers behind a load balancer; running it that way means you take on solving these problems yourself. When you run agents on each node and have your applications register themselves, they will sync up the catalog on the servers for you, and health checks can be performed locally on the agent, the results of which will get synced automatically as well. The agents perform checks against each other, forming an efficient failure detector which will arrange to have the catalog cleaned up in the event that a node dies and doesn't deregister itself. Applications only need to talk to their local agent, and Consul will route requests to a healthy server automatically with no load balancer. Many, many folks are successfully running Consul clusters with thousands of nodes in this fashion. Hope that helps!

GreatSnoopy commented 7 years ago

@slackpad but what you say does not cover the situation where, for some reason, the node that registered the service in the first place is not available any more and cannot be made available again. For that situation, there should be an option to forcibly deregister a service, even if you must do it on one of the few voting nodes. And yes, it defeats the purpose of having consul as a distributed system to not be able to perform some operations on more than one specific node. It is even counterintuitive to be able to set/unset key-values in the KV store from any node but not be able to do the same with services. Please change this behavior; it really does not make sense the way it is managed now.

slackpad commented 7 years ago

@GreatSnoopy if the node is indeed gone from the cluster then you have two ways to remove a stale service from one of the remaining nodes (it will get cleaned up by Consul after 72 hours if you don't do anything):

  1. Use the consul force-leave command or Agent API to immediately remove the node from the cluster, which will remove its associated services.

  2. Use the Catalog API to remove the services. If the node is gone then they will not be re-registered automatically (the agent on the node is what causes that).
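For option 2, a small helper can build the body for PUT /v1/catalog/deregister (node and datacenter names here are placeholders):

```python
import json

def catalog_deregister_payload(node, datacenter, service_id=None):
    """Build the JSON body for PUT /v1/catalog/deregister.

    Omitting ServiceID removes the whole node entry; including it removes
    just that service from the node.
    """
    body = {"Datacenter": datacenter, "Node": node}
    if service_id is not None:
        body["ServiceID"] = service_id
    return json.dumps(body)

print(catalog_deregister_payload("dead-node-1", "dc1", "my_service:9042"))
# {"Datacenter": "dc1", "Node": "dead-node-1", "ServiceID": "my_service:9042"}
```

Because the dead node's agent is gone, nothing will re-sync the entry back after this call.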

mritd commented 7 years ago

Same problem here.

Consul v0.9.3


dataviruset commented 7 years ago

I have a similar problem here on Consul 1.0.0. A service ('problematic-service') keeps coming back, so the Consul client is deregistering it all the time, but it doesn't go away and is still visible from the other nodes.

nov 09 00:15:44 myserver consul[2866]: 2017/11/09 00:15:44 [INFO] agent: Synced service 'a-service'
nov 09 00:15:44 myserver consul[2866]: agent: Synced service 'a-service'
nov 09 00:15:44 myserver consul[2866]: agent: Deregistered service 'problematic-service'
nov 09 00:15:44 myserver consul[2866]: 2017/11/09 00:15:44 [INFO] agent: Deregistered service 'problematic-service'

EDIT: I discovered that I had a node_name conflict. Fixing that, removing the serf folders, and executing force-leave on the Consul servers seems to have fixed the problem.

alexeyknyshev commented 6 years ago

So is there a way to purge a service that was registered the wrong way (say, via an agent on another node, and I don't know which one)?

webertrlz commented 6 years ago

best way is to keep track of what agent was used to register in your application, and use the same to deregister.

upon node failure, the application should use the catalog api to deregister services.

the main problem is when a node serving both agent & applications dies; then such applications won't deregister, and one must have another application take care of healthcheck + deregistering using the catalog api

slackpad commented 6 years ago

best way is to keep track of what agent was used to register in your application, and use the same to deregister.

This is true - the local agent where it was registered using configs or /v1/agent APIs should be used to deregister it.

the main problem is when a node serving both agent & applications dies; then such applications won't deregister, and one must have another application take care of healthcheck + deregistering using the catalog api

This should not be necessary. The serfHealth check for the dead node will fail within a few seconds, effectively marking all of the services there offline. Consul will automatically clean up the catalog in 72 hours if the node doesn't come back.