hashicorp / consul

Consul is a distributed, highly available, and data center aware solution to connect and configure applications across dynamic, distributed infrastructure.
https://www.consul.io

Acquiring locks from remote DCs fails (regression) #6828

Open rhyas opened 4 years ago

rhyas commented 4 years ago

Feature Description

Allow Locks to be functional across regions.

(Why is there a -datacenter argument to the lock command at all if this doesn't work??)

Use Case(s)

Job Scheduling by services in multiple datacenters that should only run in a single datacenter at a time. Sequential startup of cluster technologies that need to use a shared lock so as to avoid split-brain scenarios.

crhino commented 4 years ago

Hi there, looks like this is a continuation of #5373. As @banks mentioned there, this is outside of the design decisions Consul has made.

> FWIW, this used to work in 1.1.x.

You mentioned this in that ticket, if you could expand upon what you did in 1.1.x to make this work that would be useful context to have.

rhyas commented 4 years ago

It might have been pre-1.1; I'd have to stand up a multi-region cluster to confirm the exact version. The command we used to run was:

consul lock -datacenter us-east-2 -monitor-retry 10 -timeout 6h -try 6h /locks/clusterapp/$cluster_name /usr/local/bin/provision.sh

This would be done from, for example, us-west-2, with the agent registered in the us-west-2 datacenter. Design decision or not, the direction Consul has taken is unfortunately a breaking change for us. I'd love to know if there's an alternative, because we relied on this on the provisioning side and were looking at using it a lot more for simple leader semantics.

Can anyone speak to why the -datacenter argument is even still in the code and available if this is never intended to work again?

banks commented 4 years ago

@rhyas thanks for clarifying with the example - it makes all the difference.

The other issue was asking about a mechanism for cross-DC locks that relies on Consul sessions and health checks across DCs.

The "design decision" that makes that not possible is that each DC only knows about it's locally registered services and so you can't have a lock on one DC that is held by a node in another DC and rely on the serf health mechanism to release it on node failure.

The example you gave should work because the consul lock command explicitly creates its own session, not tied to a node in the catalog, and uses a TTL-based heartbeat to keep that session alive.

That does work across DCs, but notice that it's not quite the same as "cross-DC locking" in general, because your locking calls are explicitly heartbeating over the WAN (forwarded by the local servers) rather than relying on some mechanism to keep lock and health state consistent between DCs.
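To make that concrete, the flow consul lock goes through can be sketched against the raw HTTP API roughly like this (the agent address, key name, and TTL below are placeholders, and the exact payloads the CLI sends may differ):

```sh
# Create a TTL-based session in the remote DC; the local servers forward this over the WAN.
SESSION=$(curl -s -X PUT -d '{"Name": "clusterapp-lock", "TTL": "30s", "Behavior": "delete"}' \
  "http://127.0.0.1:8500/v1/session/create?dc=us-west-2" | jq -r .ID)

# Try to acquire the lock key with that session; the response is "true" if we now hold it.
curl -s -X PUT -d "held-by-$(hostname)" \
  "http://127.0.0.1:8500/v1/kv/locks/clusterapp/blee?acquire=${SESSION}&dc=us-west-2"

# Heartbeat: keep renewing the session before the TTL expires, for as long as the work runs.
curl -s -X PUT "http://127.0.0.1:8500/v1/session/renew/${SESSION}?dc=us-west-2"

# Release the lock and destroy the session when done.
curl -s -X PUT "http://127.0.0.1:8500/v1/kv/locks/clusterapp/blee?release=${SESSION}&dc=us-west-2"
curl -s -X PUT "http://127.0.0.1:8500/v1/session/destroy/${SESSION}?dc=us-west-2"
```

The important part is that every one of those calls is just a normal cross-DC API request, so liveness is driven entirely by the TTL heartbeat rather than by the remote DC watching this node's health.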

Does that make sense? Apologies if that was confusing in the other issue.

Given that clarification, have you actually observed consul lock not working across datacenters in recent versions or was this posted just based on the comments in the other issue?

If you have actually seen a change in behaviour between versions, please add a little more detail about which versions you do and don't see the change in.

To my knowledge nothing should have changed that would cause this regression - the datacenter field is handled the same way in our API client and RPCs throughout Consul, and we haven't touched KV or lock handling specifically that I can think of.

If you can confirm this actually no longer works as you expect in a recent version of Consul, I'll mark it as a bug and we can take a look and try to reproduce it. If it's just confusion carried over from the other issue, then I don't think there is actually a regression here.

Thanks!

rhyas commented 4 years ago

Yes, we have observed it being broken. The error/responses indicate it's a registration/session issue, which is the only reason I would link it to the other issue. Here's what we get:

root@testme:~# curl -s http://169.254.169.254/latest/dynamic/instance-identity/document | jq -r .region
us-east-2
root@testme:~# consul lock -datacenter us-west-2 -monitor-retry 10 -timeout 6h -try 6h /locks/clusterapp/blee sleep 10
Lock acquisition failed: failed to create session: Unexpected response code: 500 (rpc error making call: rpc error making call: rpc error making call: Missing node registration)

We're running Consul v1.6.0 at the moment on both the client and the server. The cluster does have WAN links, with all nodes reporting healthy.

It'll take me a few days to find time to stand up an older version and pin down where it did and didn't work, but I'll try to get that done and update this thread.

banks commented 4 years ago

Great, thanks for confirming that. It should be easy enough to repro.

I've edited the issue title to make sure it's clear and we don't get confused again with that.

mirkof commented 4 years ago

This was really unexpected behaviour. I tested it with this Docker Compose file:

version: '3'

services:

  dc1node1: &consul-server
    image: "consul:latest"
    networks:
      - consul-demo
    ports:
      - 8511:8500
    command: "agent -datacenter=dc1 -node=node1 -server -client 0.0.0.0 -retry-join=dc1node2 -retry-join-wan=dc2node1"

  dc1node2:
    <<: *consul-server
    ports:
      - 8512:8500
    command: "agent -datacenter=dc1 -node=node2 -server -client 0.0.0.0 -retry-join=dc1node1 -retry-join-wan=dc2node1"

  dc2node1:
    <<: *consul-server
    ports:
      - 8521:8500
    command: "agent -datacenter=dc2 -node=node1 -server -bootstrap -client 0.0.0.0 -retry-join-wan=dc1node1 -retry-join-wan=dc1node2"

networks:
  consul-demo:

And these are the results:

curl -X PUT http://localhost:8511/v1/session/create?dc=dc1 ✔️
curl -X PUT http://localhost:8512/v1/session/create?dc=dc1 ✔️
curl -X PUT http://localhost:8521/v1/session/create?dc=dc1 ✔️

curl -X PUT http://localhost:8511/v1/session/create?dc=dc2 ✔️
curl -X PUT http://localhost:8512/v1/session/create?dc=dc2 ❌
curl -X PUT http://localhost:8521/v1/session/create?dc=dc2 ✔️

It seems that this works only if the other DC has a node with the same name, probably because creating a session requires an already registered node in the target DC.
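If that is indeed the constraint, one possible (untested here) workaround under the current behaviour would be to pass an explicit Node in the session body, pointing at a node name that does exist in the target DC's catalog (node1 below comes from the compose file above):

```sh
# Should work: bind the session to a node that is registered in dc2 ("node1").
curl -s -X PUT -d '{"Node": "node1", "TTL": "30s"}' \
  "http://localhost:8512/v1/session/create?dc=dc2"

# Fails: without a Node field the agent defaults to its own node name ("node2"),
# which has no registration in dc2's catalog.
curl -s -X PUT "http://localhost:8512/v1/session/create?dc=dc2"
```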

kyrias commented 4 months ago

Are there any plans for fixing this regression? It feels a bit weird that sessions with manual renewal have to be associated with a specific registered node at all.
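For illustration (the agent address and TTL are placeholders), even a session that is kept alive purely by manual renewal still ends up recorded against the creating agent's node name:

```sh
# Create a TTL session that we will keep alive with manual renew calls.
SESSION=$(curl -s -X PUT -d '{"TTL": "30s"}' http://127.0.0.1:8500/v1/session/create | jq -r .ID)

# Reading it back shows it is still bound to the local agent's node.
curl -s "http://127.0.0.1:8500/v1/session/info/${SESSION}" | jq '.[0].Node'

# Manual renewal simply resets the TTL clock.
curl -s -X PUT "http://127.0.0.1:8500/v1/session/renew/${SESSION}"
```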