hashicorp / consul

Consul is a distributed, highly available, and data center aware solution to connect and configure applications across dynamic, distributed infrastructure.
https://www.consul.io

Node protocol version is incompatible #4967

Closed · TwitchChen closed this issue 6 years ago

TwitchChen commented 6 years ago

We have a Consul cluster consisting of three servers and a number of client nodes; every agent runs version 0.9.2:

Node       Address              Status  Type    Build  Protocol  DC
30002  100.68.156.151:8301  alive   client  0.9.2  2         dc1
30003  100.68.156.152:8301  alive   client  0.9.2  2         dc1
30004  100.68.156.153:8301  alive   client  0.9.2  2         dc1
30005  100.68.156.154:8301  alive   client  0.9.2  2         dc1
30006  100.68.156.155:8301  alive   client  0.9.2  2         dc1
30007  100.68.156.156:8301  alive   client  0.9.2  2         dc1
30008  100.68.156.157:8301  alive   client  0.9.2  2         dc1
30009  100.68.156.158:8301  alive   client  0.9.2  2         dc1
30010  100.68.156.159:8301  alive   client  0.9.2  2         dc1
30011  100.68.156.161:8301  alive   client  0.9.2  2         dc1
30012  100.68.156.160:8301  alive   client  0.9.2  2         dc1
30021  100.68.156.170:8301  alive   client  0.9.2  2         dc1
30022  100.68.156.171:8301  alive   client  0.9.2  2         dc1
30023  100.68.156.172:8301  alive   client  0.9.2  2         dc1
30024  100.68.156.173:8301  alive   client  0.9.2  2         dc1
30264  100.68.156.174:8301  alive   client  0.9.2  2         dc1
30172  100.68.152.1:8301    alive   server  0.9.2  2         dc1
30173  100.68.152.2:8301    alive   server  0.9.2  2         dc1
30174  100.68.152.3:8301    alive   server  0.9.2  2         dc1

Yesterday I found the cluster was broken: the nodes could not read the KV store or anything else.

    2018/11/15 10:02:09 [WARN] manager: No servers available
    2018/11/15 10:02:09 [ERR] agent: failed to sync remote state: No known Consul servers

I checked the servers' logs and found this:

    2018/11/14 04:00:01 [ERR] memberlist: Failed push/pull merge: Node '30264' protocol version (0) is incompatible: [1, 5] from=100.68.156.174:51128
    2018/11/14 04:00:01 [ERR] memberlist: Failed push/pull merge: Node '30264' protocol version (0) is incompatible: [1, 5] from=100.68.156.174:51160

At 2018/11/14 04:00:01 we restarted some nodes; after that, node 30264 was reported as protocol version (0) is incompatible.

But I don't know what caused the problem.
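
For reference, the protocol versions each agent supports can be inspected from the CLI. A minimal sketch, assuming shell access to the agents; note that the vsn/vsn_min/vsn_max serf tags in the detailed output are the Consul protocol versions, while the [1, 5] range in the error above is the lower-level memberlist protocol, which is negotiated separately:

    # Show which Consul protocol versions this binary speaks
    consul version

    # Dump the serf tags each member advertises (vsn, vsn_min, vsn_max, build, ...)
    consul members -detailed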

banks commented 6 years ago

Yesterday I found the cluster was broken: the nodes could not read the KV store or anything else.

Can you describe exactly which node is not working as you expect? Is every node broken, or just one? Can you show your server and client configs?

2018/11/14 04:00:01 [ERR] memberlist: Failed push/pull merge: Node '30264' protocol version (0) is incompatible: [1, 5] from=100.68.156.174:51128

Hmm, that's strange, since all the nodes in the list you pasted say they support protocol version 2, and 0 is not even a valid protocol version. I wonder if this is a red herring caused by something like a vulnerability scanner talking to the node on its gossip port and sending it garbage?
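
If you want to test that theory against a throwaway agent (not production), you can send junk bytes to the serf LAN port and watch the logs; memberlist logs an error when it receives a bogus stream. A rough sketch, assuming nc is installed and the agent gossips on the default port 8301 (whether this reproduces the exact "protocol version (0)" message is not confirmed):

    # Open a TCP connection to the gossip port and send 64 random bytes
    # (<agent-address> is a placeholder for a test agent's IP)
    head -c 64 /dev/urandom | nc <agent-address> 8301

Then grep the agent's log for memberlist errors around that time.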

pierresouchay commented 6 years ago

@banks @TwitchChen I suspect this is the exact same bug as https://github.com/hashicorp/consul/issues/3217. It is not linked to any specific Consul version; we hit it from time to time (especially when there are many elections, such as when upgrading Consul), and it also happens when all agents run the exact same version.

The fix we use in that case is to restart all servers sequentially; it has worked every time for us (we have seen this up through versions 1.2.x, but we never found the exact root cause).

In that case, what can also happen is that a few agents can see each other but cannot see all of the servers. Sometimes restarting those agents works, but when it does not, restarting all servers sequentially is the only reliable way we have found to clear the issue (a sketch of that procedure is below).
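
A minimal sketch of that sequential restart, assuming the servers run under a systemd unit named consul and the HTTP API listens on the default localhost:8500 (the hostnames here are hypothetical):

    for host in consul-server-1 consul-server-2 consul-server-3; do
      ssh "$host" 'systemctl restart consul'
      # Block until the restarted server sees an elected leader again;
      # /v1/status/leader returns "" while there is no leader.
      until ssh "$host" 'curl -sf http://localhost:8500/v1/status/leader | grep -q 8300'; do
        sleep 5
      done
    done

Waiting for a leader before touching the next server keeps the raft quorum intact throughout the roll.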

banks commented 6 years ago

Thanks @pierresouchay, I agree this seems to be the same issue. I'll close this as a duplicate for now, since the other issue is already in our backlog for attention (it's a long backlog, sadly!).

Dupe of #3217. Thanks for reporting this @TwitchChen.