hashicorp / consul

Consul is a distributed, highly available, and data center aware solution to connect and configure applications across dynamic, distributed infrastructure.
https://www.consul.io

Invalid value for X-Consul-Index when querying folder #468

Closed webcoyote closed 9 years ago

webcoyote commented 9 years ago

I noticed a bug that causes my application to have high CPU utilization because it continuously polls consul, which is returning a zero value for X-Consul-Index.

I'm running the latest consul (64-bit on Windows)

$ consul -v
Consul v0.4.1
Consul Protocol: 2 (Understands back to: 1)

BUG: I query a folder and get X-Consul-Index=0

$ curl -i http://localhost:8500/v1/kv/match/disable/?keys
HTTP/1.1 200 OK
X-Consul-Index: 0 <<<<<<<<<<<<<<<<<<<<<< BUG
X-Consul-Knownleader: true
X-Consul-Lastcontact: 0

["match/disable/1.2.3.4","match/disable/4.4.4.4","match/disable/4.5.6.7","match/disable/"]

I query the individual keys and get valid results

$ curl -i http://localhost:8500/v1/kv/match/disable/1.2.3.4
HTTP/1.1 200 OK
X-Consul-Index: 16091
X-Consul-Knownleader: true
X-Consul-Lastcontact: 0

[{"CreateIndex":16091,"ModifyIndex":16091,"LockIndex":0,"Key":"match/disable/1.2.3.4","Flags":0,"Value":null}]

$ curl -i http://localhost:8500/v1/kv/match/disable/4.4.4.4
HTTP/1.1 200 OK
X-Consul-Index: 16352
X-Consul-Knownleader: true
X-Consul-Lastcontact: 0

[{"CreateIndex":16352,"ModifyIndex":16352,"LockIndex":0,"Key":"match/disable/4.4.4.4","Flags":0,"Value":""}] 

$ curl -i http://localhost:8500/v1/kv/match/disable/4.5.6.7
HTTP/1.1 200 OK
X-Consul-Index: 16097
X-Consul-Knownleader: true
X-Consul-Lastcontact: 0

[{"CreateIndex":16097,"ModifyIndex":16097,"LockIndex":0,"Key":"match/disable/4.5.6.7","Flags":0,"Value":null}]

I deleted a key in a totally unrelated folder and that fixed the problem:

$ curl -i http://localhost:8500/v1/kv/match/disable/?keys
HTTP/1.1 200 OK
X-Consul-Index: 16890 <<<<<<<<<<<<<<<<<< VALID
X-Consul-Knownleader: true
X-Consul-Lastcontact: 0

["match/disable/1.2.3.4","match/disable/4.4.4.4","match/disable/4.5.6.7","match/disable/"]

Is this expected behavior, or is my application "doing it wrong"? My goal is to long-poll a list of keys (I don't need the values) to discover which services are disabled, so I believed a "keys" query would be reasonable.

Note that I've simplified the above results by removing the Date, Content-Type and Content-Length headers, but they were all legitimate values.
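For anyone whose watcher spins the same way, here is a minimal client-side guard, offered as a sketch rather than as the fix that landed in Consul. It assumes the standard blocking-query parameters `index` and `wait` and the agent address from the curl calls above. The names `next_wait_index`, `backoff_delay`, and `watch_keys`, the 330-second socket timeout, and the backoff constants are all hypothetical choices made for illustration. The idea is to treat a missing, zero, or backwards `X-Consul-Index` as unusable, restart from index 0, and sleep before retrying instead of polling in a tight loop.

```python
import json
import time
import urllib.error
import urllib.request
from typing import Callable, List, Optional

CONSUL = "http://localhost:8500"  # agent address used in the curl calls above


def next_wait_index(previous: int, header_value: Optional[str]) -> int:
    """Return the index to send on the next blocking query.

    Returns 0 (start over, do not block) when the header is missing,
    unparsable, not positive, or lower than the previous index.
    """
    try:
        current = int(header_value) if header_value is not None else 0
    except ValueError:
        current = 0
    if current <= 0 or current < previous:
        return 0
    return current


def backoff_delay(index: int, failures: int,
                  base: float = 0.5, cap: float = 30.0) -> float:
    """Seconds to sleep before the next request; zero when the index is usable."""
    if index > 0:
        return 0.0
    return min(cap, base * (2 ** failures))


def watch_keys(prefix: str, on_change: Callable[[List[str]], None],
               wait: str = "5m") -> None:
    """Long-poll the key list under `prefix` (hypothetical helper, runs forever)."""
    index, failures = 0, 0
    while True:
        url = f"{CONSUL}/v1/kv/{prefix}?keys&index={index}&wait={wait}"
        try:
            # Assumed timeout: somewhat longer than the 5m wait.
            with urllib.request.urlopen(url, timeout=330) as resp:
                header = resp.headers.get("X-Consul-Index")
                keys = json.loads(resp.read())
        except urllib.error.HTTPError as err:
            if err.code != 404:  # 404 means no keys under the prefix
                raise
            header, keys = err.headers.get("X-Consul-Index"), []
        new_index = next_wait_index(index, header)
        if new_index > 0 and new_index != index:
            on_change(keys)
        failures = 0 if new_index > 0 else failures + 1
        time.sleep(backoff_delay(new_index, failures))
        index = new_index
```

Calling `watch_keys("match/disable/", print)` would print the key list each time the index advances. With the header stuck at 0, as in the first response above, the loop would retry with a growing delay instead of consuming CPU.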

armon commented 9 years ago

Looks like a bug! I will look into it

armon commented 9 years ago

Are ACLs enabled? Could you share the configuration?

webcoyote commented 9 years ago

ACLs are not enabled. Here's the script to run consul:

GOMAXPROCS=2 "$SCRIPT_DIR/../../bin/consul" agent             \
  -bind=127.0.0.1                     \
  -bootstrap                          \
  -server                             \
  -config-dir "$SCRIPT_DIR/consul.d"  \
  -data-dir "$SCRIPT_DIR/data"        \
  -ui-dir "$SCRIPT_DIR/ui"            \
#

Configuration directory contains one file:

{
  "service": {
    "name": "MatchSrv",
    "tags": ["game"],
    "port": 9001,
    "check": {
        "name": "MatchSrv status",
        "ttl": "15s"
    }
  }
}

armon commented 9 years ago

@webcoyote I just pushed 8a1969cc8cf4585438630e051e6fcc24fc7e908f, which should fix this. Do you think you could give it a try with a build from master?

armon commented 9 years ago

@webcoyote Have you had a chance to try again on the master build?

webcoyote commented 9 years ago

Thanks for your help building consul. I'm running master with 8a1969c and haven't encountered any problems so far. Unfortunately I've been unable to recreate the bug with the old version (0.4.1) either; it was a one-time occurrence.

Incidentally, when I ran the new version it was necessary to delete the data folder to prevent "==> Error starting agent: Failed to start Consul server: Failed to start Raft: MDB_INVALID: File is not an MDB file". Is that expected behavior?

armon commented 9 years ago

Hmm, that is definitely not expected behavior. You experienced this when upgrading from 0.4.1 to the master build? That is super strange. Was the 0.4.1 an official build?

webcoyote commented 9 years ago

Yes, I was using the 0.4.1 official build for Windows. Note that the MDB_INVALID error is bidirectional:

rm data-directory; run old-consul; run new-consul => error
rm data-directory; run new-consul; run old-consul => error

armon commented 9 years ago

Awesome, thanks for the heads up. I'll need to investigate a possible regression.

armon commented 9 years ago

@webcoyote Can you confirm the new versions were never running at once? Working with the LMDB developer, I think the issue is upstream. Apparently concurrent processes can break this.

hyc commented 9 years ago

Looks like you used two different LMDB versions and the offset of the version or magic number changed in the meta page.

Has nothing to do with maxreaders setting.

armon commented 9 years ago

@hyc We've been pinned to 0.9.11 for quite some time (May 30th). Not sure why this would be happening now that we touched that setting.

Does this mean upgrading to LMDB 1.0 will also break all existing installs?

webcoyote commented 9 years ago

> Can you confirm the new versions were never running at once?

I was only running one at a time.

hyc commented 9 years ago

@armon - should not have happened, but changing compile environment might have affected it.

LMDB 1.0 will break 0.9 compatibility, yes. The on-disk data layout will change. It has not (intentionally) changed so far.

armon commented 9 years ago

@hyc Is there a compatibility promise with the 1.0 release? How do you guys handle the upgrade process for OpenLDAP?

hyc commented 9 years ago

All minor versions within a major version will be compatible. I doubt the disk format will change often. We're changing it in 1.0 to support incremental backup. There aren't any other new features envisioned that will require further changes.

hyc commented 9 years ago

Meanwhile, I'd be curious to see why your Windows builds are having this problem, if all your builds are on 0.9.11.

armon commented 9 years ago

@webcoyote Do you think you could provide @hyc with copies of the raft/ data directory after running it with each version? That may be of use in determining what changed.

hyc commented 9 years ago

re: handling changes in OpenLDAP - the docs already say to use slapcat/slapadd to migrate between versions. We added mdb_dump/mdb_load in 0.9.14 to handle migrations in LMDB.
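A dump-and-reload migration with those two tools could be scripted roughly as below. This is a sketch under assumptions, not a tested procedure for Consul's raft store. It relies on the `-a` (dump all sub-databases) and `-f` (file) options as I understand them from the mdb_dump and mdb_load man pages. The helper names, the `tool` parameters, and all paths are hypothetical. The dump has to be produced by an `mdb_dump` from the old build and loaded by an `mdb_load` from the new build, which is why each tool path is passed in separately.

```python
import subprocess
from pathlib import Path
from typing import List


def dump_command(env_dir: str, dump_file: str, tool: str = "mdb_dump") -> List[str]:
    """Command that writes every sub-database of env_dir to a text dump."""
    return [tool, "-a", "-f", dump_file, env_dir]


def load_command(env_dir: str, dump_file: str, tool: str = "mdb_load") -> List[str]:
    """Command that loads a text dump into the environment at env_dir."""
    return [tool, "-f", dump_file, env_dir]


def migrate(old_env: str, new_env: str, dump_file: str,
            old_dump_tool: str = "mdb_dump",
            new_load_tool: str = "mdb_load") -> None:
    """Dump with the old build's tool, then load with the new build's tool."""
    # Create the target directory up front so the load has somewhere to write.
    Path(new_env).mkdir(parents=True, exist_ok=True)
    subprocess.run(dump_command(old_env, dump_file, old_dump_tool), check=True)
    subprocess.run(load_command(new_env, dump_file, new_load_tool), check=True)
```

The agent should be stopped while this runs, and the original directory kept as a backup until the new one has been verified.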

armon commented 9 years ago

@hyc I don't see any docs for those two here: http://symas.com/mdb/doc/group__mdb.html. Also seems the docs are for 0.9.14 not 1.0. Where is the best place to look?

armon commented 9 years ago

Never mind, I see it's a CLI tool and not an API.

webcoyote commented 9 years ago

> Do you think you could provide @hyc with copies of the raft/ data directory after running it with each version? That may be of use in determining what changed.

Yes. Since he doesn't have a public email address I'll mail it to you, @armon.

armon commented 9 years ago

@webcoyote It looks like you must have built on a 64-bit platform. Our official Windows builds are 32-bit, so it looks like the mismatch there was causing the issue. (Thanks to @hyc for diagnosing.)

Looks like we can close this one down!
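
To check which kind of build wrote an existing data file, the position of the LMDB magic number in the first meta page can be inspected. The sketch below is not an official tool and rests on assumptions about the LMDB 0.9.x layout: the page header starts with a `size_t` page number followed by 8 bytes of fixed-width fields, so the 32-bit magic `0xBEEFC0DE` would sit at byte offset 12 in a file from a 32-bit build and at offset 16 in a file from a 64-bit build. Those offsets are my reading of `mdb.c` and should be verified against the LMDB source. Little-endian byte order (x86) is also assumed, and the function names are hypothetical.

```python
import struct
from typing import Optional

MDB_MAGIC = 0xBEEFC0DE  # magic constant defined in LMDB's mdb.c

# Assumed offsets of the magic in meta page 0, keyed by the writer's word size.
MAGIC_OFFSET_BY_WORD_SIZE = {32: 12, 64: 16}


def guess_word_size(first_bytes: bytes) -> Optional[int]:
    """Return 32 or 64 if the magic is found at the matching offset, else None."""
    for bits, offset in MAGIC_OFFSET_BY_WORD_SIZE.items():
        chunk = first_bytes[offset:offset + 4]
        # "<I" = little-endian unsigned 32-bit integer (x86 assumption).
        if len(chunk) == 4 and struct.unpack("<I", chunk)[0] == MDB_MAGIC:
            return bits
    return None


def guess_word_size_of_file(path: str) -> Optional[int]:
    """Read the start of an LMDB data file and guess the writer's word size."""
    with open(path, "rb") as f:
        return guess_word_size(f.read(32))
```

If the assumptions hold, a file for which this returns 32 would be rejected with MDB_INVALID by a 64-bit build, and the reverse, which matches the bidirectional error reported earlier in the thread.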