hashicorp / consul

Consul is a distributed, highly available, and data center aware solution to connect and configure applications across dynamic, distributed infrastructure.
https://www.consul.io
Other
28.31k stars 4.42k forks source link

unexpected end of JSON input after server crash #1221

Closed amir-stratoscale closed 6 years ago

amir-stratoscale commented 9 years ago

After rebooting a server running Consul agent (in server mode), attempting to restart Consul's agent with its previously used parameters throws these errors: ==> Starting Consul agent... ==> Error starting agent: unexpected end of JSON input

Looking in the directories: /checks & /services, I see several files with size 0.

The server's reboot occurred at the same time that we registered health checks to Consul, so I believe that the files are corrupt (empty actually) because they are not being written atomically.

  1. Can the files written to the directories mentioned be written atomically?
  2. As a workaround, is it safe to delete these directories and restart Consul? It seems to work.

Thanks, Amir

slackpad commented 9 years ago

Yeah this could probably done in a little safer manner to reduce the chances of corrupting them. It should be safe to delete them and restart, as long as you re-register the checks since the agent is the source of truth about what's registered.

JeanMertz commented 9 years ago

:+1: for this. We ran into this as well:

==> WARNING: Expect Mode enabled, expecting 3 servers
==> Starting raft data migration...
==> Starting Consul agent...
==> Error starting agent: failed decoding service from /data/services/3d2e6bded046ca2e3f906f04235b279b: unexpected end of JSON input
krestjaninoff commented 8 years ago

The same problem :(

MrMMorris commented 8 years ago

saw this as well with 0.6.0. I just made a config change then restarted consul and a couple of the clients had empty files in services/ that when I deleted and restarted, solved the problem.

MrMMorris commented 8 years ago

This happened about half the times I made changes to node_ttl on a running consul client and then restarted consul. Let me know if I can help debug.

I did see serf: Failed to update snapshot: write /opt/consul/data/serf/local.snapshot: no space left on device before the restart. Not sure if related.

When there was 2 empty files, I actually got a useful error for the second one:

Error starting agent: failed decoding service file "/opt/consul/data/services/4e2b1752dd646044507e9ac846479a90": unexpected end of JSON input

alistanis commented 8 years ago

I have a pull request up for this in #1750

camerondavison commented 8 years ago

This looks like the same problem that nomad had in hashicorp/nomad#1357 which IMHO is a two part solution. Firstly fix the write to be atomic like in #1750 but secondly still allow startup of corrupt state.

alistanis commented 8 years ago

Out of curiosity, did you find something in my implementation that you didn't agree with, and is startup with corrupt state a desirable thing for a system responsible for dns and service discovery? I personally feel like consul's config files need to be as guaranteed as possible.

flegler-zz commented 8 years ago

Not sure if https://github.com/hashicorp/nomad/issues/1367 is the same thing so I am also posting here.

+1 for this fix.

Occasionally I see this issue with Nomad 0.4.0

# /path/nomad agent -config /path/config.json
    Loaded configuration from /path/config.json
==> Starting Nomad agent...
==> Error starting agent: client setup failed: failed to restore state: 2 error(s) occurred:

* failed to decode state: unexpected end of JSON input
* failed to decode state: unexpected end of JSON input

It turns out that some state.json file are empty

# /...path.../nomad version
Nomad v0.4.0
# find /...nomad_directory.../ -name state.json | xargs ls -la
-rw------- 1 root root    0 Aug 11 13:13 /nomad_directory/client/client/alloc/5612e3fc-078d-0a08-049a-b8169d50920c/state.json
-rw------- 1 root root 2760 Aug  8 12:01 /nomad_directory/client/client/alloc/5612e3fc-078d-0a08-049a-b8169d50920c/task-a519f3bf481a344e286bbdab7d3cfe5f/state.json
-rw------- 1 root root    0 Aug 11 13:13 /nomad_directory/client/client/alloc/643ace46-34e0-7861-c26e-43cf5c82a6c1/state.json
-rw------- 1 root root    0 Aug  8 19:52 /nomad_directory/client/client/alloc/643ace46-34e0-7861-c26e-43cf5c82a6c1/task-e71f891226c0be65a6a1d38f954cc4b8/state.json

Deleting /nomad_directory/ helped to overcome the issue.

Not sure what caused this. It might have been the shutdown of the nodes. There was no shortage of disk space.

flegler-zz commented 8 years ago

I just realized that my previous comment was for a similar issue with Nomad. Now I have experienced it with Consul too on 0.6.3.dev.

# /path/consul agent -config-file /path/config.json
==> Starting Consul agent...
==> Error starting agent: failed decoding service file "/path/services/18501c56aa3648940425d400c66dad9a": unexpected end of JSON input

# ls -la /path/services/18501c56aa3648940425d400c66dad9a
-rw------- 1 root root 0 Aug 29 19:50 /path/services/18501c56aa3648940425d400c66dad9a

# /path/consul version
Consul 0.6.3.dev
Consul Protocol: 3 (Understands back to: 1)
zg commented 6 years ago

This issue is still valid in 0.8.5 with an empty JSON file in /var/consul/services:

# consul version
Consul v0.8.5
Protocol 2 spoken by default, understands 2 to 3 (agent will automatically use protocol >2 when speaking to compatible agents)
# cat /var/consul/services/foo
# /usr/bin/consul agent -config-dir=/etc/consul.d -pid-file /tmp/consul.pid
==> WARNING: LAN keyring exists but -encrypt given, using keyring
==> Starting Consul agent...
==> Error starting agent: failed decoding service file "/var/consul/services/foo": unexpected end of JSON input
slackpad commented 6 years ago

Hi @zg did you experience a crash/running out of disk space to get the empty file, or did it occur some other way?

zg commented 6 years ago

I created the tmp file myself as a proof of concept, but previously there was an incident where consul couldn’t start due to zero byte tmp files that were in the data directory that we previously were unaware of.

slackpad commented 6 years ago

I see. I'll kick this back open so we can take a look.

pjelar commented 6 years ago

@zg how did you resolve this? I have an issue where I just pushed a rolling cluster update using kops and one of my consul instances failed to start as it's looking for a service file that doesn't exist.

==> Error starting agent: failed decoding service file "/var/lib/consul/services/a369186d812d2a6ea01b87b6956ef828-340fa39d-14d1-e3f2-1305-d2228422425d.tmp": unexpected end of JSON input
MrMMorris commented 6 years ago

@pjelar delete the empty services file and restart consul

pjelar commented 6 years ago

@MrMMorris there wasn't a file there. I upgraded to 1.0.0 and the problem went away.

zg commented 6 years ago

It could be that an upgrade deletes the tmp files.

pjelar commented 6 years ago

The files weren't there @zg I deleted them. I tried placing valid .json files there too. More likely that 1.0.0 handles this issue better than 0.8.5, it wasn't a normal situation as I also had to do a peers.json manual recovery.

zg commented 6 years ago

That’s how I resolved my issue as well, by deleting the empty JSON files.

nh2 commented 1 year ago

Consul v1.15.3 still does not write the server_metadata.json file atomically, here is the strace showing the straight openat(..., O_TRUNC), and write(), no rename():

openat(AT_FDCWD</>, "/var/lib/consul/server_metadata.json", O_WRONLY|O_CREAT|O_TRUNC|O_CLOEXEC, 0600 <unfinished ...>
<... openat resumed>)     = 7</var/lib/consul/server_metadata.json>
fcntl(7</var/lib/consul/server_metadata.json>, F_GETFL) = 0x8001 (flags O_WRONLY|O_LARGEFILE)
fcntl(7</var/lib/consul/server_metadata.json>, F_SETFL, O_WRONLY|O_NONBLOCK|O_LARGEFILE) = 0
epoll_ctl(3<anon_inode:[eventpoll]>, EPOLL_CTL_ADD, 7</var/lib/consul/server_metadata.json>, {events=EPOLLIN|EPOLLOUT|EPOLLRDHUP|EPOLLET, data={u32=1970443432, u64=140103803638952}}) = -1 EPERM (Operation not permitted)
fcntl(7</var/lib/consul/server_metadata.json>, F_GETFL) = 0x8801 (flags O_WRONLY|O_NONBLOCK|O_LARGEFILE)
fcntl(7</var/lib/consul/server_metadata.json>, F_SETFL, O_WRONLY|O_LARGEFILE) = 0
write(7</var/lib/consul/server_metadata.json>, "{\"last_seen_unix\":1695257357}", 29) = 29
close(7</var/lib/consul/server_metadata.json>) = 0
soupdiver commented 10 months ago

I also still have the same problem

NathanPB commented 8 months ago

In my case, I rm'd server-metadata.json in my Consul data directory, it was a plain empty file. Later, had to rm all the consul data directory, but not sure if this is related