influxdata / influxdb

Scalable datastore for metrics, events, and real-time analytics
https://influxdata.com
Apache License 2.0

Can't connect to influxdb - ui on localhost:8083 not available #1164

Closed ezcocos closed 9 years ago

ezcocos commented 10 years ago

Dear All, I already raised this issue in another thread, which was closed without my getting an answer. I've found it particularly painful to work with influxdb, and I am now on the brink of switching definitively to another time series db, although I would have liked to give influxdb a fair chance.

The problem is that after many insert attempts, influxdb just gets blocked, with no possible access:

1) I cannot connect to the database via python:

File "/home/ezcocos/dev/python/pyenv2_7/local/lib/python2.7/site-packages/requests/adapters.py", line 407, in send
    raise ConnectionError(err, request=request)
requests.exceptions.ConnectionError: ('Connection aborted.', error(111, 'Connection refused'))

2) I cannot access the admin tools via localhost:8083 either -> page not available

My only solution so far has been to reinstall influxdb from scratch, which I will not be able to do in production. My question is: is there at least a way to repair influxdb so that I can access it again without completely reinstalling it and losing all the data?

I run influxdb on ubuntu 14.10.

Many thanks in advance for your help. ezcocos
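For anyone triaging the same symptom, a minimal Python sketch can help separate "nothing is listening" (the errno 111 "Connection refused" above) from other network failures. The function name `port_status` and the 2-second timeout are assumptions made for illustration; only the standard library `socket` module is used.

```python
import socket


def port_status(host: str, port: int, timeout: float = 2.0) -> str:
    """Classify a TCP port as 'open', 'refused', or 'unreachable'."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except ConnectionRefusedError:
        # errno 111 on Linux: no process is listening on that port
        return "refused"
    except OSError:
        # timeouts, DNS failures, unreachable hosts, and similar errors
        return "unreachable"
```

Calling `port_status("localhost", 8083)` and `port_status("localhost", 8086)` covers the two ports named in this report. A "refused" result while the influxdb process is still running suggests the daemon never opened its listeners, rather than a client-side problem.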

toddboom commented 9 years ago

@ezcocos What version of InfluxDB are you running? The admin interface is now compiled directly into the binary, so this issue should never come up again.

ezcocos commented 9 years ago

Many thanks for your reply. I have the following on my system: /opt/influxdb/versions/0.8.5

toddboom commented 9 years ago

@ezcocos It sounds like you're saying that this only happens after you're writing data for a while. Can you tell me a little bit more about the data that you're writing in (i.e. size, duration, frequency)?

ezcocos commented 9 years ago

It is a one off load of 10 years of daily data of say 8 variables.

v9n commented 9 years ago

@ezcocos @toddboom I also have this issue.

I tried to insert some data every 1 second. Also at the same time I run the Collectd daemon to push data via influxdb-collectd-proxy.

I'm running InfluxDB in a docker container and mount the data directory to host like this:

docker run -v /mnt/influxdb_data:/influxdb_data:rw -d -p 25826:25826/udp -p 8096:8096/udp -p 8086:8086 -p 8083:8083 -p 8081:80 -p 8125:8125/udp -p 8126:8126  grafana

When I need a new container, I run docker kill and docker rm to stop and remove the old container, then start a new one. After that, even though the influxdb binary is running, I can't access any of the HTTP endpoints, neither 8083 nor 8086.

I have a feeling that the data is corrupted, because if I clear out everything in /mnt/influxdb_data and start docker again, it works.

I tried adding -repair-ldb=true, but it doesn't seem to help.

My log file shows this:


[2014/12/02 19:30:46 UTC] [INFO] (github.com/influxdb/influxdb/coordinator.(*RaftServer).dropShardsWithRetentionPolicies:537) Checking for shards to drop
[2014/12/02 19:32:32 UTC] [INFO] (main.waitForSignals:24) Received signal: terminated
[2014/12/02 19:32:32 UTC] [INFO] (github.com/influxdb/influxdb/server.(*Server).Stop:263) Stopping server
[2014/12/02 19:32:32 UTC] [INFO] (github.com/influxdb/influxdb/server.(*Server).Stop:272) Stopping admin server
[2014/12/02 19:32:32 UTC] [INFO] (github.com/influxdb/influxdb/server.(*Server).Stop:274) admin server stopped
[2014/12/02 19:32:32 UTC] [INFO] (github.com/influxdb/influxdb/server.(*Server).Stop:276) Stopping raft server
[2014/12/02 19:32:44 UTC] [INFO] (main.setupLogging:69) Redirectoring logging to /var/log/influxdb_log.txt
[2014/12/02 19:32:44 UTC] [INFO] (main.start:164) Starting Influx Server 0.8.6 bound to 0.0.0.0...
[2014/12/02 19:32:44 UTC] [INFO] (github.com/influxdb/influxdb/server.NewServer:43) Opening database at /influxdb_data/db
[2014/12/02 19:32:44 UTC] [INFO] (github.com/influxdb/influxdb/wal.NewWAL:40) Opening wal in /influxdb_data/wal
[2014/12/02 19:32:44 UTC] [INFO] (github.com/influxdb/influxdb/api/http.(*HttpServer).EnableSsl:74) Ssl will be disabled since the ssl port or certificate path weren't set
[2014/12/02 19:32:44 UTC] [INFO] (github.com/influxdb/influxdb/coordinator.(*RaftServer).Serve:566) Initializing Raft HTTP server
[2014/12/02 19:32:44 UTC] [INFO] (github.com/influxdb/influxdb/coordinator.(*RaftServer).Serve:576) Raft Server Listening at 0.0.0.0:8090
[2014/12/02 19:32:44 UTC] [INFO] (github.com/influxdb/influxdb/coordinator.(*RaftServer).startRaft:384) Initializing Raft Server: http://f4fe021fe272:8090
[2014/12/02 19:32:44 UTC] [INFO] (github.com/influxdb/influxdb/coordinator.(*InfluxJoinCommand).Apply:252) Adding new server to the cluster config a0bee89cd150455c
[2014/12/02 19:32:44 UTC] [INFO] (github.com/influxdb/influxdb/cluster.(*ClusterConfiguration).AddPotentialServer:291) Added server to cluster config: 1, http://d98cd1d00f41:8090, d98cd1d00f41:8099
[2014/12/02 19:32:44 UTC] [INFO] (github.com/influxdb/influxdb/cluster.(*ClusterConfiguration).AddPotentialServer:292) Checking whether this is the local server local: f4fe021fe272:8099, new: d98cd1d00f41:8099
[2014/12/02 19:32:44 UTC] [INFO] (github.com/influxdb/influxdb/cluster.(*ClusterConfiguration).AddPotentialServer:301) Added the local server
[2014/12/02 19:32:44 UTC] [INFO] (github.com/influxdb/influxdb/coordinator.(*RaftServer).startRaft:409) Recovered from log
[2014/12/02 19:32:44 UTC] [INFO] (github.com/influxdb/influxdb/server.(*Server).ListenAndServe:96) Waiting for local server to be added
[2014/12/02 19:32:44 UTC] [INFO] (github.com/influxdb/influxdb/wal.(*WAL).SetServerId:109) Setting server id to 1 and recovering
[2014/12/02 19:32:44 UTC] [DEBG] (github.com/influxdb/influxdb/wal.(*WAL).recover:503) Finished wal recovery
[2014/12/02 19:32:46 UTC] [INFO] (github.com/influxdb/influxdb/coordinator.(*RaftServer).raftEventHandler:448) (raft:a0bee89cd150455c) Selected as leader. Starting leader loop.
[2014/12/02 19:32:46 UTC] [INFO] (github.com/influxdb/influxdb/datastore.(*ShardDatastore).GetOrCreateShard:158) DATASTORE: opening or creating shard /influxdb_data/db/shard_db_v2/00001
[2014/12/02 19:32:46 UTC] [INFO] (github.com/influxdb/influxdb/cluster.(*ClusterConfiguration).AddShards:1090) Adding shard to default: 1 - start: Thu Nov 27 00:00:00 +0000 UTC 2014 (1417046400). end: Thu Dec 4 00:00:00 +0000 UTC 2014 (1417651200). isLocal: true. servers: [1]
[2014/12/02 19:32:47 UTC] [DEBG] (github.com/influxdb/influxdb/coordinator.(*RaftServer).raftLeaderLoop:467) (raft:a0bee89cd150455c) Executing leader loop.
[2014/12/02 19:32:48 UTC] [DEBG] (github.com/influxdb/influxdb/coordinator.(*RaftServer).raftLeaderLoop:467) (raft:a0bee89cd150455c) Executing leader loop.
[2014/12/02 19:32:49 UTC] [DEBG] (github.com/influxdb/influxdb/coordinator.(*RaftServer).raftLeaderLoop:467) (raft:a0bee89cd150455c) Executing leader loop.
[2014/12/02 19:32:49 UTC] [INFO] (github.com/influxdb/influxdb/server.(*Server).ListenAndServe:112) Sending change connection string command (d98cd1d00f41:8099,f4fe021fe272:8099) (http://d98cd1d00f41:8090,http://f4fe021fe272:8090)
[2014/12/02 19:32:49 UTC] [INFO] (github.com/influxdb/influxdb/datastore.(*ShardDatastore).GetOrCreateShard:158) DATASTORE: opening or creating shard /influxdb_data/db/shard_db_v2/00002
[2014/12/02 19:32:49 UTC] [INFO] (github.com/influxdb/influxdb/cluster.(*ClusterConfiguration).AddShards:1090) Adding shard to default: 2 - start: Thu Sep 6 00:00:00 +0000 UTC 2001 (999734400). end: Thu Sep 13 00:00:00 +0000 UTC 2001 (1000339200). isLocal: true. servers: [1]
[2014/12/02 19:32:50 UTC] [DEBG] (github.com/influxdb/influxdb/coordinator.(*RaftServer).raftLeaderLoop:467) (raft:a0bee89cd150455c) Executing leader loop.
[2014/12/02 19:32:51 UTC] [DEBG] (github.com/influxdb/influxdb/coordinator.(*RaftServer).raftLeaderLoop:467) (raft:a0bee89cd150455c) Executing leader loop.
[2014/12/02 19:32:52 UTC] [DEBG] (github.com/influxdb/influxdb/coordinator.(*RaftServer).raftLeaderLoop:467) (raft:a0bee89cd150455c) Executing leader loop.
[2014/12/02 19:32:53 UTC] [DEBG] (github.com/influxdb/influxdb/coordinator.(*RaftServer).raftLeaderLoop:467) (raft:a0bee89cd150455c) Executing leader loop.
[2014/12/02 19:32:54 UTC] [DEBG] (github.com/influxdb/influxdb/coordinator.(*RaftServer).raftLeaderLoop:467) (raft:a0bee89cd150455c) Executing leader loop.
[2014/12/02 19:32:55 UTC] [DEBG] (github.com/influxdb/influxdb/coordinator.(*RaftServer).raftLeaderLoop:467) (raft:a0bee89cd150455c) Executing leader loop.
[2014/12/02 19:32:56 UTC] [DEBG] (github.com/influxdb/influxdb/coordinator.(*RaftServer).raftLeaderLoop:467) (raft:a0bee89cd150455c) Executing leader loop.
[2014/12/02 19:32:57 UTC] [DEBG] (github.com/influxdb/influxdb/coordinator.(*RaftServer).raftLeaderLoop:467) (raft:a0bee89cd150455c) Executing leader loop.
[2014/12/02 19:32:58 UTC] [DEBG] (github.com/influxdb/influxdb/coordinator.(*RaftServer).raftLeaderLoop:467) (raft:a0bee89cd150455c) Executing leader loop.
[2014/12/02 19:32:59 UTC] [DEBG] (github.com/influxdb/influxdb/coordinator.(*RaftServer).raftLeaderLoop:467) (raft:a0bee89cd150455c) Executing leader loop.
[2014/12/02 19:33:00 UTC] [DEBG] (github.com/influxdb/influxdb/coordinator.(*RaftServer).raftLeaderLoop:467) (raft:a0bee89cd150455c) Executing leader loop.
[2014/12/02 19:33:01 UTC] [DEBG] (github.com/influxdb/influxdb/coordinator.(*RaftServer).raftLeaderLoop:467) (raft:a0bee89cd150455c) Executing leader loop.
[2014/12/02 19:33:02 UTC] [DEBG] (github.com/influxdb/influxdb/coordinator.(*RaftServer).raftLeaderLoop:467) (raft:a0bee89cd150455c) Executing leader loop.
[2014/12/02 19:33:03 UTC] [DEBG] (github.com/influxdb/influxdb/coordinator.(*RaftServer).raftLeaderLoop:467) (raft:a0bee89cd150455c) Executing leader loop.
[2014/12/02 19:33:04 UTC] [DEBG] (github.com/influxdb/influxdb/coordinator.(*RaftServer).raftLeaderLoop:467) (raft:a0bee89cd150455c) Executing leader loop.
[2014/12/02 19:33:05 UTC] [DEBG] (github.com/influxdb/influxdb/coordinator.(*RaftServer).raftLeaderLoop:467) (raft:a0bee89cd150455c) Executing leader loop.
[2014/12/02 19:33:06 UTC] [DEBG] (github.com/influxdb/influxdb/coordinator.(*RaftServer).raftLeaderLoop:467) (raft:a0bee89cd150455c) Executing leader loop.
[2014/12/02 19:33:07 UTC] [DEBG] (github.com/influxdb/influxdb/coordinator.(*RaftServer).raftLeaderLoop:467) (raft:a0bee89cd150455c) Executing leader loop.
[2014/12/02 19:33:08 UTC] [DEBG] (github.com/influxdb/influxdb/coordinator.(*RaftServer).raftLeaderLoop:467) (raft:a0bee89cd150455c) Executing leader loop.
[2014/12/02 19:33:09 UTC] [DEBG] (github.com/influxdb/influxdb/coordinator.(*RaftServer).raftLeaderLoop:467) (raft:a0bee89cd150455c) Executing leader loop.
[2014/12/02 19:33:10 UTC] [DEBG] (github.com/influxdb/influxdb/coordinator.(*RaftServer).raftLeaderLoop:467) (raft:a0bee89cd150455c) Executing leader loop.
[2014/12/02 19:33:11 UTC] [DEBG] (github.com/influxdb/influxdb/coordinator.(*RaftServer).raftLeaderLoop:467) (raft:a0bee89cd150455c) Executing leader loop.
[2014/12/02 19:33:12 UTC] [DEBG] (github.com/influxdb/influxdb/coordinator.(*RaftServer).raftLeaderLoop:467) (raft:a0bee89cd150455c) Executing leader loop.
[2014/12/02 19:33:13 UTC] [DEBG] (github.com/influxdb/influxdb/coordinator.(*RaftServer).raftLeaderLoop:467) (raft:a0bee89cd150455c) Executing leader loop.
[2014/12/02 19:33:14 UTC] [DEBG] (github.com/influxdb/influxdb/coordinator.(*RaftServer).raftLeaderLoop:467) (raft:a0bee89cd150455c) Executing leader loop.
[2014/12/02 19:33:15 UTC] [DEBG] (github.com/influxdb/influxdb/coordinator.(*RaftServer).raftLeaderLoop:467) (raft:a0bee89cd150455c) Executing leader loop.
[2014/12/02 19:33:16 UTC] [DEBG] (github.com/influxdb/influxdb/coordinator.(*RaftServer).raftLeaderLoop:467) (raft:a0bee89cd150455c) Executing leader loop.
[2014/12/02 19:33:17 UTC] [DEBG] (github.com/influxdb/influxdb/coordinator.(*RaftServer).raftLeaderLoop:467) (raft:a0bee89cd150455c) Executing leader loop.
[2014/12/02 19:33:18 UTC] [DEBG] (github.com/influxdb/influxdb/coordinator.(*RaftServer).raftLeaderLoop:467) (raft:a0bee89cd150455c) Executing leader loop.
[2014/12/02 19:33:19 UTC] [DEBG] (github.com/influxdb/influxdb/coordinator.(*RaftServer).raftLeaderLoop:467) (raft:a0bee89cd150455c) Executing leader loop.
[2014/12/02 19:33:20 UTC] [DEBG] (github.com/influxdb/influxdb/coordinator.(*RaftServer).raftLeaderLoop:467) (raft:a0bee89cd150455c) Executing leader loop.
[2014/12/02 19:33:21 UTC] [DEBG] (github.com/influxdb/influxdb/coordinator.(*RaftServer).raftLeaderLoop:467) (raft:a0bee89cd150455c) Executing leader loop.
[2014/12/02 19:33:22 UTC] [DEBG] (github.com/influxdb/influxdb/coordinator.(*RaftServer).raftLeaderLoop:467) (raft:a0bee89cd150455c) Executing leader loop.
[2014/12/02 19:33:23 UTC] [DEBG] (github.com/influxdb/influxdb/coordinator.(*RaftServer).raftLeaderLoop:467) (raft:a0bee89cd150455c) Executing leader loop.
[2014/12/02 19:33:24 UTC] [DEBG] (github.com/influxdb/influxdb/coordinator.(*RaftServer).raftLeaderLoop:467) (raft:a0bee89cd150455c) Executing leader loop.
[2014/12/02 19:33:25 UTC] [DEBG] (github.com/influxdb/influxdb/coordinator.(*RaftServer).raftLeaderLoop:467) (raft:a0bee89cd150455c) Executing leader loop.
[2014/12/02 19:33:26 UTC] [DEBG] (github.com/influxdb/influxdb/coordinator.(*RaftServer).raftLeaderLoop:467) (raft:a0bee89cd150455c) Executing leader loop.
[2014/12/02 19:33:27 UTC] [DEBG] (github.com/influxdb/influxdb/coordinator.(*RaftServer).raftLeaderLoop:467) (raft:a0bee89cd150455c) Executing leader loop.
[2014/12/02 19:33:28 UTC] [DEBG] (github.com/influxdb/influxdb/coordinator.(*RaftServer).raftLeaderLoop:467) (raft:a0bee89cd150455c) Executing leader loop.
[2014/12/02 19:33:29 UTC] [DEBG] (github.com/influxdb/influxdb/coordinator.(*RaftServer).raftLeaderLoop:467) (raft:a0bee89cd150455c) Executing leader loop.
[2014/12/02 19:33:30 UTC] [DEBG] (github.com/influxdb/influxdb/coordinator.(*RaftServer).raftLeaderLoop:467) (raft:a0bee89cd150455c) Executing leader loop.
[2014/12/02 19:33:31 UTC] [DEBG] (github.com/influxdb/influxdb/coordinator.(*RaftServer).raftLeaderLoop:467) (raft:a0bee89cd150455c) Executing leader loop.
[2014/12/02 19:33:32 UTC] [DEBG] (github.com/influxdb/influxdb/coordinator.(*RaftServer).raftLeaderLoop:467) (raft:a0bee89cd150455c) Executing leader loop.
[2014/12/02 19:33:33 UTC] [DEBG] (github.com/influxdb/influxdb/coordinator.(*RaftServer).raftLeaderLoop:467) (raft:a0bee89cd150455c) Executing leader loop.
[2014/12/02 19:33:34 UTC] [DEBG] (github.com/influxdb/influxdb/coordinator.(*RaftServer).raftLeaderLoop:467) (raft:a0bee89cd150455c) Executing leader loop.
[2014/12/02 19:33:35 UTC] [DEBG] (github.com/influxdb/influxdb/coordinator.(*RaftServer).raftLeaderLoop:467) (raft:a0bee89cd150455c) Executing leader loop.
[2014/12/02 19:33:36 UTC] [DEBG] (github.com/influxdb/influxdb/coordinator.(*RaftServer).raftLeaderLoop:467) (raft:a0bee89cd150455c) Executing leader loop.
[2014/12/02 19:33:37 UTC] [DEBG] (github.com/influxdb/influxdb/coordinator.(*RaftServer).raftLeaderLoop:467) (raft:a0bee89cd150455c) Executing leader loop.
[2014/12/02 19:33:38 UTC] [DEBG] (github.com/influxdb/influxdb/coordinator.(*RaftServer).raftLeaderLoop:467) (raft:a0bee89cd150455c) Executing leader loop.
[2014/12/02 19:33:39 UTC] [DEBG] (github.com/influxdb/influxdb/coordinator.(*RaftServer).raftLeaderLoop:467) (raft:a0bee89cd150455c) Executing leader loop.
[2014/12/02 19:33:40 UTC] [DEBG] (github.com/influxdb/influxdb/coordinator.(*RaftServer).raftLeaderLoop:467) (raft:a0bee89cd150455c) Executing leader loop.
[2014/12/02 19:33:41 UTC] [DEBG] (github.com/influxdb/influxdb/coordinator.(*RaftServer).raftLeaderLoop:467) (raft:a0bee89cd150455c) Executing leader loop.
[2014/12/02 19:33:42 UTC] [DEBG] (github.com/influxdb/influxdb/coordinator.(*RaftServer).raftLeaderLoop:467) (raft:a0bee89cd150455c) Executing leader loop.
[2014/12/02 19:33:43 UTC] [DEBG] (github.com/influxdb/influxdb/coordinator.(*RaftServer).raftLeaderLoop:467) (raft:a0bee89cd150455c) Executing leader loop.
[2014/12/02 19:33:44 UTC] [DEBG] (github.com/influxdb/influxdb/coordinator.(*RaftServer).raftLeaderLoop:467) (raft:a0bee89cd150455c) Executing leader loop.
[2014/12/02 19:33:44 UTC] [DEBG] (github.com/influxdb/influxdb/coordinator.(*RaftServer).CompactLog:350) Testing if we should compact the raft logs
[2014/12/02 19:33:45 UTC] [DEBG] (github.com/influxdb/influxdb/coordinator.(*RaftServer).raftLeaderLoop:467) (raft:a0bee89cd150455c) Executing leader loop.
[2014/12/02 19:33:46 UTC] [DEBG] (github.com/influxdb/influxdb/coordinator.(*RaftServer).raftLeaderLoop:467) (raft:a0bee89cd150455c) Executing leader loop.
[2014/12/02 19:33:47 UTC] [DEBG] (github.com/influxdb/influxdb/coordinator.(*RaftServer).raftLeaderLoop:467) (raft:a0bee89cd150455c) Executing leader loop.
[2014/12/02 19:33:48 UTC] [DEBG] (github.com/influxdb/influxdb/coordinator.(*RaftServer).raftLeaderLoop:467) (raft:a0bee89cd150455c) Executing leader loop.
[2014/12/02 19:33:49 UTC] [DEBG] (github.com/influxdb/influxdb/coordinator.(*RaftServer).raftLeaderLoop:467) (raft:a0bee89cd150455c) Executing leader loop.
[2014/12/02 19:33:50 UTC] [DEBG] (github.com/influxdb/influxdb/coordinator.(*RaftServer).raftLeaderLoop:467) (raft:a0bee89cd150455c) Executing leader loop.
[2014/12/02 19:33:51 UTC] [DEBG] (github.com/influxdb/influxdb/coordinator.(*RaftServer).raftLeaderLoop:467) (raft:a0bee89cd150455c) Executing leader loop.
[2014/12/02 19:33:52 UTC] [DEBG] (github.com/influxdb/influxdb/coordinator.(*RaftServer).raftLeaderLoop:467) (raft:a0bee89cd150455c) Executing leader loop.
[2014/12/02 19:33:53 UTC] [DEBG] (github.com/influxdb/influxdb/coordinator.(*RaftServer).raftLeaderLoop:467) (raft:a0bee89cd150455c) Executing leader loop.
[2014/12/02 19:33:54 UTC] [DEBG] (github.com/influxdb/influxdb/coordinator.(*RaftServer).raftLeaderLoop:467) (raft:a0bee89cd150455c) Executing leader loop.

v9n commented 9 years ago

Not sure if this issue is relevant to mine: https://groups.google.com/forum/#!msg/influxdb/5vQwhnXrU-E/78X7LR2UrqUJ I'm on 0.8.6.

toddboom commented 9 years ago

@ezcocos @kureikain It sounds like this is related to some other issues where the daemon becomes unresponsive after data is written for a while. Is the API also unresponsive at this point? Could you try changing the default datastore from "rocksdb" to "leveldb" in the configuration file and try reloading your data? Would you be able to share access to the system that this InfluxDB instance is running on?

v9n commented 9 years ago

@toddboom Thanks for quick response.

"Is the API also unresponsive at this point?" Yes, it's unresponsive. Any HTTP access is unresponsive.

Let me try switching to leveldb. Yes, I can share access. Can we use TeamViewer? Or I can open a port on my modem to give you access via SSH. Or, in the worst case, I can export the whole VirtualBox image.

v9n commented 9 years ago

Good news. It's more stable with the leveldb backend now. I'm still testing and will let you know whether leveldb storage fixes this issue.

v9n commented 9 years ago

I think this issue happens when the data gets corrupted somehow, because it only happens when I run InfluxDB under heavy writes inside a docker container inside a VirtualBox VM. When I tried on EC2, everything was fine. Even if I send kill -9 to influxdb and then restart with -repair-ldb, everything works fine.

toddboom commented 9 years ago

@kureikain Are you finding that the system is still more stable with LevelDB?

v9n commented 9 years ago

@toddboom The system is more stable with LevelDB but I still experience the same issue.

I finally produced 2 data sets for you to reproduce this issue.

Here is what I did:

  1. InfluxDB is running.
  2. I take a snapshot of its data and gzip it as set1_data.tar.gz (download here: http://cl.ly/18103q070R2J).
  3. I wait a few more minutes, take another snapshot of its data, and gzip it as set2_data.tar.gz (download here: http://cl.ly/011x0t2Z2H1y).
  4. Now, if I restart InfluxDB, the influxdb process runs, but I cannot access 8083 (Admin UI) or the HTTP API on 8086.
  5. If I remove the data folder and restore from set1_data.tar.gz, InfluxDB runs fine, and the admin UI and HTTP API respond to requests.

Note that the whole of my data is just a couple MB, not big at all.
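The snapshot steps above could be scripted roughly as follows. This is only a sketch in Python: the function name `snapshot` and the timestamped file naming are assumptions, not what was actually used to produce set1_data.tar.gz and set2_data.tar.gz, and copying the data directory of a running daemon may capture an inconsistent state.

```python
import tarfile
import time
from pathlib import Path


def snapshot(data_dir: str, dest_dir: str) -> Path:
    """Write data_dir into a timestamped .tar.gz under dest_dir and return its path."""
    src = Path(data_dir)
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    stamp = time.strftime("%Y%m%dT%H%M%S")
    archive = dest / f"{src.name}_{stamp}.tar.gz"
    with tarfile.open(str(archive), "w:gz") as tar:
        # arcname keeps paths inside the archive relative to the data directory
        tar.add(str(src), arcname=src.name)
    return archive
```

For example, `snapshot("/mnt/influxdb_data", "/mnt/backups")` would archive the data directory mounted in the docker command earlier in this thread; the backup destination is a made-up path.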

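Here is the script I run to generate data points.

require 'statsd-instrument'

# Send metrics over UDP to the StatsD listener in front of InfluxDB.
StatsD.backend = StatsD::Instrument::Backends::UDPBackend.new("192.168.59.103:8125", :statsd)

threads = [
  # Every 30 seconds: report column 2 of the Mem row of `free -m`, plus a heartbeat timestamp.
  Thread.new do
    loop do
      StatsD.gauge('dev.vinh.test.freemem', `free -m | grep Mem | awk '{print $2}'`.to_i)
      StatsD.gauge('dev.vinh.test.heartbeat', Time.now.to_i)
      sleep 30
    end
  end,
  # Every 15 seconds: send a burst of 1000 to 3000 counter increments.
  Thread.new do
    loop do
      rand(1000..3000).times { StatsD.increment('dev.vinh.test.run', 19) }
      sleep 15
    end
  end
]

# Thread.new starts each thread immediately; join keeps the script alive.
threads.each(&:join)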

I'm running this on an EC2 c3.xlarge.

toddboom commented 9 years ago

@kureikain Thanks for sending all of this over - I'm going to be testing with it this afternoon.

v9n commented 9 years ago

@toddboom Thanks. This bug is hitting us and I have to run a cronjob to backup data every 15 minutes ;(

v9n commented 9 years ago

@toddboom Hey, Todd. If there is anything I can do to help debug this, let me know and I will do it. I know a bit of Go, so I am very willing to work on, use, or build any development code to solve this...

sruon commented 9 years ago

Facing the same issue today: one of the instances in my 3-node cluster has influxdb running but not responding over the API. Nothing special comes up in the log. Deleting the data folder seems to fix the issue, so I'm guessing something is corrupted.

InfluxDB 0.8.7 on SSDs

ybizeul commented 9 years ago

It looks like I have the same issue. I have a data directory that seems to be corrupted; if I delete it, it works again. Is anyone interested in taking a look at it? I only use InfluxDB to store Grafana dashboards, and it looks like it's not even up to that task...

ybizeul commented 9 years ago

Does anyone have a workaround? I can't believe this is happening to everyone; there has to be something specific to our installations that triggers it.

toddboom commented 9 years ago

@ybizeul It sounds like maybe the daemon isn't fully starting up. Are you able to successfully write data or execute queries when it's in this state?

m1keil commented 9 years ago

We are hitting the same issue with an older InfluxDB version (0.8.0), and it looks like it is the same in 0.8.8.

If I delete the data/raft folder and restart the InfluxDB daemon, it starts up with ports 8083 & 8086 open. It looks like raft doesn't tolerate hostname changes (which happen when you kill/start new containers). I noticed the following message in @kureikain's logs, and we can find something similar in our logs as well:

[2014/12/02 19:32:49 UTC] [INFO] (github.com/influxdb/influxdb/server.(*Server).ListenAndServe:112) Sending change connection string command (d98cd1d00f41:8099,f4fe021fe272:8099) (http://d98cd1d00f41:8090,http://f4fe021fe272:8090)

d98cd1d00f41 is the current container, while f4fe021fe272 is the old one (which no longer exists). Could this be due to Raft's election process and the fact that Raft cannot work with 2 servers (it needs 3 or more)? If so, we are hitting some kind of infinite timeout waiting for the second server to come alive during startup.
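To check your own logs for the same pattern, a small Python sketch like the one below can pull the hostnames out of a "Sending change connection string command" line and flag any that differ from the container's current hostname. The function names are made up for illustration, and the regular expression only assumes the log format quoted above.

```python
import re
from typing import List

PATTERN = re.compile(
    r"Sending change connection string command \(([^)]*)\) \(([^)]*)\)"
)


def hostnames_in_change_command(line: str) -> List[str]:
    """Return the hostnames from the first (host:8099,...) group of the log line."""
    match = PATTERN.search(line)
    if not match:
        return []
    # The first group looks like "d98cd1d00f41:8099,f4fe021fe272:8099"
    return [entry.rsplit(":", 1)[0] for entry in match.group(1).split(",")]


def stale_hostnames(line: str, current: str) -> List[str]:
    """Hostnames mentioned in the line that differ from the current one."""
    return [h for h in hostnames_in_change_command(line) if h != current]
```

Passing `socket.gethostname()` from inside the container as `current` would list any hostname the cluster config still remembers from an earlier container. More than one hostname in a single-node setup is the situation described in this comment.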

Just FYI, hostname issues are a common problem when moving applications into containers. I think ElasticSearch has similar issues unless you make the container hostname static.

P.S. I'm not sure if my findings are related to @ezcocos's issue, but if not, this should probably be a separate issue.

ghost commented 9 years ago

@m1keil that's an interesting theory. It seems to be random though, sometimes stopping a container and restarting it works, other times it does not. Once it stops working, it's done, the only way I can recover from it is to delete the data.

Our container uses supervisord to manage the influxdb process + other processes. supervisord is PID 1 and we stop the container by using "docker stop" which sends a SIGTERM to supervisord which in turn sends a SIGTERM to the processes it manages. Docker will wait by default for 10 sec. before sending a SIGKILL if the container didn't stop as a result of SIGTERM. We tried increasing the time between SIGTERM and SIGKILL thinking that maybe influx didn't get enough time to shut down gracefully, but that still does not help.
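The SIGTERM-then-SIGKILL sequence described above can be sketched in Python as follows, which is handy for reproducing the shutdown behaviour outside Docker. This illustrates the general pattern only, not Docker's or supervisord's actual implementation; the function name `stop_gracefully` is an assumption, and the 10-second default mirrors the grace period mentioned above.

```python
import subprocess


def stop_gracefully(proc: subprocess.Popen, grace_seconds: float = 10.0) -> str:
    """Send SIGTERM, wait up to grace_seconds, then fall back to SIGKILL."""
    proc.terminate()  # SIGTERM on POSIX
    try:
        proc.wait(timeout=grace_seconds)
        return "terminated"
    except subprocess.TimeoutExpired:
        proc.kill()  # SIGKILL on POSIX
        proc.wait()  # reap the process so it does not linger as a zombie
        return "killed"
```

If the function returns "terminated" and the daemon still fails to come back on the next start, the grace period is probably not the cause, which matches what is reported here.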

m1keil commented 9 years ago

There's definitely something "random" going on. It's very easy to reproduce this with the following Dockerfile:

FROM phusion/baseimage:0.9.15

RUN curl -O http://s3.amazonaws.com/influxdb/influxdb_0.8.8_amd64.deb && \
    dpkg -i influxdb_0.8.8_amd64.deb && \
    rm -rf influxdb_0.8.8_amd64.deb

RUN mkdir -p /etc/service/influxdb && \
    echo '#!/bin/sh' > /etc/service/influxdb/run && \
    echo 'exec /usr/bin/influxdb -config=/opt/influxdb/shared/config.toml' >> /etc/service/influxdb/run && \
    chmod +x /etc/service/influxdb/run

This will build a small container image with InfluxDB and runit as the init system. Runit will monitor Influx's state and restart it as soon as it goes down. To build it:

docker build -t influxdb <path to Dockerfile>

Case 1

Run Docker container without static hostname

docker run -t -d --name="influx" -v <local dir>:/opt/influxdb/shared/data influxdb /sbin/my_init

This should start the container and populate all Influx data in <local dir>.

Now kill the container and start another one:

$ docker rm -f influx
$ docker run -t -d --name="influx" -v /home/vagrant/influxdata:/opt/influxdb/shared/data influxdb /sbin/my_init

Execute bash process inside the container and inspect its state:

$ docker exec -it influx bash
root@901f13fc06b9:/# ps -ef | grep influx
root       106   102  0 17:38 ?        00:00:00 runsv influxdb
root       107   106  0 17:38 ?        00:00:00 /usr/bin/influxdb -config=/opt/influxdb/shared/config.toml
root       136   120  0 17:39 ?        00:00:00 grep --color=auto influx
root@901f13fc06b9:/#  ss -tln
State       Recv-Q Send-Q                                                          Local Address:Port                                                            Peer Address:Port
LISTEN      0      128                                                                         *:22                                                                         *:*
LISTEN      0      128                                                                        :::8099                                                                      :::*
LISTEN      0      128                                                                        :::8083                                                                      :::*
LISTEN      0      128                                                                        :::8086                                                                      :::*
LISTEN      0      128                                                                        :::22                                                                        :::*
LISTEN      0      128

Now send SIGTERM to influxdb process:

root@901f13fc06b9:/# killall -15 influxdb
.... wait 30 seconds ....
root@901f13fc06b9:/# ss -tln
State       Recv-Q Send-Q                                                          Local Address:Port                                                            Peer Address:Port
LISTEN      0      128                                                                         *:22                                                                         *:*
LISTEN      0      128                                                                        :::22                                                                        :::*
LISTEN      0      128                                                                        :::8090                                                                      :::*

Sometimes it recovers after the first SIGTERM, but sending an additional SIGTERM reproduces the problem. That's the random part. It seems like the only way to recover is to delete raft's folder:

root@901f13fc06b9:/# rm -rf /opt/influxdb/shared/data/raft
root@901f13fc06b9:/# killall -15 influxdb
... wait few seconds ..
root@901f13fc06b9:/# ss -tln
State       Recv-Q Send-Q                                                          Local Address:Port                                                            Peer Address:Port
LISTEN      0      128                                                                         *:22                                                                         *:*
LISTEN      0      128                                                                        :::8099                                                                      :::*
LISTEN      0      128                                                                        :::8083                                                                      :::*
LISTEN      0      128                                                                        :::8086                                                                      :::*
LISTEN      0      128                                                                        :::22                                                                        :::*
LISTEN      0      128                                                                        :::8090                                                                      :::*
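The before/after comparison of `ss -tln` output can be automated with a short Python sketch. The function names are hypothetical, and the default set of expected ports (8083, 8086, 8090, 8099) is simply the set that appears in the healthy listings in this thread.

```python
from typing import Iterable, Set


def listening_ports(ss_output: str) -> Set[int]:
    """Collect local ports from the LISTEN rows of `ss -tln` output."""
    ports = set()
    for row in ss_output.splitlines():
        fields = row.split()
        # Expected columns: State, Recv-Q, Send-Q, Local Address:Port, Peer Address:Port
        if len(fields) < 4 or fields[0] != "LISTEN":
            continue  # header rows and truncated rows are skipped
        port = fields[3].rsplit(":", 1)[-1]
        if port.isdigit():
            ports.add(int(port))
    return ports


def missing_ports(
    ss_output: str, expected: Iterable[int] = (8083, 8086, 8090, 8099)
) -> Set[int]:
    """Ports from `expected` that have no LISTEN row."""
    return set(expected) - listening_ports(ss_output)
```

Feeding it the stuck-state listing (only 22 and 8090 listening) would report 8083, 8086, and 8099 as missing: the Raft port is open but the admin UI, HTTP API, and protobuf listeners never came up.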

Case 2

Run Docker container with static hostname

Remove the previously running container and delete Influx's data files from <local dir>.

Now repeat the same steps as in Case 1, only this time launch the container instances with a static hostname:

$ docker run -t -d --name="influx" --hostname="influxdb" -v /home/vagrant/influxdata:/opt/influxdb/shared/data influxdb /sbin/my_init

You'll see that no matter how many SIGTERMs you send its way, it recovers successfully after a few seconds.

entone commented 9 years ago

Using fig and/or docker-compose, I was running into this same issue. I initially thought it was a data volume issue, but setting the hostname as mentioned fixed the problem.

pauldix commented 9 years ago

This will be fixed in 0.9.0. We won't be making any new releases in the 0.8 line. In the meantime, set the hostname in your config file and it should work across restarts.

bercab commented 9 years ago

In case it helps someone running Docker: we have been able to recover influxdb from this issue by running the container with the same original hostname.

In fact, this has been the only way to recover while keeping the old data.

You can extract the original Docker hostname (container ID) from the raft/log binary log in your data dir.

Opening the file in a binary editor, you get something like:

...
{"name":"a4ba0aaf1b324944","connectionString":"http://03fba15f761d:8090","protobufConnectionString":"03fba15f761d:8099"}
...

So the ID in "connectionString" is your old Docker ID.
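If you would rather not open the file in a binary editor, a Python sketch along these lines can scan the raw bytes for the same JSON fragment. It assumes the entries look like the excerpt above; the function name is made up, and this is not an official InfluxDB tool.

```python
import re
from typing import List

# Matches fragments such as "connectionString":"http://03fba15f761d:8090"
CONNECTION_RE = re.compile(rb'"connectionString":"http://([^:"/]+):\d+"')


def hostnames_in_raft_log(raw: bytes) -> List[str]:
    """Return distinct hostnames found in connectionString entries, in order of appearance."""
    seen: List[str] = []
    for match in CONNECTION_RE.finditer(raw):
        host = match.group(1).decode("ascii", errors="replace")
        if host not in seen:
            seen.append(host)
    return seen
```

Reading the raft/log file in your data dir with `open(path, "rb").read()` and passing the bytes to `hostnames_in_raft_log` should list the hostname(s) that the Raft log still expects.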

Recover with:

docker run --hostname=03fba15f761d  ...

beckettsean commented 9 years ago

@gunnaraasen to note re: Docker

aboltart commented 8 years ago

@bercab Thanks, your comment helped

lourot commented 8 years ago

This issue was closed because it was planned to be fixed in 0.9.0 according to @pauldix, but the issue is still there in 0.9.4.2 and I can't find anything related in the changelog. Will this issue be reopened?

beckettsean commented 8 years ago

@AurelienLourot InfluxDB 0.9 is no longer receiving code updates. There will be no fixes to the 0.9.4.2 code base. I would recommend upgrading to InfluxDB 0.9.6 to see if that helps.

Also, please note that all other reports in this issue are for InfluxDB 0.8.x, so it is likely that while your symptoms appear similar it's not actually the same underlying cause. I encourage you to email the mailing list at influxdb@googlegroups.com for assistance.