influxdata / influxdb

Scalable datastore for metrics, events, and real-time analytics
https://influxdata.com
Apache License 2.0

Wrong query results after a node rejoins the cluster #1791

Closed kylezh closed 9 years ago

kylezh commented 9 years ago

Steps to reproduce the problem:

  1. Start a cluster with nodes A, B, and C.
  2. Write one record.
  3. Kill influxdb on nodeA; C becomes the leader.
  4. Restart influxdb on nodeA; it rejoins the cluster as a follower.
  5. Write a second record.

The second record can be selected from nodeB and nodeC, but not from nodeA: a query against nodeA returns only the first record.

otoolep commented 9 years ago

Please show all curl commands required to reproduce this issue, including creation of the database and retention policy.
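
For reference, a minimal sketch of what such commands look like against the 0.9-era HTTP API, assuming the default port 8086, hypothetical hostnames nodeA/nodeB/nodeC, placeholder database/policy/measurement names, and the JSON write body used by the 0.9 release candidates:

    # create the database and a retention policy replicated across the 3 nodes
    curl -G 'http://nodeA:8086/query' --data-urlencode "q=CREATE DATABASE mydb"
    curl -G 'http://nodeA:8086/query' --data-urlencode "q=CREATE RETENTION POLICY p1 ON mydb DURATION 7d REPLICATION 3"

    # write one record (JSON write format of the 0.9 RCs)
    curl -XPOST 'http://nodeA:8086/write' -d '{"database":"mydb","retentionPolicy":"p1","points":[{"name":"m1","fields":{"value":1}}]}'

    # query every node and compare the results
    for node in nodeA nodeB nodeC; do
        curl -G "http://$node:8086/query" --data-urlencode "db=mydb" --data-urlencode "q=SELECT * FROM m1"
    done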


oliveagle commented 9 years ago

I don't know if my problem is the same as @kylezh's.

I'm running a single InfluxDB instance inside a Docker container, with no volume and a static hostname:

     docker run -d -p 80:80 -p 8083:8083 -p 8084:8084 -p 8086:8086 -p 9022:22 --name="influxdb" --hostname="influxdb" grafana_influxdb /usr/bin/supervisord
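
If the data should survive removal of the container, one variant is to mount the InfluxDB directories from a host volume. A sketch, where /srv/influxdb is a hypothetical host path and /tmp/influxdb/development is the parent of the raft, db, and state directories from the config below:

    # same container, but with the raft/db/state directories kept on the host
    docker run -d -p 80:80 -p 8083:8083 -p 8084:8084 -p 8086:8086 -p 9022:22 \
        -v /srv/influxdb:/tmp/influxdb/development \
        --name="influxdb" --hostname="influxdb" grafana_influxdb /usr/bin/supervisord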

config.toml

bind-address = "0.0.0.0" 
reporting-disabled = false
[initialization]
join-urls = ""

[authentication]
enabled = false

[admin]
enabled = true
port = 8083

[api]
[[graphite]]
enabled = false

[collectd]
enabled = false

[udp]
enabled = false

[broker]
dir = "/tmp/influxdb/development/raft"
port = 8086

[data]
dir = "/tmp/influxdb/development/db"
port = 8086

retention-check-enabled = false
retention-check-period = "10m"

[cluster]
dir = "/tmp/influxdb/development/state"

[logging]
file = "/var/log/influxdb/influxd.log"

Initially everything was fine: I created the database metrics and the retention policy p1, and data could be written into InfluxDB.

root@influxdb:/opt/influxdb# ./influx
InfluxDB shell 0.9.0-rc7
Connected to http://localhost:8086 version 0.9.0-rc7
> show databases
name    tags    name
----    ----    ----
        metrics
> use metrics
Using database metrics
> select Count(Value) from metric1.1
name        tags    time            Count
----        ----    ----            -----
metric1.1       1970-01-01T00:00:00Z    46
> 
2015/03/10 17:33:57 [1970-01-01T00:00:00Z 38]         # NOTICE, results of query: "select Count(Value) from metric1.1"
2015/03/10 17:33:57 Fast-PING: response_time: 4ms, influxdb version: 0.9.0-rc7, queue length: 0, buf size: 100
2015/03/10 17:33:58 Fast-PING: response_time: 2ms, influxdb version: 0.9.0-rc7, queue length: 0, buf size: 100
 <- tick ----------------------
2015/03/10 17:33:58  * md len:100 [influxdb] consuming: boltQ len: 0 , mdCh len: 0, buf size: 0
2015/03/10 17:33:58 Fast-PING: response_time: 5ms, influxdb version: 0.9.0-rc7, queue length: 0, buf size: 100
2015/03/10 17:33:59 Fast-PING: response_time: 2ms, influxdb version: 0.9.0-rc7, queue length: 0, buf size: 100
 <- tick ----------------------
2015/03/10 17:33:59 Fast-PING: response_time: 2ms, influxdb version: 0.9.0-rc7, queue length: 1, buf size: 0
2015/03/10 17:34:00 Fast-PING: response_time: 1ms, influxdb version: 0.9.0-rc7, queue length: 1, buf size: 0
 <- tick ----------------------
2015/03/10 17:34:00  - md len:200 [influxdb] backfilling:, boltQ len: 0
2015/03/10 17:34:00 [1970-01-01T00:00:00Z 41]
2015/03/10 17:34:00 Fast-PING: response_time: 5ms, influxdb version: 0.9.0-rc7, queue length: 0, buf size: 100
2015/03/10 17:34:01 Fast-PING: response_time: 2ms, influxdb version: 0.9.0-rc7, queue length: 0, buf size: 100
 <- tick ----------------------
2015/03/10 17:34:01  - md len:100 [influxdb] backfilling:, boltQ len: 0
2015/03/10 17:34:01 [1970-01-01T00:00:00Z 43]         #NOTICE: increasing.

But after I _kill the container and start it again_, everything seems to work except that I _cannot write data into InfluxDB anymore_.
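
For reference, the kill-and-restart step amounts to something like this (a sketch; docker kill and docker start keep the container's filesystem, so the /tmp data directories are preserved):

    docker kill influxdb    # stop the running container
    docker start influxdb   # start the same container again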

2015/03/10 17:35:51 [1970-01-01T00:00:00Z 46]      # NOTICE: this number won't increase.
2015/03/10 17:35:51 Fast-PING: response_time: 4ms, influxdb version: 0.9.0-rc7, queue length: 8, buf size: 0
2015/03/10 17:35:52 Fast-PING: response_time: 1ms, influxdb version: 0.9.0-rc7, queue length: 8, buf size: 0
 <- tick ----------------------
2015/03/10 17:35:52  - md len:100 [influxdb] backfilling:, boltQ len: 7
2015/03/10 17:35:52 [1970-01-01T00:00:00Z 46]
2015/03/10 17:35:52 Fast-PING: response_time: 11ms, influxdb version: 0.9.0-rc7, queue length: 7, buf size: 100
2015/03/10 17:35:53 Fast-PING: response_time: 2ms, influxdb version: 0.9.0-rc7, queue length: 7, buf size: 100
 <- tick ----------------------
2015/03/10 17:35:53  * md len:100 [influxdb] consuming: boltQ len: 7 , mdCh len: 0, buf size: 0
2015/03/10 17:35:53  - md len:200 [influxdb] backfilling:, boltQ len: 6
2015/03/10 17:35:53 [1970-01-01T00:00:00Z 46]
2015/03/10 17:35:53 Fast-PING: response_time: 6ms, influxdb version: 0.9.0-rc7, queue length: 6, buf size: 100
 <- tick ----------------------
2015/03/10 17:35:54 Fast-PING: response_time: 7ms, influxdb version: 0.9.0-rc7, queue length: 5, buf size: 121
2015/03/10 17:35:54  - md len:200 [influxdb] backfilling:, boltQ len: 5
2015/03/10 17:35:54 [1970-01-01T00:00:00Z 46]

What's more, _no error is raised_:

func writeToInfluxdb(XXXX) error {
    // ...

    res, err := w.cli.Write(write)
    if err != nil {
        // log.Println(res, err)
        // TODO: remove this log.Println
        log.Println(" -E- writeMD failed: ", err)
        return err
    }
    if res != nil && res.Err != nil {
        log.Println(" -E- writeMD failed: res.Err: ", res.Err)
        return fmt.Errorf("res.Err: %s", res.Err)
    }
    // no transport error and no error in the response body
    return nil
}

Writing data with the web GUI also doesn't work.
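
One way to see what the server itself returns, bypassing both the Go client and the web GUI, is a raw HTTP write. A sketch assuming the JSON write body of the 0.9 RCs, with the database, policy, and field names taken from the setup above:

    # -i prints the HTTP status line, so a write the server rejects silently should still show up here
    curl -i -XPOST 'http://localhost:8086/write' \
        -d '{"database":"metrics","retentionPolicy":"p1","points":[{"name":"metric1.1","fields":{"Value":1}}]}'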

beckettsean commented 9 years ago

Please provide a repro case with the latest RC.