influxdata / influxdb

Scalable datastore for metrics, events, and real-time analytics
https://influxdata.com
Apache License 2.0

Influxdb crashes after 2 hours #7640

Closed rubycut closed 7 years ago

rubycut commented 7 years ago

Bug report

System info: Influxdb 1.1.0 Os: Debian

Steps to reproduce:

  1. Start influxdb
  2. Create 500 different databases with 500 users
  3. Auth should be on
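For anyone reproducing this, a minimal sketch of the setup in InfluxQL (database and user names are placeholders, repeated for each of the 500 databases):

```sql
-- One database + one user per iteration, 500 times
CREATE DATABASE "db_1";
CREATE USER "user_1" WITH PASSWORD 'secret';
GRANT ALL ON "db_1" TO "user_1";
```

Auth is then enabled by setting `auth-enabled = true` under the `[http]` section of `influxdb.conf` and restarting the daemon.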

Expected behavior:

Should run normally

Actual behavior:

Crashes after two hours

rubycut commented 7 years ago

@jwilder, after I changed the shard group duration, the number of shards has been decreasing little by little every day over the last two weeks. The sudden crashes of influxdb are no longer happening as often.

It's quite possible that once thousands of these 24-hour shards expire, influxdb will stop crashing altogether.
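For reference, the shard group duration change mentioned above can be made with an `ALTER RETENTION POLICY` statement (the database and policy names here are placeholders):

```sql
-- Widen the shard group duration so far fewer shards are created per day
ALTER RETENTION POLICY "autogen" ON "mydb" SHARD DURATION 14d
```

Note this only affects newly created shard groups; existing 24-hour shards remain until they expire.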

weshmashian commented 7 years ago

Upgrading to Influx 1.2.0-rc1 fixed the memory leak issue we were having. We've also dropped ~250 subscriptions, which brought the number of goroutines down to a reasonable level.

However, Influx still crashes after 90-180 minutes of uptime, and it doesn't seem related to usage (write-only vs. mixed query load).
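For anyone retracing the subscription cleanup mentioned above, subscriptions can be listed and removed with InfluxQL (the subscription, database, and retention policy names below are placeholders):

```sql
-- List all subscriptions, then drop one by name
SHOW SUBSCRIPTIONS
DROP SUBSCRIPTION "sub0" ON "mydb"."autogen"
```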

jwilder commented 7 years ago

@rubycut Has stability improved since lowering the number of shards?

rubycut commented 7 years ago

@jwilder, we still have crashes. We are running 1.2.0-rc1, and my colleague @weshmashian is running tests and trying various things. The number of shards has decreased significantly since we switched to 14-day shard groups in mid-December, but we still have crashes, usually after a few hours. The longest period we were able to run without a crash is 48 hours.

@weshmashian can provide details.

weshmashian commented 7 years ago

We're actually running three versions: 1.1.0, 1.2.0-rc1, and 1.2.0. So far rc1 has had no crashes in the past three days. We're trying to reproduce the same behavior on 1.2.0 (matching configs and using the same dataset), but so far without success.

We've also started getting the following panic on one 1.2.0-rc1 instance:

panic: count of short block: got 0, exp 1

goroutine 62183 [running]:
panic(0x9cd1e0, 0xc6b6936eb0)
        /home/influx/.gvm/gos/go1.7.4/src/runtime/panic.go:500 +0x1a1
github.com/influxdata/influxdb/tsdb/engine/tsm1.BlockCount(0x0, 0x0, 0x0, 0x8)
        /home/influx/go/src/github.com/influxdata/influxdb/tsdb/engine/tsm1/encoding.go:224 +0x315
github.com/influxdata/influxdb/tsdb/engine/tsm1.(*tsmKeyIterator).combine(0xc5b67aca00, 0x0, 0xa763a0, 0x1, 0xc7b9cbfbc0)
        /home/influx/go/src/github.com/influxdata/influxdb/tsdb/engine/tsm1/compact.go:1165 +0x6eb
github.com/influxdata/influxdb/tsdb/engine/tsm1.(*tsmKeyIterator).merge(0xc5b67aca00)
        /home/influx/go/src/github.com/influxdata/influxdb/tsdb/engine/tsm1/compact.go:1101 +0x100
github.com/influxdata/influxdb/tsdb/engine/tsm1.(*tsmKeyIterator).Next(0xc5b67aca00, 0xc813bb36ea)
        /home/influx/go/src/github.com/influxdata/influxdb/tsdb/engine/tsm1/compact.go:1068 +0x2bd
github.com/influxdata/influxdb/tsdb/engine/tsm1.(*Compactor).write(0xc5ae0d8b40, 0xc5bb7f1db0, 0x47, 0xe2c620, 0xc5b67aca00, 0x0, 0x0)
        /home/influx/go/src/github.com/influxdata/influxdb/tsdb/engine/tsm1/compact.go:774 +0x29b
github.com/influxdata/influxdb/tsdb/engine/tsm1.(*Compactor).writeNewFiles(0xc5ae0d8b40, 0x22f, 0x6, 0xe2c620, 0xc5b67aca00, 0xe2c620, 0xc5b67aca00, 0x0, 0x0, 0xc5ca243bf8)
        /home/influx/go/src/github.com/influxdata/influxdb/tsdb/engine/tsm1/compact.go:728 +0x284
github.com/influxdata/influxdb/tsdb/engine/tsm1.(*Compactor).compact(0xc5ae0d8b40, 0xba8400, 0xc5bb7f5300, 0x2, 0x2, 0x0, 0x0, 0x0, 0x0, 0x0)
        /home/influx/go/src/github.com/influxdata/influxdb/tsdb/engine/tsm1/compact.go:654 +0x52a
github.com/influxdata/influxdb/tsdb/engine/tsm1.(*Compactor).CompactFull(0xc5ae0d8b40, 0xc5bb7f5300, 0x2, 0x2, 0x0, 0x0, 0x0, 0x0, 0x0)
        /home/influx/go/src/github.com/influxdata/influxdb/tsdb/engine/tsm1/compact.go:672 +0x18e
github.com/influxdata/influxdb/tsdb/engine/tsm1.(*compactionStrategy).compactGroup.func1(0xc5b4dcefc0, 0xc5bb7f5300, 0x2, 0x2, 0x0, 0x0, 0x0, 0x0, 0x0)
        /home/influx/go/src/github.com/influxdata/influxdb/tsdb/engine/tsm1/engine.go:1072 +0x15d
github.com/influxdata/influxdb/tsdb/engine/tsm1.(*compactionStrategy).compactGroup(0xc5b4dcefc0, 0x0)
        /home/influx/go/src/github.com/influxdata/influxdb/tsdb/engine/tsm1/engine.go:1074 +0x47d
github.com/influxdata/influxdb/tsdb/engine/tsm1.(*compactionStrategy).Apply.func1(0xc5b7551f80, 0xc5b4dcefc0, 0x0)
        /home/influx/go/src/github.com/influxdata/influxdb/tsdb/engine/tsm1/engine.go:1047 +0x5d
created by github.com/influxdata/influxdb/tsdb/engine/tsm1.(*compactionStrategy).Apply
        /home/influx/go/src/github.com/influxdata/influxdb/tsdb/engine/tsm1/engine.go:1048 +0xcc

Full version: 1.2.0~rc1, branch master, commit bb029b54447fc2806af00fe1cddd44146476d3f1

e-dard commented 7 years ago

@weshmashian how often does that panic occur? Has it happened in the final 1.2.0 release?

weshmashian commented 7 years ago

@e-dard it happened on final 1.2.0 as well:

panic: count of short block: got 0, exp 1

goroutine 53530 [running]:
panic(0x9d12c0, 0xc6ad599740)
        /usr/local/go/src/runtime/panic.go:500 +0x1a1
github.com/influxdata/influxdb/tsdb/engine/tsm1.BlockCount(0x0, 0x0, 0x0, 0x8)
        /root/go/src/github.com/influxdata/influxdb/tsdb/engine/tsm1/encoding.go:224 +0x315
github.com/influxdata/influxdb/tsdb/engine/tsm1.(*tsmKeyIterator).combine(0xc4a7675b00, 0x0, 0xa7a980, 0x1, 0xc7086c8c00)
        /root/go/src/github.com/influxdata/influxdb/tsdb/engine/tsm1/compact.go:1165 +0x6eb
github.com/influxdata/influxdb/tsdb/engine/tsm1.(*tsmKeyIterator).merge(0xc4a7675b00)
        /root/go/src/github.com/influxdata/influxdb/tsdb/engine/tsm1/compact.go:1101 +0x100
github.com/influxdata/influxdb/tsdb/engine/tsm1.(*tsmKeyIterator).Next(0xc4a7675b00, 0xc613bb36ea)
        /root/go/src/github.com/influxdata/influxdb/tsdb/engine/tsm1/compact.go:1068 +0x2bd
github.com/influxdata/influxdb/tsdb/engine/tsm1.(*Compactor).write(0xc5183ba050, 0xc57fbdbdb0, 0x47, 0xe30720, 0xc4a7675b00, 0x0, 0x0)
        /root/go/src/github.com/influxdata/influxdb/tsdb/engine/tsm1/compact.go:774 +0x29b
github.com/influxdata/influxdb/tsdb/engine/tsm1.(*Compactor).writeNewFiles(0xc5183ba050, 0x22f, 0x6, 0xe30720, 0xc4a7675b00, 0xe30720, 0xc4a7675b00, 0x0, 0x0, 0xc57cba1bf8)
        /root/go/src/github.com/influxdata/influxdb/tsdb/engine/tsm1/compact.go:728 +0x284
github.com/influxdata/influxdb/tsdb/engine/tsm1.(*Compactor).compact(0xc5183ba050, 0xbacc00, 0xc570c61aa0, 0x2, 0x2, 0x0, 0x0, 0x0, 0x0, 0x0)
        /root/go/src/github.com/influxdata/influxdb/tsdb/engine/tsm1/compact.go:654 +0x52a
github.com/influxdata/influxdb/tsdb/engine/tsm1.(*Compactor).CompactFull(0xc5183ba050, 0xc570c61aa0, 0x2, 0x2, 0x0, 0x0, 0x0, 0x0, 0x0)
        /root/go/src/github.com/influxdata/influxdb/tsdb/engine/tsm1/compact.go:672 +0x18e
github.com/influxdata/influxdb/tsdb/engine/tsm1.(*compactionStrategy).compactGroup.func1(0xc56eba0930, 0xc570c61aa0, 0x2, 0x2, 0x0, 0x0, 0x0, 0x0, 0x0)
        /root/go/src/github.com/influxdata/influxdb/tsdb/engine/tsm1/engine.go:1072 +0x15d
github.com/influxdata/influxdb/tsdb/engine/tsm1.(*compactionStrategy).compactGroup(0xc56eba0930, 0x0)
        /root/go/src/github.com/influxdata/influxdb/tsdb/engine/tsm1/engine.go:1074 +0x47d
github.com/influxdata/influxdb/tsdb/engine/tsm1.(*compactionStrategy).Apply.func1(0xc57cac3380, 0xc56eba0930, 0x0)
        /root/go/src/github.com/influxdata/influxdb/tsdb/engine/tsm1/engine.go:1047 +0x5d
created by github.com/influxdata/influxdb/tsdb/engine/tsm1.(*compactionStrategy).Apply
        /root/go/src/github.com/influxdata/influxdb/tsdb/engine/tsm1/engine.go:1048 +0xcc
[I] 2017-01-27T08:27:13Z InfluxDB starting, version 1.2.0, branch master, commit b7bb7e8359642b6e071735b50ae41f5eb343fd42

I've synced a known good datadir and started up rc1 on it. It's not panicking any more, but it's still too early to tell if it's going to crash.

Updated the panic in the previous comment, as I only now realized it was missing the first line.

e-dard commented 7 years ago

@weshmashian thanks. We need to open a new issue for this panic.

jwilder commented 7 years ago

Should be fixed via #8348

mahaveer1707 commented 5 years ago

My Influx version is 1.5.2-1. I read this whole thread, and the only suggestion I understood was to change the shard duration, but it did not help.

My Influx instance goes down every hour or so, and this started happening just 2-3 weeks ago. I have 35 databases, each with a retention policy of 1 month and a shard duration of 2 weeks.
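For anyone triaging a similar setup, the shard layout and internal runtime state can be inspected from the `influx` shell with:

```sql
-- Show every shard, its retention policy, and its expiry time
SHOW SHARDS
-- Dump internal runtime statistics (goroutines, memory, write/query counters)
SHOW STATS
-- Dump build, system, and config diagnostics
SHOW DIAGNOSTICS
```

The output of `SHOW SHARDS` is what reveals whether old short-duration shards are still piling up, as discussed earlier in this thread.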

Attached some files for details

uname.txt diagnostics.txt iostat.txt shards.txt stats.txt