influxdata / influxdb

Scalable datastore for metrics, events, and real-time analytics
https://influxdata.com
Apache License 2.0

influxdb out of memory #17055

Open zehuaiWANG opened 4 years ago

zehuaiWANG commented 4 years ago

Hi~ I have a problem and wonder if anyone here could help me. I am using InfluxDB 1.7.4.

  1. I am using the tsi1 index:

     $ cat influxd.log | grep -oP 'index_version=(inmem|tsi1)' | sort | uniq -c
     20208 index_version=tsi1

  2. Memory usage doesn't look too high in top:

     top - 20:25:14 up 9 days, 12:10, 2 users, load average: 3.95, 3.87, 4.12
     Tasks: 320 total, 1 running, 319 sleeping, 0 stopped, 0 zombie
     Cpu(s): 27.5%us, 4.1%sy, 0.0%ni, 67.8%id, 0.3%wa, 0.0%hi, 0.3%si, 0.0%st
     Mem: 131387368k total, 129151756k used, 2235612k free, 244604k buffers
     Swap: 0k total, 0k used, 0k free, 47405776k cached

     PID   USER     PR NI VIRT RES SHR S %CPU  %MEM TIME+     COMMAND
     24719 influxdb 20 0  763g 98g 22g S 763.0 78.6 621:26.17 influxd

  3. but InfluxDB hits OOM again and again

  4. I found the following panic in the log:

    fatal error: runtime: out of memory

runtime stack:
runtime.throw(0x130f93b, 0x16) /usr/local/go/src/runtime/panic.go:608 +0x72
runtime.sysMap(0xdd00000000, 0x8000000, 0x231ef38) /usr/local/go/src/runtime/mem_linux.go:156 +0xc7
runtime.(*mheap).sysAlloc(0x2305180, 0x8000000, 0x0, 0x7f9f27ffecc0) /usr/local/go/src/runtime/malloc.go:619 +0x1c7
runtime.(*mheap).grow(0x2305180, 0x2020, 0x0) /usr/local/go/src/runtime/mheap.go:920 +0x42
runtime.(*mheap).allocSpanLocked(0x2305180, 0x2020, 0x231ef48, 0x20373f00000000) /usr/local/go/src/runtime/mheap.go:848 +0x337
runtime.(*mheap).alloc_m(0x2305180, 0x2020, 0x410101, 0x7f52a483ef00) /usr/local/go/src/runtime/mheap.go:692 +0x119
runtime.(*mheap).alloc.func1() /usr/local/go/src/runtime/mheap.go:759 +0x4c
runtime.(*mheap).alloc(0x2305180, 0x2020, 0x7f6ff7010101, 0x128) /usr/local/go/src/runtime/mheap.go:758 +0x8a
runtime.largeAlloc(0x403f000, 0x7f9f27ff0101, 0x459c4a) /usr/local/go/src/runtime/malloc.go:1019 +0x97
runtime.mallocgc.func1() /usr/local/go/src/runtime/malloc.go:914 +0x46
runtime.systemstack(0x0) /usr/local/go/src/runtime/asm_amd64.s:351 +0x66
runtime.mstart() /usr/local/go/src/runtime/proc.go:1229

goroutine 100938261 [running]:
runtime.systemstack_switch() /usr/local/go/src/runtime/asm_amd64.s:311 fp=0xdc35f95138 sp=0xdc35f95130 pc=0x45b890
runtime.mallocgc(0x403f000, 0x10d6680, 0xdcfd9fc701, 0xdc35f95210) /usr/local/go/src/runtime/malloc.go:913 +0x896 fp=0xdc35f951d8 sp=0xdc35f95138 pc=0x40def6
runtime.makeslice(0x10d6680, 0x403f000, 0x403f000, 0xdc35f95330, 0xf769e4, 0xdccae6e6c0) /usr/local/go/src/runtime/slice.go:70 +0x77 fp=0xdc35f95208 sp=0xdc35f951d8 pc=0x444907
bytes.makeSlice(0x403f000, 0x0, 0x0, 0x0) /usr/local/go/src/bytes/buffer.go:231 +0x6d fp=0xdc35f95248 sp=0xdc35f95208 pc=0x4fa3ad
bytes.(*Buffer).grow(0xd246c29960, 0x1000, 0xdc35f95348) /usr/local/go/src/bytes/buffer.go:144 +0x15a fp=0xdc35f95298 sp=0xdc35f95248 pc=0x4f9d1a
bytes.(*Buffer).Write(0xd246c29960, 0xdccdf06000, 0x1000, 0x1000, 0x0, 0x5, 0xc6087d7370) /usr/local/go/src/bytes/buffer.go:174 +0xdc fp=0xdc35f952c8 sp=0xdc35f95298 pc=0x4f9ffc
bufio.(*Writer).Flush(0xdccb3b0e40, 0x7ed6b65f5a76, 0x43) /usr/local/go/src/bufio/bufio.go:575 +0x75 fp=0xdc35f95328 sp=0xdc35f952c8 pc=0x5211c5
bufio.(*Writer).Write(0xdccb3b0e40, 0x7ed6b65f5a76, 0x13a, 0x1461fe3, 0x2, 0x0, 0x0) /usr/local/go/src/bufio/bufio.go:611 +0xeb fp=0xdc35f95388 sp=0xdc35f95328 pc=0x52143b
github.com/influxdata/influxdb/tsdb/engine/tsm1.(*directIndex).flush(0xdccdd64360, 0x164e840, 0xdccb3b0e40, 0x7ec05342d9a0, 0x13a, 0x12d345f) /go/src/github.com/influxdata/influxdb/tsdb/engine/tsm1/writer.go:466 +0x1f0 fp=0xdc35f95498 sp=0xdc35f95388 pc=0xfea040
github.com/influxdata/influxdb/tsdb/engine/tsm1.(*directIndex).Add(0xdccdd64360, 0x7ec05342d9a0, 0x13a, 0x12d345f, 0x0, 0x15f857ed9c20f2c0, 0x15f857f498449ec0, 0x3c8fe4, 0xdc00000024) /go/src/github.com/influxdata/influxdb/tsdb/engine/tsm1/writer.go:314 +0x337 fp=0xdc35f95548 sp=0xdc35f95498 pc=0xfe9267
github.com/influxdata/influxdb/tsdb/engine/tsm1.(*tsmWriter).WriteBlock(0xdccb3b0ec0, 0x7ec05342d9a0, 0x13a, 0x12d345f, 0x15f857ed9c20f2c0, 0x15f857f498449ec0, 0xdcfd9b5d10, 0x20, 0x29, 0x0, ...) /go/src/github.com/influxdata/influxdb/tsdb/engine/tsm1/writer.go:686 +0x1eb fp=0xdc35f955b8 sp=0xdc35f95548 pc=0xfeb9db
github.com/influxdata/influxdb/tsdb/engine/tsm1.(*Compactor).write(0xcadeea85a0, 0xd16cb20140, 0x47, 0x166a0e0, 0xdccae6e6c0, 0xdccce18e01, 0x0, 0x0) /go/src/github.com/influxdata/influxdb/tsdb/engine/tsm1/compact.go:1149 +0x283 fp=0xdc35f956c8 sp=0xdc35f955b8 pc=0xf83473
github.com/influxdata/influxdb/tsdb/engine/tsm1.(*Compactor).writeNewFiles(0xcadeea85a0, 0x3e1, 0x2, 0xdccab0e680, 0x8, 0x8, 0x166a0e0, 0xdccae6e6c0, 0x1, 0x0, ...) /go/src/github.com/influxdata/influxdb/tsdb/engine/tsm1/compact.go:1032 +0x1a5 fp=0xdc35f95780 sp=0xdc35f956c8 pc=0xf82d85
github.com/influxdata/influxdb/tsdb/engine/tsm1.(*Compactor).compact(0xcadeea85a0, 0x1522300, 0xdccab0e680, 0x8, 0x8, 0x0, 0x0, 0x0, 0x0, 0x0) /go/src/github.com/influxdata/influxdb/tsdb/engine/tsm1/compact.go:940 +0x407 fp=0xdc35f958a8 sp=0xdc35f95780 pc=0xf81fd7
github.com/influxdata/influxdb/tsdb/engine/tsm1.(*Compactor).CompactFull(0xcadeea85a0, 0xdccab0e680, 0x8, 0x8, 0x0, 0x0, 0x0, 0x0, 0x0) /go/src/github.com/influxdata/influxdb/tsdb/engine/tsm1/compact.go:958 +0x180 fp=0xdc35f95958 sp=0xdc35f958a8 pc=0xf82460
github.com/influxdata/influxdb/tsdb/engine/tsm1.(*compactionStrategy).compactGroup(0xd246c298f0) /go/src/github.com/influxdata/influxdb/tsdb/engine/tsm1/engine.go:2152 +0x10ba fp=0xdc35f95f40 sp=0xdc35f95958 pc=0xfa8eca
github.com/influxdata/influxdb/tsdb/engine/tsm1.(*compactionStrategy).Apply(0xd246c298f0) /go/src/github.com/influxdata/influxdb/tsdb/engine/tsm1/engine.go:2129 +0x4d fp=0xdc35f95f88 sp=0xdc35f95f40 pc=0xfa7dbd
github.com/influxdata/influxdb/tsdb/engine/tsm1.(*Engine).compactHiPriorityLevel.func1(0xceddda8490, 0xd7aaf730e0, 0x1, 0xd246c298f0) /go/src/github.com/influxdata/influxdb/tsdb/engine/tsm1/engine.go:2046 +0xe0 fp=0xdc35f95fc0 sp=0xdc35f95f88 pc=0xff4000
runtime.goexit() /usr/local/go/src/runtime/asm_amd64.s:1333 +0x1 fp=0xdc35f95fc8 sp=0xdc35f95fc0 pc=0x45d971
created by github.com/influxdata/influxdb/tsdb/engine/tsm1.(*Engine).compactHiPriorityLevel /go/src/github.com/influxdata/influxdb/tsdb/engine/tsm1/engine.go:2041 +0x123

goroutine 1 [chan receive, 36 minutes]:
main.(*Main).Run(0xc00039bf58, 0xc00003a060, 0x4, 0x4, 0xc00039bf68, 0x1009516) /go/src/github.com/influxdata/influxdb/cmd/influxd/main.go:90 +0x2d1
main.main() /go/src/github.com/influxdata/influxdb/cmd/influxd/main.go:45 +0x12f

goroutine 5 [syscall, 41 minutes]:
os/signal.signal_recv(0x0) /usr/local/go/src/runtime/sigqueue.go:139 +0x9c
os/signal.loop() /usr/local/go/src/os/signal/signal_unix.go:23 +0x22
created by os/signal.init.0 /usr/local/go/src/os/signal/signal_unix.go:29 +0x41
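
Worth noting from this trace: the failed allocation happens inside a full TSM compaction (Compactor.CompactFull → Compactor.write → directIndex.flush), i.e. while the TSM writer is growing the in-memory buffer for the new file's index. If compaction memory pressure is suspected, the [data] section of the 1.7 influxdb.conf exposes settings that bound compaction concurrency and throughput. The values below are only a sketch for experimentation, not settings confirmed in this thread to fix the OOM:

    [data]
      # Cap how many level/full compactions may run at once.
      # The default of 0 uses half of the available CPU cores.
      max-concurrent-compactions = 2

      # Throttle how fast completed compactions are written to disk (defaults are "48m").
      compact-throughput = "48m"
      compact-throughput-burst = "48m"

      # Maximum size a shard's in-memory cache may reach before writes are rejected.
      cache-max-memory-size = "1g"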

russorat commented 4 years ago

@zehuaiWANG thanks for the issue. Could you upgrade to the latest 1.7.10 and see if the issue is still there?

zehuaiWANG commented 4 years ago

@russorat Thank you for helping me. From my Prometheus side I see warnings like:

    msg="Error sending samples to remote storage" count=100 err="server returned HTTP status 500 Internal Server Error: {\"error\":\"engine: error syncing wal\"}"

and when I look at InfluxDB, it also shows some warnings (screenshot attached).

I also use an SSD, and it doesn't seem to have high I/O usage:

    $ iostat -xzt 1
    03/03/20 15:24:19
    avg-cpu:  %user  %nice  %system  %iowait  %steal  %idle
              22.83   0.00     4.46     6.31    0.00  66.40

    Device:   rrqm/s  wrqm/s       r/s      w/s     rsec/s     wsec/s  avgrq-sz  avgqu-sz  await  r_await  w_await  svctm  %util
    nvme0n1     0.00    0.00  12939.36  4435.38  169009.21  129051.84     17.15      3.41   0.19     0.17     0.26   0.02  26.49
    sda         0.25   10.65     19.66     3.87     560.52     115.64     28.74      0.00   0.13     0.11     0.19   0.07   0.16

    03/03/20 15:24:20
    avg-cpu:  %user  %nice  %system  %iowait  %steal  %idle
              10.25   0.00     1.45     0.12    0.00  88.17

    Device:   rrqm/s  wrqm/s       r/s      w/s     rsec/s     wsec/s  avgrq-sz  avgqu-sz  await  r_await  w_await  svctm  %util
    nvme0n1     0.00    0.00      0.00  2638.00       0.00  108352.00     41.07      1.78   0.68     0.00     0.68   0.02   5.60
    sda         0.00    6.00      0.00     4.00       0.00      80.00     20.00      0.00   0.00     0.00     0.00   0.00   0.00

    03/03/20 15:24:21
    avg-cpu:  %user  %nice  %system  %iowait  %steal  %idle
              12.42   0.00     0.79     0.38    0.00  86.41

    Device:   rrqm/s  wrqm/s       r/s      w/s     rsec/s     wsec/s  avgrq-sz  avgqu-sz  await  r_await  w_await  svctm  %util
    nvme0n1     0.00    0.00      0.00  3005.00       0.00  192544.00     64.07      4.77   1.59     0.00     1.59   0.03   8.00
    sda         0.00    4.00      0.00     2.00       0.00      48.00     24.00      0.00   0.00     0.00     0.00   0.00   0.00

I don't know why this happens. Could you help me? Thanks a lot.

russorat commented 4 years ago

@zehuaiWANG thanks for the info. Digging through other issues that mention error syncing wal, it sounds like the first place to check is whether your disks are saturated around the same time you are getting this error: https://github.com/influxdata/influxdb/issues/9544. From the info above, the timestamps don't line up, so that might be something to check.

The other thing to check is your wal-fsync-delay value in your influx config: https://github.com/influxdata/influxdb/issues/8758

zehuaiWANG commented 4 years ago

Hi @russorat, thank you for your reply. I have read the two issues above. I checked the I/O situation and utilization was only around 15%, and I am using an SSD. I also read the description of the wal-fsync-delay value, but it only explains what to set for non-SSD disks. What value would be appropriate if I use an SSD?

    # The amount of time that a write will wait before fsyncing. A duration
    # greater than 0 can be used to batch up multiple fsync calls. This is useful for slower
    # disks or when WAL write contention is seen. A value of 0s fsyncs every write to the WAL.
    # Values in the range of 0-100ms are recommended for non-SSD disks.
    # wal-fsync-delay = "0s"
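
For what it's worth, the comment quoted above only gives guidance for non-SSD disks: on an SSD the default of "0s" (fsync on every WAL write) is usually left in place, and a non-zero delay is mainly an experiment to batch fsyncs when WAL write contention (such as the "error syncing wal" above) is suspected. A hedged sketch of such an experiment in the [data] section of influxdb.conf; the 10ms value is purely illustrative, not a confirmed recommendation:

    [data]
      # Default, typically kept on SSDs: fsync every WAL write.
      # wal-fsync-delay = "0s"

      # Illustrative experiment only: allow fsyncs to be batched for up to 10ms
      # when WAL write contention is suspected.
      wal-fsync-delay = "10ms"
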
gelldur commented 4 years ago

I have a similar situation with the 1.7.10 alpine Docker image.

stack trace.txt

orgads commented 4 years ago

Same here with 1.7.10 docker.