influxdata / influxdb

Scalable datastore for metrics, events, and real-time analytics
https://influxdata.com
Apache License 2.0

InfluxDB 0.13.1 OOM on Simple Query #6728

Closed (tehmaspc closed this issue 7 years ago)

tehmaspc commented 8 years ago

We're running 0.13.0-1, and on a simple select * from cpu_value query (collectd metrics) InfluxDB blows up with the following error:

> select * from cpu_value
fatal error: runtime: out of memory

runtime stack:
runtime.throw(0x8439a0, 0x16)
    /usr/local/go/src/runtime/panic.go:547 +0x90
runtime.sysMap(0xc9c3a00000, 0x100000, 0x7f5d0a242c00, 0x9f56f8)
...

///

SHARDS:

> show shards
name: _internal
---------------
id  database    retention_policy    shard_group start_time      end_time        expiry_time     owners
12  _internal   monitor         12      2016-05-18T00:00:00Z    2016-05-19T00:00:00Z    2016-05-26T00:00:00Z
13  _internal   monitor         13      2016-05-19T00:00:00Z    2016-05-20T00:00:00Z    2016-05-27T00:00:00Z
14  _internal   monitor         14      2016-05-20T00:00:00Z    2016-05-21T00:00:00Z    2016-05-28T00:00:00Z
15  _internal   monitor         15      2016-05-21T00:00:00Z    2016-05-22T00:00:00Z    2016-05-29T00:00:00Z
16  _internal   monitor         16      2016-05-22T00:00:00Z    2016-05-23T00:00:00Z    2016-05-30T00:00:00Z
17  _internal   monitor         17      2016-05-23T00:00:00Z    2016-05-24T00:00:00Z    2016-05-31T00:00:00Z
19  _internal   monitor         19      2016-05-24T00:00:00Z    2016-05-25T00:00:00Z    2016-06-01T00:00:00Z
20  _internal   monitor         20      2016-05-25T00:00:00Z    2016-05-26T00:00:00Z    2016-06-02T00:00:00Z

name: collectd
--------------
id  database    retention_policy    shard_group start_time      end_time        expiry_time     owners
4   collectd    default         4       2016-05-09T00:00:00Z    2016-05-16T00:00:00Z    2016-05-16T00:00:00Z
10  collectd    default         10      2016-05-16T00:00:00Z    2016-05-23T00:00:00Z    2016-05-23T00:00:00Z
18  collectd    default         18      2016-05-23T00:00:00Z    2016-05-30T00:00:00Z    2016-05-30T00:00:00Z

///

STACK TRACE:

> select * from cpu_value
fatal error: runtime: out of memory

runtime stack:
runtime.throw(0x8439a0, 0x16)
    /usr/local/go/src/runtime/panic.go:547 +0x90
runtime.sysMap(0xc9c3a00000, 0x100000, 0x7f5d0a242c00, 0x9f56f8)
    /usr/local/go/src/runtime/mem_linux.go:206 +0x9b
runtime.(*mheap).sysAlloc(0x9dbd20, 0x100000, 0x0)
    /usr/local/go/src/runtime/malloc.go:429 +0x191
runtime.(*mheap).grow(0x9dbd20, 0x8, 0x0)
    /usr/local/go/src/runtime/mheap.go:651 +0x63
runtime.(*mheap).allocSpanLocked(0x9dbd20, 0x1, 0x7f5cf21835e8)
    /usr/local/go/src/runtime/mheap.go:553 +0x4f6
runtime.(*mheap).alloc_m(0x9dbd20, 0x1, 0x7, 0x7f5cf21835e8)
    /usr/local/go/src/runtime/mheap.go:437 +0x119
runtime.(*mheap).alloc.func1()
    /usr/local/go/src/runtime/mheap.go:502 +0x41
runtime.systemstack(0x7f5d0a242d60)
    /usr/local/go/src/runtime/asm_amd64.s:307 +0xab
runtime.(*mheap).alloc(0x9dbd20, 0x1, 0x10000000007, 0x40f6b4)
    /usr/local/go/src/runtime/mheap.go:503 +0x63
runtime.(*mcentral).grow(0x9dd300, 0x0)
    /usr/local/go/src/runtime/mcentral.go:209 +0x93
runtime.(*mcentral).cacheSpan(0x9dd300, 0x7f5cf21835e8)
    /usr/local/go/src/runtime/mcentral.go:89 +0x47d
runtime.(*mcache).refill(0x7f5d0b9fe960, 0xc900000007, 0x7f5cf21835e8)
    /usr/local/go/src/runtime/mcache.go:119 +0xcc
runtime.mallocgc.func2()
    /usr/local/go/src/runtime/malloc.go:642 +0x2b
runtime.systemstack(0xc820018000)
    /usr/local/go/src/runtime/asm_amd64.s:291 +0x79
runtime.mstart()
    /usr/local/go/src/runtime/proc.go:1051

goroutine 1 [running]:
runtime.systemstack_switch()
    /usr/local/go/src/runtime/asm_amd64.s:245 fp=0xc861374850 sp=0xc861374848
runtime.mallocgc(0x60, 0x6f2540, 0x0, 0xc9c39fccc0)
    /usr/local/go/src/runtime/malloc.go:643 +0x869 fp=0xc861374928 sp=0xc861374850
runtime.newarray(0x6f2540, 0x6, 0xc9c39fbd20)
    /usr/local/go/src/runtime/malloc.go:798 +0xc9 fp=0xc861374968 sp=0xc861374928
reflect.unsafe_NewArray(0x6f2540, 0x6, 0x6f2540)
    /usr/local/go/src/runtime/malloc.go:803 +0x2b fp=0xc861374988 sp=0xc861374968
reflect.MakeSlice(0x7f5d0b9b20a8, 0x6e5340, 0x4, 0x6, 0x0, 0x0, 0x0)
    /usr/local/go/src/reflect/value.go:2044 +0x237 fp=0xc8613749f8 sp=0xc861374988
encoding/json.(*decodeState).array(0xc9c1212ec8, 0x6e5340, 0xc9c39de9c8, 0x197)
    /usr/local/go/src/encoding/json/decode.go:507 +0x88f fp=0xc861374c00 sp=0xc8613749f8
encoding/json.(*decodeState).value(0xc9c1212ec8, 0x6e5340, 0xc9c39de9c8, 0x197)
    /usr/local/go/src/encoding/json/decode.go:364 +0x3c1 fp=0xc861374cd8 sp=0xc861374c00
encoding/json.(*decodeState).array(0xc9c1212ec8, 0x6e41a0, 0xc9c1e98450, 0x197)
    /usr/local/go/src/encoding/json/decode.go:518 +0xa6b fp=0xc861374ee0 sp=0xc861374cd8
encoding/json.(*decodeState).value(0xc9c1212ec8, 0x6e41a0, 0xc9c1e98450, 0x197)
    /usr/local/go/src/encoding/json/decode.go:364 +0x3c1 fp=0xc861374fb8 sp=0xc861374ee0
encoding/json.(*decodeState).object(0xc9c1212ec8, 0x7b21c0, 0xc9c1e98420, 0x199)
    /usr/local/go/src/encoding/json/decode.go:684 +0x116a fp=0xc861375360 sp=0xc861374fb8
encoding/json.(*decodeState).value(0xc9c1212ec8, 0x7b21c0, 0xc9c1e98420, 0x199)
    /usr/local/go/src/encoding/json/decode.go:367 +0x3a1 fp=0xc861375438 sp=0xc861375360
encoding/json.(*decodeState).array(0xc9c1212ec8, 0x6e5100, 0xc9c343e700, 0x197)
    /usr/local/go/src/encoding/json/decode.go:518 +0xa6b fp=0xc861375640 sp=0xc861375438
encoding/json.(*decodeState).value(0xc9c1212ec8, 0x6e5100, 0xc9c343e700, 0x197)
    /usr/local/go/src/encoding/json/decode.go:364 +0x3c1 fp=0xc861375718 sp=0xc861375640
encoding/json.(*decodeState).object(0xc9c1212ec8, 0x7940c0, 0xc9c343e700, 0x199)
    /usr/local/go/src/encoding/json/decode.go:684 +0x116a fp=0xc861375ac0 sp=0xc861375718
encoding/json.(*decodeState).value(0xc9c1212ec8, 0x6e14a0, 0xc9c343e700, 0x16)
    /usr/local/go/src/encoding/json/decode.go:367 +0x3a1 fp=0xc861375b98 sp=0xc861375ac0
encoding/json.(*decodeState).unmarshal(0xc9c1212ec8, 0x6e14a0, 0xc9c343e700, 0x0, 0x0)
    /usr/local/go/src/encoding/json/decode.go:168 +0x196 fp=0xc861375c70 sp=0xc861375b98
encoding/json.(*Decoder).Decode(0xc9c1212ea0, 0x6e14a0, 0xc9c343e700, 0x0, 0x0)
    /usr/local/go/src/encoding/json/stream.go:67 +0x274 fp=0xc861375cf0 sp=0xc861375c70
github.com/influxdata/influxdb/client.(*Result).UnmarshalJSON(0xc9b695f800, 0xc9c353e00c, 0x9befa, 0xffdf4, 0x0, 0x0)
    /root/go/src/github.com/influxdata/influxdb/client/influxdb.go:432 +0x1a2 fp=0xc861375dc8 sp=0xc861375cf0
encoding/json.(*decodeState).object(0xc9c1212d28, 0x798e60, 0xc9b695f800, 0x199)
    /usr/local/go/src/encoding/json/decode.go:560 +0x143 fp=0xc861376170 sp=0xc861375dc8
encoding/json.(*decodeState).value(0xc9c1212d28, 0x798e60, 0xc9b695f800, 0x199)
    /usr/local/go/src/encoding/json/decode.go:367 +0x3a1 fp=0xc861376248 sp=0xc861376170
encoding/json.(*decodeState).array(0xc9c1212d28, 0x6e50a0, 0xc9b6fdec30, 0x197)
    /usr/local/go/src/encoding/json/decode.go:518 +0xa6b fp=0xc861376450 sp=0xc861376248
encoding/json.(*decodeState).value(0xc9c1212d28, 0x6e50a0, 0xc9b6fdec30, 0x197)
    /usr/local/go/src/encoding/json/decode.go:364 +0x3c1 fp=0xc861376528 sp=0xc861376450
encoding/json.(*decodeState).object(0xc9c1212d28, 0x77fd00, 0xc9b6fdec30, 0x199)
    /usr/local/go/src/encoding/json/decode.go:684 +0x116a fp=0xc8613768d0 sp=0xc861376528
encoding/json.(*decodeState).value(0xc9c1212d28, 0x6e1440, 0xc9b6fdec30, 0x16)
    /usr/local/go/src/encoding/json/decode.go:367 +0x3a1 fp=0xc8613769a8 sp=0xc8613768d0
encoding/json.(*decodeState).unmarshal(0xc9c1212d28, 0x6e1440, 0xc9b6fdec30, 0x0, 0x0)
    /usr/local/go/src/encoding/json/decode.go:168 +0x196 fp=0xc861376a80 sp=0xc8613769a8
encoding/json.(*Decoder).Decode(0xc9c1212d00, 0x6e1440, 0xc9b6fdec30, 0x0, 0x0)
    /usr/local/go/src/encoding/json/stream.go:67 +0x274 fp=0xc861376b00 sp=0xc861376a80
github.com/influxdata/influxdb/client.(*Response).UnmarshalJSON(0xc9b6fdeba0, 0xc8202c6001, 0x9bf08, 0xffdff, 0x0, 0x0)
    /root/go/src/github.com/influxdata/influxdb/client/influxdb.go:476 +0x1a2 fp=0xc861376bd8 sp=0xc861376b00
encoding/json.(*decodeState).object(0xc820142d28, 0x7a4060, 0xc9b6fdeba0, 0x16)
    /usr/local/go/src/encoding/json/decode.go:560 +0x143 fp=0xc861376f80 sp=0xc861376bd8
encoding/json.(*decodeState).value(0xc820142d28, 0x7a4060, 0xc9b6fdeba0, 0x16)
    /usr/local/go/src/encoding/json/decode.go:367 +0x3a1 fp=0xc861377058 sp=0xc861376f80
encoding/json.(*decodeState).unmarshal(0xc820142d28, 0x7a4060, 0xc9b6fdeba0, 0x0, 0x0)
    /usr/local/go/src/encoding/json/decode.go:168 +0x196 fp=0xc861377130 sp=0xc861377058
encoding/json.(*Decoder).Decode(0xc820142d00, 0x7a4060, 0xc9b6fdeba0, 0x0, 0x0)
    /usr/local/go/src/encoding/json/stream.go:67 +0x274 fp=0xc8613771b0 sp=0xc861377130
github.com/influxdata/influxdb/client.(*ChunkedResponse).NextResponse(0xc8613772a0, 0xc9a2b78f40, 0x0, 0x0)
    /root/go/src/github.com/influxdata/influxdb/client/influxdb.go:517 +0x69 fp=0xc861377208 sp=0xc8613771b0
github.com/influxdata/influxdb/client.(*Client).Query(0xc8200c4180, 0xc8200ea900, 0x17, 0xc82000bfe4, 0x8, 0x1, 0x0, 0x0, 0x0, 0x0)
    /root/go/src/github.com/influxdata/influxdb/client/influxdb.go:203 +0x966 fp=0xc861377520 sp=0xc861377208
github.com/influxdata/influxdb/cmd/influx/cli.(*CommandLine).ExecuteQuery(0xc82008a280, 0xc8200ea900, 0x17, 0x0, 0x0)
    /root/go/src/github.com/influxdata/influxdb/cmd/influx/cli/cli.go:538 +0x11a fp=0xc8613776b8 sp=0xc861377520
github.com/influxdata/influxdb/cmd/influx/cli.(*CommandLine).ParseCommand(0xc82008a280, 0xc8200ea900, 0x17, 0x0, 0x0)
    /root/go/src/github.com/influxdata/influxdb/cmd/influx/cli/cli.go:250 +0x329 fp=0xc861377780 sp=0xc8613776b8
github.com/influxdata/influxdb/cmd/influx/cli.(*CommandLine).mainLoop(0xc82008a280, 0x0, 0x0)
    /root/go/src/github.com/influxdata/influxdb/cmd/influx/cli/cli.go:203 +0x229 fp=0xc8613778c8 sp=0xc861377780
github.com/influxdata/influxdb/cmd/influx/cli.(*CommandLine).Run(0xc82008a280, 0x0, 0x0)
    /root/go/src/github.com/influxdata/influxdb/cmd/influx/cli/cli.go:181 +0x1483 fp=0xc861377da8 sp=0xc8613778c8
main.main()
    /root/go/src/github.com/influxdata/influxdb/cmd/influx/main.go:112 +0x964 fp=0xc861377f30 sp=0xc861377da8
runtime.main()
    /usr/local/go/src/runtime/proc.go:188 +0x2b0 fp=0xc861377f80 sp=0xc861377f30
runtime.goexit()
    /usr/local/go/src/runtime/asm_amd64.s:1998 +0x1 fp=0xc861377f88 sp=0xc861377f80

goroutine 17 [syscall, 2 minutes, locked to thread]:
runtime.goexit()
    /usr/local/go/src/runtime/asm_amd64.s:1998 +0x1

goroutine 5 [syscall, 2 minutes]:
os/signal.signal_recv(0x7f5d0b9ae078)
    /usr/local/go/src/runtime/sigqueue.go:116 +0x132
os/signal.loop()
    /usr/local/go/src/os/signal/signal_unix.go:22 +0x18
created by os/signal.init.1
    /usr/local/go/src/os/signal/signal_unix.go:28 +0x37

goroutine 6 [select, 2 minutes, locked to thread]:
runtime.gopark(0x89feb8, 0xc820026728, 0x7fb810, 0x6, 0x18, 0x2)
    /usr/local/go/src/runtime/proc.go:262 +0x163
runtime.selectgoImpl(0xc820026728, 0x0, 0x18)
    /usr/local/go/src/runtime/select.go:392 +0xa67
runtime.selectgo(0xc820026728)
    /usr/local/go/src/runtime/select.go:215 +0x12
runtime.ensureSigM.func1()
    /usr/local/go/src/runtime/signal1_unix.go:279 +0x358
runtime.goexit()
    /usr/local/go/src/runtime/asm_amd64.s:1998 +0x1

goroutine 36 [select, 2 minutes]:
net/http.(*persistConn).readLoop(0xc8200fc270)
    /usr/local/go/src/net/http/transport.go:1182 +0xd52
created by net/http.(*Transport).dialConn
    /usr/local/go/src/net/http/transport.go:857 +0x10a6

goroutine 37 [select, 2 minutes]:
net/http.(*persistConn).writeLoop(0xc8200fc270)
    /usr/local/go/src/net/http/transport.go:1277 +0x472
created by net/http.(*Transport).dialConn
    /usr/local/go/src/net/http/transport.go:858 +0x10cb

SERVER:

Ubuntu 14.04.3 LTS / AWS c4.xlarge

jwilder commented 8 years ago

This stack trace is from the Influx cli running out of memory when unmarshaling the json response.

tehmaspc commented 8 years ago

Yes, but it crashed the server. The same query from the admin GUI also crashes influxd.

toddboom commented 8 years ago

@tehmaspc could you also send the stack trace from the server?

tehmaspc commented 8 years ago

@toddboom @jwilder

https://gist.github.com/joelegasse/605dc115cf0e553111c3077e4840aa47

jwilder commented 8 years ago

@jsternberg @joelegasse any thoughts on this?

joelegasse commented 8 years ago

@tehmaspc Are those traces from the same exact moment? The client shouldn't get any data when the server OOMs (the failure point is well before any bytes are written out for the response). If they are, can you create a gist with more details (data directory size, series count, field keys, etc), reproduction steps, and the associated logs/traces?

The CLI will chunk the responses over the wire and then aggregate them as they are received, which will lead to your first OOM trace if you have lots of data. The admin UI just submits the query as you enter it; it is mostly meant for simple exploration and is not meant to be a robust interface.

@jwilder Looking at the code in both areas, there are a few things we're doing with append that can probably be cleaned up. But overall, I think it comes down to either requesting chunked responses to prevent the server OOM or making sure there is sufficient memory for the amount of data being queried.

@tehmaspc What is your setting for max-row-limit for the http service? Try setting that to a lower value (to prevent the OOM) or adding a time-range to your queries. Generally, a SELECT * FROM thing query is probably not a good idea, especially for retention policies with infinite duration and non-trivial amounts of data.
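
For readers hitting the same wall, a rough sketch of the two mitigations being suggested, assuming the 0.13-era config layout (max-row-limit sits under the [http] section; the values are illustrative, not recommendations):

# /etc/influxdb/influxdb.conf (excerpt)
[http]
  # Upper bound on the rows returned for a single query result; a lower
  # value keeps the server from buffering an unbounded response.
  max-row-limit = 10000

And a time-bounded query in place of a full-measurement scan:

> SELECT * FROM cpu_value WHERE time > now() - 1h LIMIT 1000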

In a more humorous analogy, SELECT * FROM thing may be a "simple" query, but so is rm -rf / :-)

tehmaspc commented 8 years ago

@joelegasse your response to @jwilder is the same conclusion I've come to after playing around w/ InfluxDB for some time now and, as of yesterday, going through previous GitHub issues and noticing that other folks have had issues w/ InfluxDB and memory utilization. It takes too long to check the data when a particular series is large - so perhaps paging by default makes sense. But I'm not a database design expert :)

As for the query used to describe this issue - an important fact is that it did work until yesterday, meaning the data has grown to the point where there is too much for either the CLI or the admin UI to pull back.

Regardless, I have no intention of using such an exhaustive query - it is just something I've been doing to check the data coming into InfluxDB from my collectd agents while testing and building out an internal metrics stack for the company. My deep concern now is that queries, simple or not, can be run against InfluxDB and crash the entire server. That's definitely worrying, especially since InfluxDB is nearing a 1.0 release.

I'll look at max-row-limit, but Grafana will sit on top of this database and it will control the queries, etc. Whether or not the queries are expensive - which we are not arguing - the fact that the database can't handle them w/o panic'ing would make it difficult to build any graphing on top of the metrics we collect. We cannot somehow add time ranges to all the graphs and templates we could potentially build in Grafana - at least that doesn't make sense to me.

I will repro this in just a bit - but I did successfully crash the server two times consecutively yesterday with that same select * query.

BTW - as of today (0.13.1) is there any influxd conf file tuning I can do to limit ballooning of memory / etc? Anything that might mitigate (not necessarily completely prevent) the above issue? I'm still learning InfluxDB internals.

Thanks guys!

Tehmasp Chaudhri

joelegasse commented 8 years ago

@tehmaspc Here's some discussion on limiting various aspects to prevent queries from consuming all the server's resources: #6024

The max-memory option was discussed and not implemented, as it would definitely be a complex change with many edge cases and pitfalls. Specifically, we would have to have some way to at least approximate the amount of memory allocated as the result of a given query, which would be a challenge in a concurrent, garbage-collected language such as Go. That's not to say we don't think it would be a valuable limit to add, just that it's not valuable enough at this time. For now it's better that we address the real limitation here: buffering the entire result set for queries.

SELECT * FROM ... can be run as a chunked query, and if it's processed by something that handles each chunk before reading the next, it shouldn't cause too big of a problem.
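
To make that concrete, here is a minimal sketch (not the actual CLI code) of consuming a chunked query through the HTTP /query endpoint and handling each chunk as it arrives instead of aggregating the whole result in memory. The endpoint address, database, and measurement are just the ones from this issue, and the struct only declares the response fields the example needs:

package main

import (
    "encoding/json"
    "fmt"
    "io"
    "net/http"
    "net/url"
)

// chunk mirrors just enough of one chunked /query response object for
// this sketch; the field names follow the documented JSON response format.
type chunk struct {
    Results []struct {
        Series []struct {
            Name   string          `json:"name"`
            Values [][]interface{} `json:"values"`
        } `json:"series"`
    } `json:"results"`
}

func main() {
    params := url.Values{}
    params.Set("db", "collectd")
    params.Set("q", "SELECT * FROM cpu_value")
    params.Set("chunked", "true") // ask the server to stream a series of JSON objects

    resp, err := http.Get("http://localhost:8086/query?" + params.Encode())
    if err != nil {
        panic(err)
    }
    defer resp.Body.Close()

    dec := json.NewDecoder(resp.Body)
    rows := 0
    for {
        var c chunk
        if err := dec.Decode(&c); err == io.EOF {
            break
        } else if err != nil {
            panic(err)
        }
        // Use this chunk, then drop it, instead of appending every row
        // into one ever-growing in-memory result.
        for _, r := range c.Results {
            for _, s := range r.Series {
                rows += len(s.Values)
            }
        }
    }
    fmt.Println("rows processed:", rows)
}

The point is just the loop shape: decode one chunk, consume it, and let it go before reading the next.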

jsternberg commented 8 years ago

The chunked option may not help if the garbage collector doesn't free the memory from the previous chunked results. It's still allocating a new slice that can be very large for every chunk. It seems like the out of memory error happened when trying to grow a slice so that's a possible reason why it still ran out of memory on a simple query.

tehmaspc commented 8 years ago

FWIW, we've begun using query-timeout (10s) as well to mitigate this issue somewhat.
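
For reference, that setting looks roughly like this, assuming the 0.13-era config layout where it sits under [cluster] (later releases rename the section to [coordinator]); the value is simply the one mentioned above:

[cluster]
  # Abort any query that runs longer than this; "0s" disables the limit.
  query-timeout = "10s"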

royalaid commented 8 years ago

Just want to chime in and say I am experiencing this as well. Gist here if it helps.

e-dard commented 7 years ago

This should no longer happen in recent versions of Influx.