Closed: tehmaspc closed this issue 7 years ago
This stack trace is from the Influx cli running out of memory when unmarshaling the json response.
Yes. But it crashed the server. Same query on admin GUI - crashes influxd.
@tehmaspc could you also send the stack trace from the server?
@toddboom @jwilder
https://gist.github.com/joelegasse/605dc115cf0e553111c3077e4840aa47
@jsternberg @joelegasse any thoughts on this?
@tehmaspc Are those traces from the same exact moment? The client shouldn't get any data when the server OOMs (the failure point is well before any bytes are written out for the response). If they are, can you create a gist with more details (data directory size, series count, field keys, etc), reproduction steps, and the associated logs/traces?
The CLI will chunk the responses over the wire and then aggregate them as they are received, which will lead to your first OOM trace if you have lots of data. The admin UI just submits the query as you enter it; it is mostly meant for simple exploration and is not intended to be a robust interface.
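To make the aggregation point concrete, here's a rough sketch of what "aggregate them as they are received" implies for client memory. It assumes (as an illustration, not the actual CLI code) that chunks arrive as one JSON object per line with the usual `results`/`series`/`values` shape:

```python
import json

# Hypothetical chunked response body: one JSON chunk per line, mirroring
# (as an assumption) the shape of InfluxDB chunked query results.
chunked_body = "\n".join(
    json.dumps({"results": [{"series": [{"name": "cpu", "values": [[i, i * 2]]}]}]})
    for i in range(3)
)

def aggregate_chunks(body):
    """What the client effectively does: buffer every chunk's rows in one list.

    Peak memory is proportional to the *total* result size, which is why a
    SELECT * over a large series can OOM the client even though the server
    streamed the data in small pieces.
    """
    all_values = []
    for line in body.splitlines():
        chunk = json.loads(line)
        for result in chunk["results"]:
            for series in result.get("series", []):
                all_values.extend(series["values"])  # grows without bound
    return all_values

print(aggregate_chunks(chunked_body))  # → [[0, 0], [1, 2], [2, 4]]
```

The names and response shape here are illustrative; the point is only that buffering every chunk defeats the purpose of chunking.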
@jwilder Looking at the code in both areas, there are a few things we're doing with append
that can probably be cleaned up. But overall, I think it comes down to either requesting chunked responses to prevent the server OOM, or making sure there is sufficient memory for the amount of data being queried.
@tehmaspc What is your setting for `max-row-limit` for the http service? Try setting that to a lower value (to prevent the OOM) or adding a time range to your queries. Generally, a `SELECT * FROM thing` query is probably not a good idea, especially for retention policies with infinite duration and non-trivial amounts of data.
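For reference, `max-row-limit` is set in the `[http]` section of the influxd config file. A sketch (the surrounding keys and the value shown are illustrative assumptions, not verified defaults for your version):

```toml
[http]
  enabled = true
  bind-address = ":8086"
  # Cap the number of rows the server will buffer for a single query
  # response, so a runaway SELECT * is cut off instead of OOMing the server.
  max-row-limit = 10000
```

And a time-bounded variant of the query would look like `SELECT * FROM thing WHERE time > now() - 1h`, which bounds the result set instead of scanning the whole retention policy.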
In a more humorous analogy, `SELECT * FROM thing` may be a "simple" query, but so is `rm -rf /` :-)
@joelegasse your response to @jwilder is the same conclusion I've come to after playing around w/ InfluxDB for some time now and, as of yesterday, after going through previous GitHub issues and noticing that other folks have had memory-utilization issues w/ InfluxDB. It takes too long to check for data when a particular series is large - so perhaps paging by default makes sense. But I'm not a database design expert :)
As for the query used to describe this issue - an important fact is that it did work until yesterday, meaning the data has grown to the point where there is too much for either the CLI or the admin UI to pull back.
Regardless, I have no intention of using such an exhaustive query - it is something I've been doing to check the data coming into InfluxDB from my collectd agents while testing and building out an internal metrics stack for the company. My deep concern now is that queries, simple or not, can be run against InfluxDB and crash the entire server. That's definitely worrying, especially since InfluxDB is nearing a 1.0 release.
I'll look at `max-row-limit`, but Grafana will sit on top of this database and it will control the queries / etc. Whether the queries are expensive or not - which we are not arguing - the fact that the database can't handle them w/o panicking would make it difficult to build any graphing on top of the metrics we collect. We cannot somehow add time ranges to every graph and template we could potentially build in Grafana - at least that doesn't make sense to me.
I will repro this in just a bit - but I did successfully crash the server two times consecutively yesterday with that same `select *` query.
BTW - as of today (0.13.1) is there any influxd conf file tuning I can do to limit ballooning of memory / etc? Anything that might mitigate (not necessarily completely prevent) the above issue? I'm still learning InfluxDB internals.
Thanks guys!
Tehmasp Chaudhri
@tehmaspc Here's some discussion on limiting various aspects to prevent queries from consuming all the server's resources: #6024
The `max-memory` option was discussed and not implemented, as it would definitely be a complex change with many edge cases and pitfalls. Specifically, we would have to have some way to at least approximate the amount of memory allocated as the result of a given query, which would be a challenge in a concurrent, garbage-collected language such as Go. That's not to say we don't think it would be a valuable limit to add, just that it's not valuable enough at this time. It's better that we first address the current limitation of buffering the entire result set for queries.
`SELECT * FROM ...` can be run as a chunked query, and if it is processed by something that handles each chunk before reading the next, it shouldn't cause too big of a problem.
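A minimal sketch of "handle each chunk before reading the next", using the same assumed line-delimited JSON chunk shape as above (the function and shape are illustrative, not the actual client code):

```python
import json

def iter_chunks(lines):
    """Yield rows one chunk at a time instead of buffering them all.

    Only one decoded chunk is alive at a time, so peak memory is bounded by
    the chunk size rather than the total result size.
    """
    for line in lines:
        chunk = json.loads(line)
        for result in chunk["results"]:
            for series in result.get("series", []):
                yield from series["values"]

# Simulated stream of three chunks (in practice these arrive over HTTP).
stream = (
    json.dumps({"results": [{"series": [{"name": "cpu", "values": [[i]]}]}]})
    for i in range(3)
)

total = 0
for row in iter_chunks(stream):
    total += row[0]  # handle the row, then let it be garbage-collected
print(total)  # → 3
```

The design point is that the consumer never holds more than one chunk's worth of rows, which is what makes chunked queries safe for large scans.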
The chunked option may not help if the garbage collector doesn't free the memory from the previous chunks' results. It's still allocating a new slice, which can be very large, for every chunk. The out-of-memory error happened while trying to grow a slice, so that's a possible reason why it still ran out of memory on a simple query.
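As a rough illustration of why growing a slice spikes memory, here's a simulation that assumes simple capacity doubling (Go's real growth policy varies with slice size, so treat this as a simplification, not the runtime's actual behavior):

```python
def peak_during_growth(n, initial_cap=1):
    """Simulate append with capacity doubling and track peak live memory.

    While a slice of capacity C is being grown, the old backing array (C
    elements) and the new one (2C) coexist until the copy finishes, so the
    transient peak is ~3x the pre-grow capacity.
    """
    cap, peak, length = initial_cap, initial_cap, 0
    for _ in range(n):
        if length == cap:
            peak = max(peak, cap + 2 * cap)  # old + new arrays both alive
            cap *= 2
        length += 1
    return cap, peak

cap, peak = peak_during_growth(1_000_000)
print(cap, peak)  # → 1048576 1572864
```

So even if each chunk's result is eventually collectable, repeatedly growing a large result slice can transiently need well over the final size, which matches the OOM landing inside a slice grow.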
FWIW, we've begun using `query-timeout` (10s) as well to mitigate this issue somewhat.
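For reference, a sketch of where `query-timeout` is set; in 1.x-era configs it appears under the `[coordinator]` section (the section name for your exact version is an assumption, so check your config's comments):

```toml
[coordinator]
  # Abort any query running longer than this instead of letting it keep
  # consuming memory; "0s" disables the limit.
  query-timeout = "10s"
```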
Just want to chime in and say I am experiencing this as well. Gist here if it helps
This should no longer happen in recent versions of Influx.
We're running 0.13.0-1, and on a simple `select * cpu_value` query (collectd metrics) InfluxDB blows up w/ the following error:

(error trace truncated)

SHARDS:

(shard listing truncated)
SERVER:
Ubuntu 14.04.3 LTS / AWS c4.xlarge