Closed stevenh closed 8 years ago
Just put in a PR which should help with this: https://github.com/influxdata/influxdb/pull/7228
Another thing that could help significantly is to introduce the concept of handlers. By doing this the amount of contention on the lock could be controlled, e.g. having 10 write handlers, 1 snapshot handler, and 50 read handlers.
Each request type would be processed by sending requests to a buffered channel which the relevant handlers read from and process. If the buffer is full, the action could be rejected, preventing excessive backlog build-up, which in turn would keep memory usage bounded.
In addition to this, sharding the cache and reducing the impact of GC should also be looked at, as the current cache is likely to cause significant GC pauses due to its heavy use of pointers; see http://allegro.tech/2016/03/writing-fast-cache-service-in-go.html for details.
@stevenh Have you tested out 1.0 yet? There was a pull merged which may help with this: https://github.com/influxdata/influxdb/pull/7258. There is also some in flight work that might help: https://github.com/influxdata/influxdb/pull/7165
We're currently testing 1.0 yes but we haven't let a compaction happen yet.
Given the description of the problem in #7258 I don't believe that's the cause of this: it describes an unrecoverable situation, whereas stopping the incoming data allowed us to recover, so it seems unlikely to be related.
Also, I've just read through #7165 and it doesn't seem related either; neither PR addresses the underlying issues of high lock contention or the data race which can cause the cache size limit to be applied incorrectly.
@stevenh What size of writes are you sending? Are you batching writes or just sending many small ones? Can you provide some sample writes in line protocol format?
Also, compactions are not the same thing as retention policy expiring shards. Compactions happen all the time while writes are occurring on a shard. Snapshot compactions are the only ones that lock the cache, but they do not hold onto the locks while the compactions are run.
The large heap allocations from queries look like they may be querying against a large number of series, which may indicate a cardinality issue with your tags or a query that is not selective enough. There are also a number of goroutines gzipping responses, which is odd. Perhaps a slow query reader?
You may want to try enabling some of these limits to help mitigate some of the issues as well as pinpoint any problem queries:
```toml
[http]
max-connection-limit = 20

[coordinator]
max-concurrent-queries = 20
query-timeout = "1m"
log-queries-after = "30s"
max-select-series = 100000
```
@jwilder I'll add the settings to our setup later on today :)
In the meantime, I grabbed a backup of production data, put it against 1.0 in our dev environment, and set the retention down to 7 days. After ~30 minutes the retention kicked in and deleted ~271,000 series (kind of expected). From our graphing the retention took 8 minutes and 30 seconds to run, but during this time it completely stopped writing data to both the `_internal` database and the other(s).
I'm going to retest this shortly and grab heap, goroutine, blocks and vars while the node isn't graphing.
block.txt goroutine.txt heap.txt vars.txt
I managed to grab these while reducing the retention policy on the same backup we used during the test yesterday.
@liv3d I think I see part of the issue in one of those traces now. When the shard deletion kicks in, the shard is `Close`d, and a write lock is held during this time. The `Close` takes a while, and during that time stats collection gets blocked, which is why you don't see any data in `_internal` during this time.
Not sure why other writes are getting blocked yet though.
@liv3d I have a possible fix for this in #7314. Would you be able to test that PR out in your environment to see if it resolves this issue?
With regard to other writes getting blocked, the initial traces indicated that they weren't totally blocked; they were just progressing VERY slowly due to the cache lock contention.
@jwilder Running that build, I get panics when querying `_internal` or doing `SHOW STATS`; these are attached here.
@liv3d I think that panic is due to a recent change in master unrelated to this. I'll rebase the PR off of the 1.0 branch while we track down the cause of that panic.
The panic is a regression due to #7296
@liv3d The fix for that panic is in master now. If you pull down the updated #7314 PR, it should have that fix in there now.
@jwilder I'm not convinced that panic was unrelated. I tried `SHOW STATS` again and it still panics.
```
[danof@influxdb-1.labs ~]$ influxd version
InfluxDB v1.0.0 (git: jw-delete-shard 0063ca4842a83eb0290c06c3e48b9783c63d1869)
[danof@influxdb-1.labs ~]$ influx -execute "SHOW STATS"
ERR: SHOW STATS [panic:assignment to entry in nil map]
Warning: It is possible this error is due to not setting a database.
Please set a database with the command "use <database>".
SHOW STATS [panic:assignment to entry in nil map]
[danof@influxdb-1.labs ~]$
```
My box on 1.0 GA does:
```
[danof@influxdb-2.labs ~]$ influxd version
InfluxDB v1.0.0 (git: master 37992377a55fbc138b2c01edd4deffed64b53989)
[danof@influxdb-2.labs ~]$ influx -execute "SHOW STATS"
name: cq
--------
queryFail  queryOk
0          13972
...
```
@liv3d That's actually a different panic, in the stats collection for subscribers. It was introduced in #7177 and is now fixed with #7318. I rebased on top of that fix now. `master` has many changes for 1.1 right now, so it's more unstable. Can you test again?
The good news @jwilder is I can indeed run `SHOW STATS`. I'm going to grab some food and retest in a bit.
@jwilder This looks way better from a quick look at our retention alerting:
Showing the series drop
I'm also seeing memory usage at roughly a tenth of what it was on 1.0 GA.
@stevenh We'll need to do more testing during the week, but this actually looks good
@liv3d @stevenh Have you had a chance to test further?
@jwilder I've done some more testing and I'm pretty happy with the change; hopefully this can make the next release. I did notice that during shard retention inserts went from ~0.09s to ~2.50s, however none of them failed.
I'm a little bit late to the party on this one, but I've also been having problems dropping series / measurements on a roughly 75 GB database with just under 2M series. I tried to drop a measurement earlier today, which took a while, but eventually the command returned to the prompt. However, in the following minutes, influxd started to eat a ton of memory (32 GB + 4 GB swap) and eventually crashed. I've been very hesitant about deleting stuff because this behaviour certainly is not new.
Here's hoping that this patch improves things.
@dswarbrick #7165 may also help you out. If you are able to test it out, that would be very helpful.
That PR needs to be rebased w/ master though.
Bug report
InfluxDB: 1.0.0-beta3
OS: Linux 3.10.0-327.el7.x86_64
Steps to reproduce:
Expected behaviour: Compaction should have little impact on memory
Actual behaviour: Memory usage increases until the process runs out of memory.
Additional info: Looking into the details from pprof, it's clear that the issue is serious contention on the cache lock, enough so that all queries (read/write) to influx stall almost indefinitely while compaction takes place.
When incoming requests get behind, the problem spirals out of control with more and more memory used for query goroutines.
When we captured it we had 100k goroutines waiting on the cache lock in tsm1.newIntegerAscendingCursor, accounting for 129GB of active memory; this is very clear to see in heap.pdf below.
In our case we have a master/slave setup for reads, with this node being the slave, so it's only taking writes from influx-relay unless the primary is down, which it wasn't; we therefore believe all of this is write traffic, not read traffic.
Once we realised this, we disabled writes by taking the node out of influx-relay and it recovered in ~30 minutes, even from 312GB. At that point we saw a query complete which had taken 13555 minutes.
Machine Spec
Logs: block.txt goroutine2.txt heap2.txt vars.txt iostat.txt heap.pdf