Closed stevenh closed 8 years ago
Just put in a PR which should help with this: https://github.com/influxdata/influxdb/pull/7228
Another thing that could help significantly is to introduce the concept of handlers. By doing this the amount of contention on the lock could be controlled, e.g. having 10 write handlers, 1 snapshot handler, and 50 read handlers.
Each request type would be processed by sending requests to a buffered channel which the relevant handlers read from and process. If the buffer is full, the action could be rejected, preventing excessive backlog build-up, which in turn would keep memory usage bounded.
In addition to this, sharding the cache and reducing the impact of GC should also be looked at, as the current cache is likely to cause significant GC pauses due to its heavy use of pointers; see http://allegro.tech/2016/03/writing-fast-cache-service-in-go.html for details.
@stevenh Have you tested out 1.0 yet? There was a pull merged which may help with this: https://github.com/influxdata/influxdb/pull/7258. There is also some in flight work that might help: https://github.com/influxdata/influxdb/pull/7165
We're currently testing 1.0 yes but we haven't let a compaction happen yet.
Given the description of the problem in #7258 I don't believe that's the cause of this: it describes an unrecoverable situation, whereas stopping the incoming data allowed us to recover, so it seems unlikely to be related.
Also, I've just read through #7165 and it doesn't seem related either; neither PR addresses the underlying issues of high lock contention or the data race which can cause the cache size limit to be applied incorrectly.
@stevenh What size of writes are you sending? Are you batching writes or just sending many small ones? Can you provide some sample writes in line protocol format?
Also, compactions are not the same thing as retention policy expiring shards. Compactions happen all the time while writes are occurring on a shard. Snapshot compactions are the only ones that lock the cache, but they do not hold onto the locks while the compactions are run.
The large heap allocations from queries look like they may be querying against a large number of series, which may indicate a cardinality issue with your tags or a query that is not selective enough. There are also a number of goroutines gzipping responses, which is odd. Perhaps a slow query reader?
You may want to try enabling some of these limits to help mitigate some of the issues as well as pinpoint any problem queries:
```toml
[http]
max-connection-limit = 20

[coordinator]
max-concurrent-queries = 20
query-timeout = "1m"
log-queries-after = "30s"
max-select-series = 100000
```
@jwilder I'll add the settings to our setup later on today :)
In the meantime, I grabbed a backup of production data, put it against 1.0 in our dev environment, and set the retention down to 7 days. After ~30 minutes the retention kicked in and deleted ~271,000 series (kind of expected). From our graphing the retention took 8 minutes and 30 seconds to run, but during this time it completely stopped writing data to both the `_internal` database and the other(s).
I'm going to retest this shortly and grab heap, goroutine, blocks and vars while the node isn't graphing.
block.txt goroutine.txt heap.txt vars.txt
I managed to grab these while reducing the retention policy on the same backup we used during the test yesterday.
@liv3d I think I see part of the issue in one of those traces now. When the shard deletion kicks in, the shard is `Close`d, and a write lock is held during this time. The `Close` takes a while, and during that time stats collection gets blocked, which is why you don't see any data in `_internal` during this time.
Not sure why other writes are getting blocked yet though.
@liv3d I have a possible fix for this in #7314. Would you be able to test that PR out in your environment to see if it resolves this issue?
With regard to other writes getting blocked, the initial traces indicated that they weren't totally blocked; they were just progressing VERY slowly due to the cache lock contention.
@jwilder Running that build, I get panics when querying `_internal` or doing `SHOW STATS`; these are attached here.
@liv3d I think that panic is due to a recent change in master unrelated to this. I'll rebase the PR off of the 1.0 branch while we track down the cause of that panic.
The panic is a regression due to #7296
@liv3d The fix for that panic is in master now. If you pull down the updated #7314 PR, it should have that fix in there now.
@jwilder I'm not convinced that panic was unrelated. I tried `SHOW STATS` again and it still panics.
```
[danof@influxdb-1.labs ~]$ influxd version
InfluxDB v1.0.0 (git: jw-delete-shard 0063ca4842a83eb0290c06c3e48b9783c63d1869)
[danof@influxdb-1.labs ~]$ influx -execute "SHOW STATS"
ERR: SHOW STATS [panic:assignment to entry in nil map]
Warning: It is possible this error is due to not setting a database.
Please set a database with the command "use <database>".
SHOW STATS [panic:assignment to entry in nil map]
[danof@influxdb-1.labs ~]$
```
My box on 1.0 GA does:
```
[danof@influxdb-2.labs ~]$ influxd version
InfluxDB v1.0.0 (git: master 37992377a55fbc138b2c01edd4deffed64b53989)
[danof@influxdb-2.labs ~]$ influx -execute "SHOW STATS"
name: cq
--------
queryFail  queryOk
0          13972
...
```
@liv3d That's actually a different panic, in the stats collection for subscribers. It was introduced in #7177 and is now fixed with #7318. I rebased on top of that fix now. `master` has many changes for 1.1 right now, so it's more unstable. Can you test again?
The good news @jwilder is I can indeed run `SHOW STATS`. I'm going to grab some food and retest in a bit.
@jwilder This looks way better from a quick look at our retention alerting:
Showing the series drop
I'm also seeing memory usage at roughly a tenth of what it was on 1.0 GA.
@stevenh We'll need to do more testing during the week, but this actually looks good
@liv3d @stevenh Have you had a chance to test further?
@jwilder I've done some more testing and I'm pretty happy with the change; hopefully this can make the next release. I did notice that during shard retention inserts went from ~0.09s to ~2.50s, however none of them failed.
I'm a little bit late to the party on this one, but I've also been having problems dropping series / measurements on a roughly 75 GB database with just under 2M series. I tried to drop a measurement earlier today, which took a while, but eventually the command returned to the prompt. However, in the following minutes, influxd started to eat a ton of memory (32 GB + 4 GB swap) and eventually crashed. I've been very hesitant about deleting stuff because this behaviour certainly is not new.
Here's hoping that this patch improves things.
@dswarbrick #7165 may also help you out. If you are able to test it out, that would be very helpful.
That PR needs to be rebased w/ master though.
Bug report
InfluxDB: 1.0.0-beta3
OS: Linux 3.10.0-327.el7.x86_64
Steps to reproduce:
Expected behaviour: Compaction should have little impact on memory
Actual behaviour: Memory usage increases until the process runs out of memory.
Additional info: Looking into the details from pprof, it's clear that the issue is serious contention on the cache lock, enough so that all queries (read/write) to influx stall almost indefinitely while compaction takes place.
When incoming requests get behind, the problem spirals out of control with more and more memory used for query goroutines.
When we captured it we had 100k goroutines waiting on the cache lock in tsm1.newIntegerAscendingCursor, accounting for 129GB of active memory; this is very clear to see in heap.pdf below.
In our case we have a master/slave setup for reads, with this node being the slave, so it's only taking writes from influx-relay unless the primary is down, which it wasn't; we therefore believe all of this is write traffic, not read traffic.
Once we realised this, we disabled writes by taking the node out of influx-relay and it recovered in ~30 minutes, even from 312GB. At that point we saw a query complete which had taken 13555 minutes.
Machine Spec
Logs: block.txt goroutine2.txt heap2.txt vars.txt iostat.txt heap.pdf