influxdata / influxdb

Scalable datastore for metrics, events, and real-time analytics
https://influxdata.com
Apache License 2.0

Influxdb crashes after 2 hours #7640

Closed rubycut closed 7 years ago

rubycut commented 7 years ago

Bug report

System info: InfluxDB 1.1.0, OS: Debian

Steps to reproduce:

  1. Start InfluxDB
  2. Create 500 different databases with 500 users
  3. Enable authentication

Expected behavior:

Should run normally

Actual behavior:

Crashes after two hours

rubycut commented 7 years ago

At the time of the crash, there were 114k goroutines in the system:

grep "goroutine " error2.log|wc -l
114551

Roughly half of them were stuck at the same place:

grep engine.go:979 error2.log|wc -l
56118
goroutine 19985 [sleep]:
time.Sleep(0x3b9aca00)
    /usr/local/go/src/runtime/time.go:59 +0xe1
github.com/influxdata/influxdb/tsdb/engine/tsm1.(*Engine).compactTSMLevel(0xc429826100, 0xb74601, 0x1, 0xc428dd2300)
    /root/go/src/github.com/influxdata/influxdb/tsdb/engine/tsm1/engine.go:979 +0x90
jwilder commented 7 years ago

@rubycut Can you attach the full panic? The two links provided cut off the actual cause of the panic.

Can you also run the following and attach to the issue:

curl -o block.txt "http://localhost:8086/debug/pprof/block?debug=1" 
curl -o goroutine.txt "http://localhost:8086/debug/pprof/goroutine?debug=1" 
curl -o heap.txt "http://localhost:8086/debug/pprof/heap?debug=1" 
curl -o vars.txt "http://localhost:8086/debug/vars" 
iostat -xd 1 30 > iostat.txt
uname -a > uname.txt
influx -execute "show shards" > shards.txt
influx -execute "show stats" > stats.txt
influx -execute "show diagnostics" > diagnostics.txt
rubycut commented 7 years ago
uname -a
Linux dedxxxx 3.10.95xx #30 SMP Fri Jan 29 11:54:46 CST 2016 x86_64 GNU/Linux
rubycut commented 7 years ago

iostat.txt

rubycut commented 7 years ago

shards.txt

rubycut commented 7 years ago

diagnostics:

name: build
-----------
Branch  Build Time      Commit                                          Version
master                  800da5732b91c816b0a097acf8887fa2af1efa1a        1.1.0

name: network
-------------
hostname
dedxxxx

name: runtime
-------------
GOARCH  GOMAXPROCS      GOOS    version
amd64   24              linux   go1.7.3

name: system
------------
PID     currentTime                     started                         uptime
5324    2016-11-18T20:27:10.806259576Z  2016-11-18T20:01:05.251612142Z  26m5.554647524s
jwilder commented 7 years ago

@rubycut Can you attach the output of show retention policies as well? The number of goroutines is high because it looks like you have 1-day shard group durations for all your shards but a fairly large retention range. You also have a large number of databases.

rubycut commented 7 years ago

vars.txt

https://www.dropbox.com/s/h5ftawg2ax2baoc/vars.txt?dl=0

rubycut commented 7 years ago

@jwilder, show retention policies doesn't work on its own; a database must be selected first. The retention policy is the same in all databases and looks like this:

> show retention policies
name            duration        shardGroupDuration      replicaN        default
default         0s              168h0m0s                1               false
four_months     3000h0m0s       24h0m0s                 1               true
rubycut commented 7 years ago

@jwilder, I attached the full panic. There is nothing important before the panic; InfluxDB was writing, business as usual:

[httpd] <ip_hidden> - c3941_telegraf [18/Nov/2016:12:16:03 +0000] "POST /write?consistency=&db=c3941&precision=ns&rp= HTTP/1.1" 204 0 "-" "InfluxDBClient" c6823a4f-ad88-11e6-9ab6-000000000000 98278
[httpd] <ip_hidden> - c3503_telegraf [18/Nov/2016:12:16:04 +0000] "GET /query?db=&q=CREATE+DATABASE+IF+NOT+EXISTS+%22c3503%22 HTTP/1.1" 400 94 "-" "InfluxDBClient" c6939fb5-ad88-11e6-9ac5-000000000000 993
[httpd] <ip_hidden> - c3439_telegraf [18/Nov/2016:12:16:03 +0000] "POST /write?consistency=&db=c3439&precision=ns&rp= HTTP/1.1" 204 0 "-" "InfluxDBClient" c6870f8e-ad88-11e6-9ab9-000000000000 85296
[httpd] <ip_hidden> - c3623_telegraf [18/Nov/2016:12:16:04 +0000] "POST /write?consistency=&db=c3623&precision=ns&rp= HTTP/1.1" 204 0 "-" "InfluxDBClient" c693d5fa-ad88-11e6-9ac6-000000000000 2071
unexpected fault address 0x0
fatal error: fault
[signal SIGSEGV: segmentation violation code=0x80 addr=0x0 pc=0x40ea9d]
rubycut commented 7 years ago
curl "http://localhost:8086/debug/pprof/goroutine?debug=1"
404 page not found

pprof is not working. I am using stock InfluxDB 1.1.0 downloaded from your site; do I need to recompile or change anything else to turn pprof on?

jwilder commented 7 years ago

@rubycut Can you attach the full trace? That tells me what the panic was, but not where it occurred or what else was going on at the time.

For your shards, I'd recommend using a larger shard group duration. For a four-month retention period, a shard group duration of 1 month would significantly reduce the number of shards and goroutines running on your system. The number of databases compounds the problem as well. The shards don't look very big, so smaller shard durations are likely just creating overhead for little benefit.

It also looks like you may be using a custom kernel, which might be having problems with the number of goroutines/shards/etc. or other dependent system resources associated with them.

jwilder commented 7 years ago

You may need to set:

[http]
pprof-enabled = true
rubycut commented 7 years ago

stats.txt:

https://www.dropbox.com/s/nux9u48fgiuul6i/stats.txt?dl=0

rubycut commented 7 years ago

@jwilder, when InfluxDB crashes, it writes its state to the log; this is what I already sent you. I don't know any other way to obtain the full trace.

jwilder commented 7 years ago

@rubycut The two Dropbox links only have the end of the trace. The beginning is cut off, so they don't indicate what the panic was or where it occurred. This is all I see:

x197

goroutine 119256 [sleep]:
time.Sleep(0x3b9aca00)
    /usr/local/go/src/runtime/time.go:59 +0xe1
github.com/influxdata/influxdb/tsdb/engine/tsm1.(*Engine).compactCache(0xc6aeb83800, 0xc727eece40)
    /root/go/src/github.com/influxdata/influxdb/tsdb/engine/tsm1/engine.go:953 +0xd7
github.com/influxdata/influxdb/tsdb/engine/tsm1.

... rest of trace...
rubycut commented 7 years ago

oh man, Dropbox sucks, here is the gist of the trace:

https://gist.github.com/rubycut/d35daf01887b5d23ddd4571c500c8e64

rubycut commented 7 years ago

Since we changed shardGroupDuration from one day to two weeks, the number of goroutines has been going down; we are at 111k goroutines, and they drop by about 1k each day.

Today, InfluxDB ran for 4.5 hours without going down.

rubycut commented 7 years ago

But we are observing strange memory behavior compared with InfluxDB 0.13.

The blue line is InfluxDB 1.1.0 and the green line is InfluxDB 0.13; both servers are running the same load.

As you can see, InfluxDB 1.1.0 has a constantly rising memory footprint, and I suspect memory could be the cause of the crashes, although the server doesn't seem to be killed by the OOM killer.

The sudden drops in the blue line every couple of hours are when InfluxDB crashes.

screen shot 2016-11-24 at 15 48 35

rubycut commented 7 years ago

This is after running for 50 minutes:

goroutine.txt: https://gist.github.com/995cd72876f8638447fab50af25995d5
block.txt: https://gist.github.com/d3b83295fa62e931af14f77298d57427
heap.txt: https://gist.github.com/595f25f3f77f3e597e060ec9d73a314b

rubycut commented 7 years ago

@jwilder , I added all the info you required. I am standing by if you need anything else.

rubycut commented 7 years ago

This is prior to crash: https://gist.github.com/rubycut/a30c2351664424189e312e168154629e

rubycut commented 7 years ago

Here is the pprof heap map as it is getting full:

https://gist.github.com/b036527bc1163a7f52b0de6fa7fd2a6b

rubycut commented 7 years ago

@jwilder, it seems that

defaultWaitingWALWrites = 10 could be a bottleneck; look at:

https://github.com/influxdata/influxdb/blob/v1.1.0/tsdb/engine/tsm1/wal.go#L71

Is it safe for me to increase this value and recompile?

jwilder commented 7 years ago

@rubycut Increasing it could cause higher memory usage. You may want to try lowering it since you have a lot of databases and shards.

rubycut commented 7 years ago

I can't get it to work. Whether I reduce it to 5 or increase it, memory always grows, as visible in this heatmap.

rubycut commented 7 years ago

I brought memory under control with cache-snapshot-memory-size = 2621440

Still, it crashes every two hours with:

influxd.log.7-panic: runtime error: invalid memory address or nil pointer dereference
influxd.log.7-[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x0]
influxd.log.7-
influxd.log.7-goroutine 92627 [running]:
influxd.log.7-panic(0x9cf120, 0xc4200120b0)
influxd.log.7-  /usr/local/go/src/runtime/panic.go:500 +0x1a1
influxd.log.7-created by github.com/influxdata/influxdb/tsdb/engine/tsm1.(*Engine).enableLevelCompactions
influxd.log.7-  /root/go/src/github.com/influxdata/influxdb/tsdb/engine/tsm1/engine.go:214 +0x113
influxd.log.7-[signal SIGSEGV: segmentation violation code=0x2 addr=0xc6634fb180 pc=0xc6634fb180]
influxd.log.7-
influxd.log.7-goroutine 39672 [running]:
influxd.log.7-runtime.throw(0xa7c76c, 0x5)
influxd.log.7-  /usr/local/go/src/runtime/panic.go:566 +0x95 fp=0xc7e9024e68 sp=0xc7e9024e48
influxd.log.7-runtime.sigpanic()
influxd.log.7-  /usr/local/go/src/runtime/sigpanic_unix.go:27 +0x288 fp=0xc7e9024ec0 sp=0xc7e9024e68
influxd.log.7-created by github.com/influxdata/influxdb/tsdb/engine/tsm1.(*Engine).enableLevelCompactions
influxd.log.7-  /root/go/src/github.com/influxdata/influxdb/tsdb/engine/tsm1/engine.go:215 +0x13f
influxd.log.7-
allenpetersen commented 7 years ago

I set up 500 databases, each with their own user and a telegraf instance. Each database also has a retention policy with a 4-month duration and a 1-day shardGroupDuration (not ideal, but trying to repro the original problem).

It has been running for 2 hours and hasn't crashed yet but I do see a lot of memory used. I'm at almost 12GB currently.

go tool pprof http://localhost:8086/debug/pprof/heap
Fetching profile from http://localhost:8086/debug/pprof/heap
Saved profile in /home/allen/pprof/pprof.localhost:8086.inuse_objects.inuse_space.013.pb.gz
Entering interactive mode (type "help" for commands)
(pprof) top10 
11.27GB of 11.32GB total (99.57%)
Dropped 564 nodes (cum <= 0.06GB)
Showing top 10 nodes out of 27 (cum >= 0.12GB)
      flat  flat%   sum%        cum   cum%
    5.02GB 44.40% 44.40%     5.02GB 44.40%  github.com/influxdata/influxdb/pkg/pool.(*Bytes).Get
    3.83GB 33.86% 78.26%     3.83GB 33.86%  github.com/influxdata/influxdb/tsdb/engine/tsm1.(*entry).add
    2.28GB 20.13% 98.39%    11.16GB 98.65%  github.com/influxdata/influxdb/tsdb/engine/tsm1.(*Engine).WritePoints
    0.12GB  1.06% 99.45%     0.12GB  1.08%  github.com/influxdata/influxdb/services/httpd.(*Handler).serveWrite
    0.01GB  0.12% 99.57%     3.86GB 34.12%  github.com/influxdata/influxdb/tsdb/engine/tsm1.(*Cache).write
         0     0% 99.57%     0.12GB  1.09%  github.com/bmizerany/pat.(*PatternServeMux).ServeHTTP
         0     0% 99.57%    11.18GB 98.78%  github.com/influxdata/influxdb/coordinator.(*PointsWriter).WritePoints.func1
         0     0% 99.57%    11.18GB 98.78%  github.com/influxdata/influxdb/coordinator.(*PointsWriter).writeToShard
         0     0% 99.57%     0.12GB  1.08%  github.com/influxdata/influxdb/services/httpd.(*Handler).(github.com/influxdata/influxdb/services/httpd.serveWrite)-fm
         0     0% 99.57%     0.12GB  1.09%  github.com/influxdata/influxdb/services/httpd.(*Handler).ServeHTTP
jwilder commented 7 years ago

@allenpetersen Can you try changing https://github.com/influxdata/influxdb/blob/master/tsdb/engine/tsm1/pools.go#L10 to use 10 instead of 1024?

I think the pkg/pool may need to free up some resources if they haven't been used or drop them if they are excessively large. Alternatively, the pool size should probably be much lower since the WAL uses a 4MB buffer by default.

The WAL bytes pool was switched from a sync.Pool to pkg/pool in #7348. Things in a sync.Pool will get garbage collected, whereas the pkg/pool will not.
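
As a minimal sketch of that difference (illustrative only, not the actual pkg/pool or WAL code): a channel-backed pool holds its buffers for the life of the process, while a sync.Pool hands idle buffers back to the runtime so the garbage collector can reclaim them between uses.

package poolsketch

import "sync"

// channelPool keeps up to cap(p.ch) buffers alive indefinitely; nothing it
// holds is ever reclaimed by the garbage collector.
type channelPool struct{ ch chan []byte }

func newChannelPool(n int) *channelPool { return &channelPool{ch: make(chan []byte, n)} }

func (p *channelPool) Get(sz int) []byte {
    select {
    case b := <-p.ch:
        if cap(b) >= sz {
            return b[:sz]
        }
        // pooled buffer too small: it is dropped and a new one allocated
    default:
    }
    return make([]byte, sz)
}

func (p *channelPool) Put(b []byte) {
    select {
    case p.ch <- b:
    default: // pool full, let the GC take it
    }
}

// syncPool returns idle buffers to the runtime; the GC may collect them
// between uses, so retained memory stays low at the cost of reallocations.
var syncPool = sync.Pool{New: func() interface{} { return make([]byte, 0, 4<<20) }}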

rubycut commented 7 years ago

@allenpetersen, we don't use all the databases, so 170 databases would be enough. In each of these databases, we write from 5-100 servers; let's say 20 servers is the average, so 3000 telegraf clients writing will be hard to simulate. Check the _internal database: https.pointsWrittenOK is around 2700 writes per second, and that is the load you have to achieve to emulate it.

With this load, the heap grows constantly: it starts at 7.5 GB of memory and goes up to 26 GB in 100 minutes, which is around 200 MB per minute. That is definitely more than expected, but since the server has 64 GB of RAM, memory growth is not the only bug.

rubycut commented 7 years ago

After the server had been running for over an hour, I executed:

curl -s http://xxxx:8086/debug/pprof/heap > current.heap
sleep 120
curl -s http://xxx:8086/debug/pprof/heap > current2.heap
go tool pprof --base current.heap /workbench/bin/influxd current2.heap

And the result is:

ROUTINE ======================== github.com/influxdata/influxdb/pkg/pool.(*Bytes).Get in /workbench/src/github.com/influxdata/influxdb/pkg/pool/bytes.go
  432.08MB   432.08MB (flat, cum)   100% of Total
         .          .     20:func (p *Bytes) Get(sz int) []byte {
         .          .     21:   var c []byte
         .          .     22:   select {
         .          .     23:   case c = <-p.pool:
         .          .     24:   default:
      -8MB       -8MB     25:           return make([]byte, sz)
         .          .     26:   }
         .          .     27:
         .          .     28:   if cap(c) < sz {
  440.08MB   440.08MB     29:           return make([]byte, sz)
         .          .     30:   }
         .          .     31:
         .          .     32:   return c[:sz]
         .          .     33:}
         .          .     34:

Notice that I didn't use the -alloc_space option, which means this is extra memory taken in two minutes.

So, although it has a pool, it doesn't seem to use byte slices from the pool; it uses make to allocate new ones.

I will also try now with pool size 10.

rubycut commented 7 years ago

@jwilder, decreasing the pool size completely eliminated the memory growth issue. I set both the pool size and defaultWaitingWALWrites to 24, which is the number of cores.

However, influxd is still crashing every two hours.

screen shot 2016-12-02 at 12 22 06

rubycut commented 7 years ago

@jwilder, do you want a pull request for the memory issue?

jwilder commented 7 years ago

@rubycut A pull request for setting the pool size to 10 would be good. For defaultWaitingWALWrites, maybe something like defaultWaitingWALWrites = runtime.NumCPU() would be more portable.
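
For illustration (a generic sketch with assumed names, not the actual tsm1 WAL code), the pattern behind such a limit is a buffered channel used as a counting semaphore, sized from runtime.NumCPU():

package walsketch

import "runtime"

// defaultWaitingWALWrites bounds how many WAL writes may be queued at once;
// sizing it from the CPU count keeps the limit portable across machines.
var defaultWaitingWALWrites = runtime.NumCPU()

// limiter is a counting semaphore implemented with a buffered channel.
var limiter = make(chan struct{}, defaultWaitingWALWrites)

// writeWithLimit blocks once defaultWaitingWALWrites writes are already
// in flight, then releases its slot when the write completes.
func writeWithLimit(write func() error) error {
    limiter <- struct{}{}
    defer func() { <-limiter }()
    return write()
}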

jwilder commented 7 years ago

@rubycut The pool allocations you see in the profile can occur if the slice returned from the pool does not have enough capacity for what was requested. For example, if it got a 10MB slice but a 25MB slice was requested, we allocate a new one. The new one should get added to the pool again when it's not in use. My guess is that a pool size of 1024 was eventually holding onto some very large slices, which consumes a lot of memory.

rubycut commented 7 years ago

@jwilder, I'll make the pull request now.

As far as the crashes are concerned, which still happen every two hours: just before influxd crashed last time, we had a huge 5-second lockup:

screen shot 2016-12-02 at 18 15 37

jwilder commented 7 years ago

@rubycut I'm not sure what to make of that. That call trace is what happens at startup when loading shards. I'm not sure how you would see that just before it crashed.

BTW, how did you generate that image? What tool is that?

rubycut commented 7 years ago

@jwilder, you are right, this could already be the startup, since the instance is automatically restarted by supervisor.

The tool is relatively new: https://stackimpact.com/

allenpetersen commented 7 years ago

@jwilder I've been digging into the memory usage. At least in this 500-database config, the size of the requested buffers varies quite a bit, resulting in a large amount of reallocation, which prevents the pool from doing a good job.

I've been experimenting with making 4 different pools, each with a different buffer size. All buffers in a given pool are allocated with the same capacity, so there is no need for reallocation. After analyzing the requested buffers, here are the pool sizes that work well for this test config (a sketch of this bucketing follows the stats table below).

small      500KB
medium       1MB
large        4MB
XL     requested

I see a handful of 11MB and 13MB requests so I figured XL can just use whatever they requested.

Under the load test for 40 minutes I'm only seeing 151MB being retained by the pools in total. Most of the activity is in the small and large pools at about 50-100 requests per second each.

The XL pool only has 1 item and has only been accessed 189 times in 40 minutes. I've thought about not pooling the XL requests, but it only uses 13MB of RAM and it prevents heap fragmentation by not having to reallocate that same block.

The medium pool got used a lot at startup time and then leveled off. I think that size must get used while reading in the cacheloader.

Here are the pool stats after running for about 40 minutes under load:

Size    Held (MB)  Requests
Small       14      382803
Medium       8      350591
Large      116       36119
XLarge      13         189
Total      151      769702
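
As a rough sketch of the bucketing described above (names and thresholds are assumptions, not the actual WIP branch), each bucket can be a sync.Pool whose buffers all share one fixed capacity, so a recycled buffer never fails a capacity check, while oversized XL requests are allocated exactly as requested:

package bucketsketch

import "sync"

const (
    smallSize  = 500 << 10 // 500KB
    mediumSize = 1 << 20   // 1MB
    largeSize  = 4 << 20   // 4MB
)

// newBucket returns a sync.Pool whose buffers all have the same capacity.
func newBucket(capacity int) *sync.Pool {
    return &sync.Pool{New: func() interface{} { return make([]byte, 0, capacity) }}
}

var (
    small  = newBucket(smallSize)
    medium = newBucket(mediumSize)
    large  = newBucket(largeSize)
)

// getBuffer picks the smallest bucket whose fixed capacity covers sz; anything
// larger than the large bucket ("XL") is allocated as requested.
func getBuffer(sz int) []byte {
    switch {
    case sz <= smallSize:
        return small.Get().([]byte)[:sz]
    case sz <= mediumSize:
        return medium.Get().([]byte)[:sz]
    case sz <= largeSize:
        return large.Get().([]byte)[:sz]
    default:
        return make([]byte, sz)
    }
}

// putBuffer returns a buffer to the bucket matching its capacity and drops
// XL buffers so the pools never retain oversized slices.
func putBuffer(b []byte) {
    switch cap(b) {
    case smallSize:
        small.Put(b[:0])
    case mediumSize:
        medium.Put(b[:0])
    case largeSize:
        large.Put(b[:0])
    }
}

The fixed per-bucket capacity is what removes the cap(c) < sz reallocation path seen in the flat pool.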
jwilder commented 7 years ago

@allenpetersen Interesting. Would you be able to try reverting the pool back to sync.Pool? This is the code that was there before: https://github.com/influxdata/influxdb/pull/7348/files#diff-25cc809730c7e22463531e3af8bf5dec

allenpetersen commented 7 years ago

I was looking at that code and I agree that sync.Pool might not be the best fit because the pools can get cleared often. The thrashing I was seeing with bufPool is gone with the bucketed sizes, and Bytes.Get() is really low on the inuse_objects list. I've had influx under load for 70 minutes and it is quite happy; memory isn't growing.

Here is the WIP commit; there is a lot of logging code I've been using to monitor the pools:

https://github.com/allenpetersen/influxdb/commit/cf76c3ee4f14055a4bb1a08dace9d3879f6c2721

I'm still looking into what seems like a heap fragmentation issue. Linux is reporting 30GB virtual and 11GB resident and there is 16GB of HeapReleased RAM.

# runtime.MemStats
# Alloc = 7887105264
# TotalAlloc = 125086394008
# Sys = 26991342520
# Lookups = 454114
# Mallocs = 1068203357
# Frees = 868244704
# HeapAlloc = 7887105264
# HeapSys = 25709838336
# HeapIdle = 16905437184
# HeapInuse = 8804401152
# HeapReleased = 15752298496
# HeapObjects = 199958653
# Stack = 32309248 / 32309248
# MSpan = 140084000 / 220659712
# MCache = 9600 / 16384
# BuckHashSys = 1721121
allenpetersen commented 7 years ago

@jwilder I quickly swapped out pool.Bytes with sync.Pool and it's performing really well.

https://github.com/allenpetersen/influxdb/commit/dd71b23513e42edd3ff25cfb918faac1491768a0

I'll get exact numbers here in a bit, but for example, with the older code it took almost 2 minutes to finish the cacheloader stage of startup. With sync.Pool it took 15 seconds.

Linux is only reporting 10GB virtual and 4.4GB resident, each about 1/3rd of the older code.

# runtime.MemStats
# Alloc = 2372125720
# TotalAlloc = 32378802448
# Sys = 4349636208
# Lookups = 209778
# Mallocs = 291427402
# Frees = 235183224
# HeapAlloc = 2372125720
# HeapSys = 4078731264
# HeapIdle = 1594015744
# HeapInuse = 2484715520
# HeapReleased = 37691392
# HeapObjects = 56244178
# Stack = 34897920 / 34897920
# MSpan = 50029920 / 57786368
# MCache = 9600 / 16384
# BuckHashSys = 1627555
allenpetersen commented 7 years ago

@jwilder After 70 minutes with sync.Pool things continue to look really good. Overall it seems to be using about half of the memory as before.

Here are the results for a 70 minute test with sync.Pool buckets.

# runtime.MemStats
# Alloc = 4262116424
# TotalAlloc = 91533304016
# Sys = 6153510512
# Lookups = 779259
# Mallocs = 723575666
# Frees = 624965997
# HeapAlloc = 4262116424
# HeapSys = 5792759808
# HeapIdle = 1475100672
# HeapInuse = 4317659136
# HeapReleased = 0
# HeapObjects = 98609669
# Stack = 35028992 / 35028992
# MSpan = 67064000 / 75710464
# MCache = 9600 / 16384
# BuckHashSys = 1696723
# NextGC = 7056486574
# PauseNs = [530289 314619 320596 1096046 399833 308890 266669 471729 483568 400554 509909 550167 513533 301108 91926 68829 290793 56072 103335 109035 81930 103377 92215 52479 55663 106495 106711 69988 215513 222778 226723 52625 208723 56263 248848 225839 111737 269211 265993 80512 351578 256429 186419 198496 343178 462119 217230 2149745 460028 581279 726927 1155523 1180971 1252676 1633985 2201706 3144175 3008831 3424641 3293706 3314497 4204496 3700088 3804114 8579937 4113989 4518797 3754106 3912663 3819891 4386894 3742700 4217162 3866343 4455789 4059505 4565848 4185661 1892691 3760559 4036164 4194639 4772064 4188802 5763421 4162223 4569555 4053164 5886723 3843425 4263090 4310646 7016731 5057678 4185530 3967028 4009587 4095376 4807964 4115051 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
# NumGC = 100
# DebugGC = false

I'll do a similar run with 1.1.0 now.

allenpetersen commented 7 years ago

Here is a comparison of 1.1.0 to master+sync.Pool buckets

                 1.1.0       sync.Pool    delta
Alloc         7337573088    4262116424    58.09%
TotalAlloc  102558047400   91533304016    89.25%
Sys          10669295144    6153510512    57.67%
Lookups           748249        779259   104.14%
Mallocs        933161203     723575666    77.54%
Frees          767299281     624965997    81.45%
HeapAlloc     7337573088    4262116424    58.09%
HeapSys      10072588288    5792759808    57.51%
HeapIdle      2716852224    1475100672    54.29%
HeapInuse     7355736064    4317659136    58.70%
HeapReleased  1911767040             0          
HeapObjects    165861922      98609669    59.45%

To be honest, I don't understand the 1.9GB of HeapReleased; my guess is something like a fragmented heap, so there are no full pages the kernel can take back.

allenpetersen commented 7 years ago

@jwilder I ran a few more tests and using fixed-size sync.Pools worked best. Unfortunately there is a lot of variability between the test runs. I want to make a repeatable workload with influx_stress to better compare the different mechanisms before making a PR.

e-dard commented 7 years ago

@rubycut @allenpetersen there are a couple of problems with the current pool in pkg/pool.

@rubycut's fix in #7683 will help with memory management for systems under load with occasional large buffer requirements, but the cost is increased allocations due to more contention over the now smaller pool.

@allenpetersen we originally used sync.Pool, but I believe the issue under very high loads was that it aggressively released items back to be garbage collected, so the number of allocations was quite high. I think the tradeoff in going from a sync.Pool-backed pool to a fixed channel-backed pool was reduced allocations at the cost of increased memory usage.

I've been working on a new pkg/pool which will track requested buffer sizes and dynamically release buffers from the pool if they're not appropriately sized. I'm hoping it will give us the best of both worlds (memory usage and allocations over time). Would you be interested in trying it out when the branch is ready?
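
Purely as an illustration of that idea (names and thresholds are assumptions, not the actual branch), such a pool might track the sizes callers request and decline to keep buffers that are far larger than what the workload needs:

package adaptivesketch

import "sync"

// adaptiveBytes is a channel-backed pool that remembers the largest recently
// requested size and releases buffers that are disproportionately large.
type adaptiveBytes struct {
    mu      sync.Mutex
    maxSeen int
    pool    chan []byte
}

func newAdaptiveBytes(n int) *adaptiveBytes {
    return &adaptiveBytes{pool: make(chan []byte, n)}
}

func (p *adaptiveBytes) Get(sz int) []byte {
    p.mu.Lock()
    if sz > p.maxSeen {
        p.maxSeen = sz
    }
    p.mu.Unlock()
    select {
    case b := <-p.pool:
        if cap(b) >= sz {
            return b[:sz]
        }
    default:
    }
    return make([]byte, sz)
}

func (p *adaptiveBytes) Put(b []byte) {
    p.mu.Lock()
    max := p.maxSeen
    p.mu.Unlock()
    // Drop buffers much larger than anything recently requested instead of
    // keeping them alive in the pool indefinitely.
    if cap(b) > 2*max {
        return
    }
    select {
    case p.pool <- b:
    default:
    }
}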

allenpetersen commented 7 years ago

@e-dard What I noticed while trying to simulate the workload for this issue was a very high number of Get() requests failing the capacity check https://github.com/influxdata/influxdb/blob/master/pkg/pool/bytes.go#L28

    if cap(c) < sz {
        return make([]byte, sz)
    }

There were 200-400 calls to Bytes.Get() per second. Half of those requests were for small buffers of various sizes under 300KB. The other half were for a single size of 1048576 bytes. In a very short period of time, the buffers in the pool channel were essentially randomly sized, resulting in a lot of reallocations when the capacity check failed.

Looking back at the sync.Pool implementation from September, there was similar logic. My guess is the problem was more about discarding these pooled buffers when the capacity check fails than about a GC emptying the pool. Maintaining a few pools of specific buffer sizes removed this thrashing, and the allocations went way down. Actually, the total memory in all the pools went down quite a bit as well.

I've been busy this week and haven't been able to simplify my test procedure. I've looked at stress and stress/v2 but this workload is rather specific and I think it will be easier to just write a simple program to generate the load.

Let me know when your new branch is ready, I'm happy to test it.

allenpetersen commented 7 years ago

@e-dard Here is the latest version of my experiment with pools of specific sizes:

https://github.com/allenpetersen/influxdb/commit/7a86d160ac2bba3e97c2ddcdf14c7e108eddac6d

[edit] updated to latest commit

rubycut commented 7 years ago

@e-dard, @allenpetersen, reducing the pool size with my patch solved the memory issues for us, but if you want us to test something, I am available.