At the time of the crash, there were 114k goroutines in the system:
grep "goroutine " error2.log|wc -l
114551
Roughly half of them were stuck at the same place:
grep engine.go:979 error2.log|wc -l
56118
goroutine 19985 [sleep]:
time.Sleep(0x3b9aca00)
/usr/local/go/src/runtime/time.go:59 +0xe1
github.com/influxdata/influxdb/tsdb/engine/tsm1.(*Engine).compactTSMLevel(0xc429826100, 0xb74601, 0x1, 0xc428dd2300)
/root/go/src/github.com/influxdata/influxdb/tsdb/engine/tsm1/engine.go:979 +0x90
@rubycut Can you attach the full panic? The two links provided cut off the actual cause of the panic.
Can you also run the following and attach to the issue:
curl -o block.txt "http://localhost:8086/debug/pprof/block?debug=1"
curl -o goroutine.txt "http://localhost:8086/debug/pprof/goroutine?debug=1"
curl -o heap.txt "http://localhost:8086/debug/pprof/heap?debug=1"
curl -o vars.txt "http://localhost:8086/debug/vars"
iostat -xd 1 30 > iostat.txt
uname -a > uname.txt
influx -execute "show shards" > shards.txt
influx -execute "show stats" > stats.txt
influx -execute "show diagnostics" > diagnostics.txt
uname -a
Linux dedxxxx 3.10.95xx #30 SMP Fri Jan 29 11:54:46 CST 2016 x86_64 GNU/Linux
diagnostics:
name: build
-----------
Branch Build Time Commit Version
master 800da5732b91c816b0a097acf8887fa2af1efa1a 1.1.0
name: network
-------------
hostname
dedxxxx
name: runtime
-------------
GOARCH GOMAXPROCS GOOS version
amd64 24 linux go1.7.3
name: system
------------
PID currentTime started uptime
5324 2016-11-18T20:27:10.806259576Z 2016-11-18T20:01:05.251612142Z 26m5.554647524s
@rubycut Can you attach `show retention policies` as well? The number of goroutines is high because it looks like you have 1 day shard group durations for all your shards but a fairly large retention range. You also have a large number of databases.
@jwilder, `show retention policies` doesn't work on its own; a database must be selected first. The retention policy is the same in all databases, and it looks like this:
> show retention policies
name duration shardGroupDuration replicaN default
default 0s 168h0m0s 1 false
four_months 3000h0m0s 24h0m0s 1 true
@jwilder, I attached the full panic. There is nothing important before the panic; InfluxDB was writing, business as usual:
[httpd] <ip_hidden> - c3941_telegraf [18/Nov/2016:12:16:03 +0000] "POST /write?consistency=&db=c3941&precision=ns&rp= HTTP/1.1" 204 0 "-" "InfluxDBClient" c6823a4f-ad88-11e6-9ab6-000000000000 98278
[httpd] <ip_hidden> - c3503_telegraf [18/Nov/2016:12:16:04 +0000] "GET /query?db=&q=CREATE+DATABASE+IF+NOT+EXISTS+%22c3503%22 HTTP/1.1" 400 94 "-" "InfluxDBClient" c6939fb5-ad88-11e6-9ac5-000000000000 993
[httpd] <ip_hidden> - c3439_telegraf [18/Nov/2016:12:16:03 +0000] "POST /write?consistency=&db=c3439&precision=ns&rp= HTTP/1.1" 204 0 "-" "InfluxDBClient" c6870f8e-ad88-11e6-9ab9-000000000000 85296
[httpd] <ip_hidden> - c3623_telegraf [18/Nov/2016:12:16:04 +0000] "POST /write?consistency=&db=c3623&precision=ns&rp= HTTP/1.1" 204 0 "-" "InfluxDBClient" c693d5fa-ad88-11e6-9ac6-000000000000 2071
unexpected fault address 0x0
fatal error: fault
[signal SIGSEGV: segmentation violation code=0x80 addr=0x0 pc=0x40ea9d]
curl "http://localhost:8086/debug/pprof/goroutine?debug=1"
404 page not found
pprof is not working. I am using stock InfluxDB 1.1.0 downloaded from your site; do I need to recompile or use anything else to turn pprof on?
@rubycut Can you attach the full trace? That tells me what the panic was, but not where it occurred or what else was going on at the time.
For your shards, I'd recommend using a larger shard group duration. For a four month retention period, a shard group duration of 1 month would significantly reduce the number of shards and goroutines running on your system. The number of databases compounds the problem as well. The shards don't look very big, so a smaller shard duration is likely just creating overhead for little benefit.
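Some rough arithmetic (treating the per-shard goroutine count as an estimate, since it depends on engine internals): a 3000h retention with 24h shard groups is about 125 shard groups per database, so a couple hundred databases puts you in the range of 20k shards. If compactTSMLevel runs once per level (1-3) per shard, the ~56k goroutines parked there imply roughly 18-19k open shards, which lines up with the 114k total goroutines you saw. At a 1 month shard group duration the same retention is only 4-5 shard groups per database.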
It looks like you may be using a custom kernel as well which might be having problems with the number of goroutines/shards/etc.. or other dependent system resources associated with them.
You may need to set:
[http]
pprof-enabled = true
@jwilder, when InfluxDB crashes it writes its state to the log; that is what I already sent you. I don't know any other way to obtain the full trace.
@rubycut The two dropbox links only have the end of the trace. The beginning part is cut off which doesn't indicate what the panic was or where it occurred. This is all I see:
x197
goroutine 119256 [sleep]:
time.Sleep(0x3b9aca00)
/usr/local/go/src/runtime/time.go:59 +0xe1
github.com/influxdata/influxdb/tsdb/engine/tsm1.(*Engine).compactCache(0xc6aeb83800, 0xc727eece40)
/root/go/src/github.com/influxdata/influxdb/tsdb/engine/tsm1/engine.go:953 +0xd7
github.com/influxdata/influxdb/tsdb/engine/tsm1.
... rest of trace...
oh man, dropbox sucks, here is a gist of the trace:
https://gist.github.com/rubycut/d35daf01887b5d23ddd4571c500c8e64
Since we changed shardGroupDuration from one day to two weeks, the number of goroutines has been going down; we are at 111k goroutines, and they drop by about 1k each day.
Today, InfluxDB ran for 4.5 hours without going down.
But we are observing strange memory behavior compared with InfluxDB 0.13.
The blue line is InfluxDB 1.1.0, the green is InfluxDB 0.13; both servers are running the same load.
As you can see, InfluxDB 1.1.0 has a constantly rising memory footprint, and I suspect memory could be the cause of the crashes, although the server doesn't seem to die from the OOM killer.
The sudden drops in the blue line every couple of hours are when InfluxDB crashes.
This is after running for 50 minutes:
goroutine.txt: https://gist.github.com/995cd72876f8638447fab50af25995d5
block.txt: https://gist.github.com/d3b83295fa62e931af14f77298d57427
heap.txt: https://gist.github.com/595f25f3f77f3e597e060ec9d73a314b
@jwilder , I added all the info you required. I am standing by if you need anything else.
This is prior to crash: https://gist.github.com/rubycut/a30c2351664424189e312e168154629e
Here is the pprof heap map as it is getting full:
@jwilder, it seems that `defaultWaitingWALWrites = 10` could be a bottleneck, look at:
https://github.com/influxdata/influxdb/blob/v1.1.0/tsdb/engine/tsm1/wal.go#L71
Is it safe for me to increase this value and recompile?
@rubycut Increasing it could cause higher memory usage. You may want to try lowering it since you have a lot of databases and shards.
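For context, here is a minimal sketch of the kind of fixed concurrency limiter a constant like defaultWaitingWALWrites typically sizes: a buffered channel used as a semaphore. This is illustrative only and not the actual wal.go code; all names below are made up.

package main

import (
    "fmt"
    "sync"
)

// fixedLimiter is an illustrative semaphore: a buffered channel whose
// capacity bounds how many WAL writes may be in flight at once.
type fixedLimiter chan struct{}

func newFixedLimiter(n int) fixedLimiter { return make(fixedLimiter, n) }

func (l fixedLimiter) Take()    { l <- struct{}{} } // blocks once n writers are active
func (l fixedLimiter) Release() { <-l }

func main() {
    limit := newFixedLimiter(10) // hypothetical stand-in for defaultWaitingWALWrites

    var wg sync.WaitGroup
    for i := 0; i < 100; i++ {
        wg.Add(1)
        go func(i int) {
            defer wg.Done()
            limit.Take()
            defer limit.Release()
            fmt.Println("writing WAL segment", i) // at most 10 of these run concurrently
        }(i)
    }
    wg.Wait()
}

With this shape, raising the limit lets more writes (and their buffers) be in flight at once, which tends to increase memory rather than remove the bottleneck.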
I can't get it to work; whether I reduce it to 5 or increase it, memory always grows, as visible in this heatmap.
I brought memory under control with cache-snapshot-memory-size = 2621440
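For reference, that setting goes in the [data] section of influxdb.conf (assuming the standard 1.1.0 config layout); 2621440 bytes is roughly 2.5MB, so the cache gets snapshotted to TSM files much sooner than with the default:

[data]
  # snapshot the in-memory cache to disk once it reaches ~2.5MB
  cache-snapshot-memory-size = 2621440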
Still, it crashes every two hours with:
influxd.log.7-panic: runtime error: invalid memory address or nil pointer dereference
influxd.log.7-[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x0]
influxd.log.7-
influxd.log.7-goroutine 92627 [running]:
influxd.log.7-panic(0x9cf120, 0xc4200120b0)
influxd.log.7- /usr/local/go/src/runtime/panic.go:500 +0x1a1
influxd.log.7-created by github.com/influxdata/influxdb/tsdb/engine/tsm1.(*Engine).enableLevelCompactions
influxd.log.7- /root/go/src/github.com/influxdata/influxdb/tsdb/engine/tsm1/engine.go:214 +0x113
influxd.log.7-[signal SIGSEGV: segmentation violation code=0x2 addr=0xc6634fb180 pc=0xc6634fb180]
influxd.log.7-
influxd.log.7-goroutine 39672 [running]:
influxd.log.7-runtime.throw(0xa7c76c, 0x5)
influxd.log.7- /usr/local/go/src/runtime/panic.go:566 +0x95 fp=0xc7e9024e68 sp=0xc7e9024e48
influxd.log.7-runtime.sigpanic()
influxd.log.7- /usr/local/go/src/runtime/sigpanic_unix.go:27 +0x288 fp=0xc7e9024ec0 sp=0xc7e9024e68
influxd.log.7-created by github.com/influxdata/influxdb/tsdb/engine/tsm1.(*Engine).enableLevelCompactions
influxd.log.7- /root/go/src/github.com/influxdata/influxdb/tsdb/engine/tsm1/engine.go:215 +0x13f
influxd.log.7-
I set up 500 databases, each with their own user and a telegraf instance. Each database also has a retention policy with a 4 month duration and a 1 day shardGroupDuration (not ideal, but trying to repro the original problem).
It has been running for 2 hours and hasn't crashed yet but I do see a lot of memory used. I'm at almost 12GB currently.
go tool pprof http://localhost:8086/debug/pprof/heap
Fetching profile from http://localhost:8086/debug/pprof/heap
Saved profile in /home/allen/pprof/pprof.localhost:8086.inuse_objects.inuse_space.013.pb.gz
Entering interactive mode (type "help" for commands)
(pprof) top10
11.27GB of 11.32GB total (99.57%)
Dropped 564 nodes (cum <= 0.06GB)
Showing top 10 nodes out of 27 (cum >= 0.12GB)
flat flat% sum% cum cum%
5.02GB 44.40% 44.40% 5.02GB 44.40% github.com/influxdata/influxdb/pkg/pool.(*Bytes).Get
3.83GB 33.86% 78.26% 3.83GB 33.86% github.com/influxdata/influxdb/tsdb/engine/tsm1.(*entry).add
2.28GB 20.13% 98.39% 11.16GB 98.65% github.com/influxdata/influxdb/tsdb/engine/tsm1.(*Engine).WritePoints
0.12GB 1.06% 99.45% 0.12GB 1.08% github.com/influxdata/influxdb/services/httpd.(*Handler).serveWrite
0.01GB 0.12% 99.57% 3.86GB 34.12% github.com/influxdata/influxdb/tsdb/engine/tsm1.(*Cache).write
0 0% 99.57% 0.12GB 1.09% github.com/bmizerany/pat.(*PatternServeMux).ServeHTTP
0 0% 99.57% 11.18GB 98.78% github.com/influxdata/influxdb/coordinator.(*PointsWriter).WritePoints.func1
0 0% 99.57% 11.18GB 98.78% github.com/influxdata/influxdb/coordinator.(*PointsWriter).writeToShard
0 0% 99.57% 0.12GB 1.08% github.com/influxdata/influxdb/services/httpd.(*Handler).(github.com/influxdata/influxdb/services/httpd.serveWrite)-fm
0 0% 99.57% 0.12GB 1.09% github.com/influxdata/influxdb/services/httpd.(*Handler).ServeHTTP
@allenpetersen Can you try changing https://github.com/influxdata/influxdb/blob/master/tsdb/engine/tsm1/pools.go#L10 to use 10 instead of 1024?
I think the pkg/pool may need to free up some resources if they haven't been used, or drop them if they are excessively large. Alternatively, the pool size should probably be much lower, since the WAL uses a 4MB buffer by default.
The WAL bytes pool was switched from a sync.Pool to pkg/pool in #7348. Things in a sync.Pool will get garbage collected, whereas the pkg/pool will not.
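To make the difference concrete, here is a simplified sketch of the two approaches. This is not the exact pkg/pool or WAL code, just the shape of it:

package bytepool

import "sync"

// channelPool mirrors the idea behind the current pkg/pool.Bytes: a
// fixed-capacity channel of byte slices. Anything sitting in the channel
// is never garbage collected, so a capacity of 1024 can pin 1024 large
// slices indefinitely.
type channelPool struct{ pool chan []byte }

func newChannelPool(capacity int) *channelPool {
    return &channelPool{pool: make(chan []byte, capacity)}
}

func (p *channelPool) Get(sz int) []byte {
    select {
    case c := <-p.pool:
        if cap(c) >= sz {
            return c[:sz]
        }
        // pooled slice is too small for this request: drop it and allocate
    default:
    }
    return make([]byte, sz)
}

func (p *channelPool) Put(c []byte) {
    select {
    case p.pool <- c: // retained until reused, regardless of how big it is
    default: // pool full: drop it and let the GC reclaim it
    }
}

// The sync.Pool alternative hands unused slices back to the garbage
// collector, so memory is reclaimed between GC cycles at the cost of
// re-allocating after the pool has been cleared.
var syncPool = sync.Pool{New: func() interface{} { return []byte(nil) }}

func getSync(sz int) []byte {
    c := syncPool.Get().([]byte)
    if cap(c) < sz {
        return make([]byte, sz)
    }
    return c[:sz]
}

func putSync(c []byte) { syncPool.Put(c) }

With the channel-backed version, a capacity of 1024 combined with multi-MB WAL buffers means the pool itself can pin gigabytes; sync.Pool trades that retention for extra allocations whenever a GC clears it.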
@allenpetersen, we don't use all the databases, so 170 databases would be enough. Into each of these databases we write from 5-100 servers (let's say 20 on average), so roughly 3000 telegraf clients writing will be hard to simulate. Check the _internal database: httpd.pointsWrittenOK is around 2700 writes per second, and that is the load you have to achieve to emulate it.
With this load, the heap grows constantly: it starts at 7.5GB of memory and goes up to 26GB in 100 minutes, which is around 200MB per minute. That is definitely more than expected, but since the server has 64GB of RAM, memory growth is not the only bug.
After the server had been running for over an hour, I executed:
curl -s http://xxxx:8086/debug/pprof/heap > current.heap
sleep 120
curl -s http://xxx:8086/debug/pprof/heap > current2.heap
go tool pprof --base current.heap /workbench/bin/influxd current2.heap
And the result is:
ROUTINE ======================== github.com/influxdata/influxdb/pkg/pool.(*Bytes).Get in /workbench/src/github.com/influxdata/influxdb/pkg/pool/bytes.go
432.08MB 432.08MB (flat, cum) 100% of Total
. . 20:func (p *Bytes) Get(sz int) []byte {
. . 21: var c []byte
. . 22: select {
. . 23: case c = <-p.pool:
. . 24: default:
-8MB -8MB 25: return make([]byte, sz)
. . 26: }
. . 27:
. . 28: if cap(c) < sz {
440.08MB 440.08MB 29: return make([]byte, sz)
. . 30: }
. . 31:
. . 32: return c[:sz]
. . 33:}
. . 34:
Notice that I didn't use the -alloc_space option, which means this is extra memory taken in two minutes.
So although it has a pool, it doesn't seem to use byte slices from the pool; it uses make to allocate new ones.
I will also try now with a pool size of 10.
@jwilder, by decreasing the pool size, the memory growth issue is completely eliminated. I set both the pool size and defaultWaitingWALWrites to 24, which is the number of cores.
However, influxd is still crashing every two hours.
@jwilder, do you want a pull request for the memory issue?
@rubycut A pull request for setting the pool size to 10 would be good. For defaultWaitingWALWrites, maybe something like `defaultWaitingWALWrites = runtime.NumCPU()` would be more portable.
@rubycut The pool allocations you see in the profile can occur if the slice returned from the pool does not have enough capacity for what was requested. For example, it got a 10MB slice, but a 25MB slice was requested, so we allocate a new one. The new one should get added to the pool again when it's not in use. My guess is that a pool size of 1024 was eventually holding onto some very large slices, which consumed a lot of memory.
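For scale (rough arithmetic, not a measurement): if the WAL is handing back 4MB-class buffers, a pool allowed to hold 1024 of them can retain on the order of 4GB by itself, whereas a cap of around 10 keeps the retained amount in the tens of MB.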
@jwilder, I'll make the pull request now.
As far as the crashes are concerned, they still happen every two hours; just before influxd crashed last time, we had a huge 5-second lockup:
@rubycut I'm not sure what to make of that. That call trace is what happens at startup when loading shards. I'm not sure how you would see that just before it crashed.
BTW, how did you generate that image? What tool is that?
@jwilder, you are right, this could already be the startup, since the instance is automatically restarted by supervisor.
The tool is relatively new: https://stackimpact.com/
@jwilder I've been digging into the memory usage. At least in this 500-database config, the size of the requested buffers varies quite a bit, resulting in a large amount of reallocating, which prevents the pool from doing a good job.
I've been experimenting with making 4 different pools, each with a different buffer size. All buffers in a pool are allocated with the same capacity, so there isn't a need for reallocating. After analyzing the requested buffers, here are the pool sizes that work well for this test config (rough sketch of the approach below):
small: 500KB
medium: 1MB
large: 4MB
XL: the requested size
I see a handful of 11MB and 13MB requests so I figured XL can just use whatever they requested.
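Roughly, this is what I mean by bucketed pools. A condensed sketch with the sizes from this test config; the actual WIP code linked further down also pools the XL requests and has the stats/logging:

package bucketpool

// Each bucket hands out slices with a fixed capacity, so a returned
// buffer always satisfies the capacity check for its bucket and never
// needs to be reallocated.
const (
    smallSize  = 512 * 1024      // ~500KB
    mediumSize = 1024 * 1024     // 1MB
    largeSize  = 4 * 1024 * 1024 // 4MB
)

type bucketedPool struct {
    small, medium, large chan []byte // XL requests are not pooled in this sketch
}

func newBucketedPool(n int) *bucketedPool {
    return &bucketedPool{
        small:  make(chan []byte, n),
        medium: make(chan []byte, n),
        large:  make(chan []byte, n),
    }
}

func (p *bucketedPool) bucket(sz int) (chan []byte, int) {
    switch {
    case sz <= smallSize:
        return p.small, smallSize
    case sz <= mediumSize:
        return p.medium, mediumSize
    case sz <= largeSize:
        return p.large, largeSize
    default:
        return nil, sz // XL: allocate exactly what was asked for
    }
}

func (p *bucketedPool) Get(sz int) []byte {
    ch, capSize := p.bucket(sz)
    if ch != nil {
        select {
        case c := <-ch:
            return c[:sz] // always fits: every pooled slice has capSize capacity
        default:
        }
    }
    return make([]byte, sz, capSize)
}

func (p *bucketedPool) Put(c []byte) {
    ch, capSize := p.bucket(cap(c))
    if ch == nil || cap(c) != capSize {
        return // XL or odd-sized slice: let the GC have it
    }
    select {
    case ch <- c:
    default: // bucket full: drop it
    }
}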
Under the load test for 40 minutes I'm only seeing 151MB retained by the pools in total. Most of the activity is in the small and large pools, at about 50-100 requests per second each.
The XL pool only has 1 item and has only been accessed 189 times in 40 minutes. I've thought about not pooling the XL requests, but it only uses 13MB of RAM and it prevents heap fragmentation by not having to reallocate that same block.
The medium pool got used a lot at startup time and then leveled off. I think that size must get used while reading in the cacheloader.
Here are the pool stats after running for about 40 minutes under load
Size Held (MB) Requests
Small 14 382803
Medium 8 350591
Large 116 36119
XLarge 13 189
Total 151 769702
@allenpetersen Interesting. Would you be able to try reverting the pool back to sync.Pool? This is the code that was there before: https://github.com/influxdata/influxdb/pull/7348/files#diff-25cc809730c7e22463531e3af8bf5dec
I was looking at that code and I agree that sync.Pool might not be the best fit because the pools can get cleared often. The thrashing I was seeing with bufPool is gone with the bucketed sizes, and Bytes.Get() is really low on the inuse_objects list. I've had influx under load for 70 minutes and it is quite happy; memory isn't growing.
Here is the WIP commit; there is a lot of logging code I've been using to monitor the pools:
https://github.com/allenpetersen/influxdb/commit/cf76c3ee4f14055a4bb1a08dace9d3879f6c2721
I'm still looking into what seems like a heap fragmentation issue. Linux is reporting 30GB virtual and 11GB resident and there is 16GB of HeapReleased RAM.
# runtime.MemStats
# Alloc = 7887105264
# TotalAlloc = 125086394008
# Sys = 26991342520
# Lookups = 454114
# Mallocs = 1068203357
# Frees = 868244704
# HeapAlloc = 7887105264
# HeapSys = 25709838336
# HeapIdle = 16905437184
# HeapInuse = 8804401152
# HeapReleased = 15752298496
# HeapObjects = 199958653
# Stack = 32309248 / 32309248
# MSpan = 140084000 / 220659712
# MCache = 9600 / 16384
# BuckHashSys = 1721121
@jwilder I quickly swapped out pool.Bytes with sync.Pool and it's performing really well.
https://github.com/allenpetersen/influxdb/commit/dd71b23513e42edd3ff25cfb918faac1491768a0
I'll get exact numbers in a bit, but for example, with the older code it took almost 2 minutes to finish the cacheloader stage of startup. With sync.Pool it took 15 seconds.
Linux is only reporting 10GB virtual and 4.4GB resident, each about 1/3rd of the older code.
# runtime.MemStats
# Alloc = 2372125720
# TotalAlloc = 32378802448
# Sys = 4349636208
# Lookups = 209778
# Mallocs = 291427402
# Frees = 235183224
# HeapAlloc = 2372125720
# HeapSys = 4078731264
# HeapIdle = 1594015744
# HeapInuse = 2484715520
# HeapReleased = 37691392
# HeapObjects = 56244178
# Stack = 34897920 / 34897920
# MSpan = 50029920 / 57786368
# MCache = 9600 / 16384
# BuckHashSys = 1627555
@jwilder After 70 minutes with sync.Pool things continue to look really good. Overall it seems to be using about half the memory it did before.
Here are the results for a 70-minute test with sync.Pool buckets.
# runtime.MemStats
# Alloc = 4262116424
# TotalAlloc = 91533304016
# Sys = 6153510512
# Lookups = 779259
# Mallocs = 723575666
# Frees = 624965997
# HeapAlloc = 4262116424
# HeapSys = 5792759808
# HeapIdle = 1475100672
# HeapInuse = 4317659136
# HeapReleased = 0
# HeapObjects = 98609669
# Stack = 35028992 / 35028992
# MSpan = 67064000 / 75710464
# MCache = 9600 / 16384
# BuckHashSys = 1696723
# NextGC = 7056486574
# PauseNs = [530289 314619 320596 1096046 399833 308890 266669 471729 483568 400554 509909 550167 513533 301108 91926 68829 290793 56072 103335 109035 81930 103377 92215 52479 55663 106495 106711 69988 215513 222778 226723 52625 208723 56263 248848 225839 111737 269211 265993 80512 351578 256429 186419 198496 343178 462119 217230 2149745 460028 581279 726927 1155523 1180971 1252676 1633985 2201706 3144175 3008831 3424641 3293706 3314497 4204496 3700088 3804114 8579937 4113989 4518797 3754106 3912663 3819891 4386894 3742700 4217162 3866343 4455789 4059505 4565848 4185661 1892691 3760559 4036164 4194639 4772064 4188802 5763421 4162223 4569555 4053164 5886723 3843425 4263090 4310646 7016731 5057678 4185530 3967028 4009587 4095376 4807964 4115051 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
# NumGC = 100
# DebugGC = false
I'll do a similar run with 1.1.0 now.
Here is a comparison of 1.1.0 to master+sync.Pool buckets (delta is the sync.Pool value as a percentage of 1.1.0):
Metric 1.1.0 sync.Pool delta
Alloc 7337573088 4262116424 58.09%
TotalAlloc 102558047400 91533304016 89.25%
Sys 10669295144 6153510512 57.67%
Lookups 748249 779259 104.14%
Mallocs 933161203 723575666 77.54%
Frees 767299281 624965997 81.45%
HeapAlloc 7337573088 4262116424 58.09%
HeapSys 10072588288 5792759808 57.51%
HeapIdle 2716852224 1475100672 54.29%
HeapInuse 7355736064 4317659136 58.70%
HeapReleased 1911767040 0
HeapObjects 165861922 98609669 59.45%
To be honest I don't understand the 1.9GB of HeapReleased; my guess is something like a fragmented heap, so there are no full pages the kernel can take back.
@jwilder I ran a few more tests and using fixed-size sync.Pools worked best. Unfortunately there is a lot of variability between the test runs. I want to make a repeatable workload with influx_stress to better compare the different mechanisms before making a PR.
@rubycut @allenpetersen there are a couple of problems with the current pool in pkg/pool.
@rubycut's fix in #7683 will help with memory management for systems under load with occasional large buffer requirements, but the cost is increased allocations due to more contention over the now smaller pool.
@allenpetersen we originally used sync.Pool, but I believe the issue under very high loads was that it aggressively released items back to be garbage collected, so the number of allocations was quite high. I think the tradeoff in going from a sync.Pool-backed pool to a fixed channel-backed pool was to reduce allocations at the cost of increased memory usage.
I've been working on a new pkg/pool which tracks requested buffer sizes and dynamically releases buffers from the pool if they're not appropriately sized. I'm hoping it will give us the best of both worlds (memory usage and allocations over time). Would you be interested in trying it out when the branch is ready?
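Roughly the shape I have in mind; this is purely illustrative and the real branch will differ:

package adaptivepool

import "sync"

// adaptiveBytes is an illustrative pool that remembers the largest size
// requested recently and refuses to keep buffers much bigger than that,
// so the pool cannot pin huge slices long after a spike.
type adaptiveBytes struct {
    mu      sync.Mutex
    pool    chan []byte
    maxSeen int // largest Get() size observed since the last Reset()
}

func newAdaptiveBytes(capacity int) *adaptiveBytes {
    return &adaptiveBytes{pool: make(chan []byte, capacity)}
}

func (p *adaptiveBytes) Get(sz int) []byte {
    p.mu.Lock()
    if sz > p.maxSeen {
        p.maxSeen = sz
    }
    p.mu.Unlock()

    select {
    case c := <-p.pool:
        if cap(c) >= sz {
            return c[:sz]
        }
    default:
    }
    return make([]byte, sz)
}

func (p *adaptiveBytes) Put(c []byte) {
    p.mu.Lock()
    limit := p.maxSeen
    p.mu.Unlock()

    // Drop buffers that are far larger than anything recently requested.
    if limit > 0 && cap(c) > 2*limit {
        return
    }
    select {
    case p.pool <- c:
    default:
    }
}

// Reset is called periodically so the tracked maximum can shrink back
// down after a burst of large writes.
func (p *adaptiveBytes) Reset() {
    p.mu.Lock()
    p.maxSeen = 0
    p.mu.Unlock()
}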
@e-dard What I noticed while trying to simulate the workload for this issue was a very high number of Get() requests failing the capacity check https://github.com/influxdata/influxdb/blob/master/pkg/pool/bytes.go#L28
if cap(c) < sz {
return make([]byte, sz)
}
There were 200-400 calls to Bytes.Get() per second. Half of those requests were for small buffers of various sizes of less than 300k. The other half were for a single size of 1048576 bytes. In a very short period of time buffers in the pool channel were essentially randomly sized resulting in a lot of reallocations when the capacity check failed.
Looking back at the sync.Pool implementation from September, there was similar logic. My guess is the problem was more about discarding these pooled buffers when the capacity check fails rather than a GC emptying the pool. Maintaining a few pools of specific buffer sizes removed this thrashing and the allocations went way down. Actually, the total memory held in all the pools went down quite a bit as well.
I've been busy this week and haven't been able to simplify my test procedure. I've looked at stress and stress/v2 but this workload is rather specific and I think it will be easier to just write a simple program to generate the load.
Let me know when your new branch is ready, I'm happy to test it.
@e-dard Here is the latest version of my experiment with pools of specific sizes.
https://github.com/allenpetersen/influxdb/commit/7a86d160ac2bba3e97c2ddcdf14c7e108eddac6d
[edit] updated to the latest commit
@e-dard, @allenpetersen, reducing the pool size with my patch solved the memory issues for us, but if you want us to test something, I am available.
Bug report
System info: InfluxDB 1.1.0, OS: Debian
Steps to reproduce:
Expected behavior:
Should run normally
Actual behavior:
Crashes after two hours