ipfs / kubo

An IPFS implementation in Go
https://docs.ipfs.tech/how-to/command-line-quick-start/

Huge amount of context switches, growing over time #1036

Closed · okket closed this issue 9 years ago

okket commented 9 years ago

When I let ipfs daemon run for about 48 hours my system starts to show latency issues due to a huge number of context switches. They slowly grow over time and are directly attributable to ipfs (the problem does not exist without running the daemon).

Here is a nice graph to illustrate: [graph: interrupts-week]

Every other metric, like packets per second (about 300), connections (about 50), memory usage, etc. does not grow out of bounds.

Guesswork: With vmstat 1 I see, almost exactly every 10 seconds, a huge spike in context switches, currently about 20k. I am fairly sure that this is the daemon.

I'll update this issue tomorrow with fresh numbers and hopefully a dump of the daemon.

whyrusleeping commented 9 years ago

So, the next time you see this happening, could I get the output of ps -eLf | grep ipfs and also the stack dump from pressing Ctrl + \?

okket commented 9 years ago

ps -eLf | grep ipfs yields:

ipfs     10438 29363 10438  0    1 05:02 pts/0    00:00:00 ps -eLf
ipfs     10439 29363 10439  0    1 05:02 pts/0    00:00:00 grep --color=auto -i ipfs
ipfs     22072     1 22072  0    1 Apr07 ?        00:00:01 tmux
ipfs     22073 22072 22073  0    1 Apr07 pts/1    00:00:00 -bash
ipfs     22167 22073 22167  0  117 Apr07 pts/1    00:00:00 ipfs daemon
[...]
ipfs     22167 22073 22317  0  117 Apr07 pts/1    00:00:00 ipfs daemon

I can get a core dump of the running process with gcore -o <file> <pid>. Does this help? Or do you need the dump of the terminated process? The latter is a bit trickier: the environment options did not allow core dumps when the process was started. The best idea I have is to somehow halt the process (with a signal?) and then gcore.

okket commented 9 years ago

A 1-2 minute strace summary:

% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 72.33    9.250829          96     96790     31186 futex
 22.64    2.895659         119     24235           epoll_wait
  3.81    0.487598          14     35409           select
  0.63    0.080998           3     30416      2570 read
  0.43    0.054642           3     16720           write
  0.11    0.013771        6886         2         1 restart_syscall
  0.05    0.006987           2      3154           sched_yield
------ ----------- ----------- --------- --------- ----------------
100.00   12.790484                206726     33757 total

A short system call trace of one of the threads:

22168      0.000047 restart_syscall(<... resuming interrupted call ...> <unfinished ...>
22168      0.000041 <... restart_syscall resumed> ) = 0
22168      0.000041 select(0, NULL, NULL, NULL, {0, 20} <unfinished ...>
22168      0.000026 <... select resumed> ) = 0 (Timeout)
22168      0.000026 select(0, NULL, NULL, NULL, {0, 20} <unfinished ...>
22168      0.000106 <... select resumed> ) = 0 (Timeout)
22168      0.000042 select(0, NULL, NULL, NULL, {0, 20} <unfinished ...>
22168      0.000051 <... select resumed> ) = 0 (Timeout)
22168      0.000041 futex(0x1083a40, FUTEX_WAIT, 0, {60, 0} <unfinished ...>
22168      0.000047 <... futex resumed> ) = 0
22168      0.000047 select(0, NULL, NULL, NULL, {0, 20} <unfinished ...>
22168      0.000057 <... select resumed> ) = 0 (Timeout)
[...]
whyrusleeping commented 9 years ago

If you send the process SIGQUIT it will dump all the goroutine stacks.

But that's a lot of threads, and quite a large number of calls to futex, epoll_wait and select... Looks to me like a goroutine leak.
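
A quick way to corroborate a goroutine leak from inside a Go process is to watch the goroutine count over time and dump stacks when it climbs. A minimal, generic sketch (plain Go, not go-ipfs code; interval and output are illustrative):

package main

import (
	"log"
	"os"
	"runtime"
	"runtime/pprof"
	"time"
)

func main() {
	for range time.Tick(30 * time.Second) {
		// A steadily climbing number here is the signature of a leak.
		log.Printf("goroutines: %d", runtime.NumGoroutine())
		// Full stack dump of every goroutine, similar to what SIGQUIT prints.
		pprof.Lookup("goroutine").WriteTo(os.Stderr, 2)
	}
}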

jbenet commented 9 years ago

I'll push a changeset that adds an option to disable reuseport. would be good to see the numbers then.
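
Roughly, such a switch could look like the sketch below (hypothetical package and function names; the actual changeset is the one that landed via #1038):

package transport

import "os"

// reuseportIsAvailable stands in for whatever platform check the dialer performs.
func reuseportIsAvailable() bool { return true }

// reuseportEnabled honors the environment switch: IPFS_REUSEPORT=false turns
// the feature off, anything else leaves it on.
func reuseportEnabled() bool {
	return os.Getenv("IPFS_REUSEPORT") != "false"
}

func useReuseport() bool {
	return reuseportEnabled() && reuseportIsAvailable()
}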

okket commented 9 years ago

Sounds great. I'll gladly try that changeset and report.

I also managed to get the last 2k lines of the kill -SIGQUIT output out of the tmux session. The IPFS hash for the file is QmYGvyL3ZUQaQqwk9Zc4uUrpxWG42gh7X86UhRPeeU92U4

Here is an updated graph which shows the daemon restart: [graph: interrupts-week 2]

jbenet commented 9 years ago

@okket changeset available in the branch for #1038 -- it will be merged soon too.

okket commented 9 years ago

Thank you for merging. I'm running the master branch now with IPFS_REUSEPORT=false ipfs daemon, but I am not sure the new code gets called. The small typo (log.Infof, see https://github.com/ipfs/go-ipfs/pull/1042) did not error. Consequently I do not see the confirmation message in the logs.

okket commented 9 years ago

Ok, my reasoning was, hmm, just wrong. Sorry. Since I am not able to catch the log message for some reason I just changed the log_Infof into a print in my test setup. The code is called and hopefully the daemon now does not eat up resources over time. I'll report in a day or two.

Still, I find the number of system calls the daemon emits a bit alarming:

$ timeout 60 strace -c -f -p $(pidof ipfs)
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 72.08    3.705090         103     36124      3984 futex
 19.80    1.017752          38     26701           epoll_wait
  6.25    0.321034          18     17715           select
  1.03    0.053204           2     25965      2644 read
  0.75    0.038429           3     15155           write
  0.08    0.004328           3      1562           sched_yield
  0.01    0.000512          73         7           madvise
  0.00    0.000167          84         2         1 restart_syscall
------ ----------- ----------- --------- --------- ----------------
100.00    5.140516                123231      6629 total

whyrusleeping commented 9 years ago

@okket that is pretty concerning. The futex calls are goroutine scheduling going on, and they imply we probably have too many goroutines. epoll_wait is typically in the network stack; I know that the REUSEPORT code makes a large number of calls to it, and go/net probably does as well.

At any rate, I think some profiling needs to be done, and eventlogs need to be turned off.
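
For a long-running daemon, one common way to profile on demand is Go's standard net/http/pprof handler; a generic sketch (this is not the mechanism used later in this thread, which is the IPFS_PROF switch):

package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // registers the /debug/pprof/* handlers
)

func main() {
	// CPU profile:      go tool pprof http://localhost:6060/debug/pprof/profile
	// Goroutine stacks: curl http://localhost:6060/debug/pprof/goroutine?debug=2
	log.Fatal(http.ListenAndServe("localhost:6060", nil))
}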

okket commented 9 years ago

Preliminary results indicate that current master with IPFS_REUSEPORT=false ipfs daemon is still generating more context switches over time:

[graph: interrupts-week 3]

okket commented 9 years ago

I forgot to mention: Because Fedora does not include P-224 (for patent reasons I guess) I had to remove support from ipfs:

diff --git a/p2p/crypto/key.go b/p2p/crypto/key.go
index 7f0a13a..31dfeb3 100644
--- a/p2p/crypto/key.go
+++ b/p2p/crypto/key.go
@@ -102,8 +102,8 @@ func GenerateEKeyPair(curveName string) ([]byte, GenSharedKey, error) {
        var curve elliptic.Curve

        switch curveName {
-       case "P-224":
-               curve = elliptic.P224()
+//     case "P-224":
+//             curve = elliptic.P224()
        case "P-256":
                curve = elliptic.P256()
        case "P-384":
diff --git a/p2p/crypto/secio/al.go b/p2p/crypto/secio/al.go
index 9a0cadf..9e0ffce 100644
--- a/p2p/crypto/secio/al.go
+++ b/p2p/crypto/secio/al.go
@@ -18,7 +18,7 @@ import (
 )

 // List of supported ECDH curves
-var SupportedExchanges = "P-256,P-224,P-384,P-521"
+var SupportedExchanges = "P-256,P-384,P-521"

 // List of supported Ciphers
 var SupportedCiphers = "AES-256,AES-128,Blowfish"

I hope this change does not, in some crazy way I cannot possibly imagine, create a feedback loop or trigger something else responsible for the growing context switch problem.

whyrusleeping commented 9 years ago

I don't see why Fedora not having it means you have to remove it from ipfs; P-224 is part of the Go standard library.

whyrusleeping commented 9 years ago

Also, I have two Fedora boxes running ipfs without modification just fine. Maybe you are running an older version of Go?

okket commented 9 years ago

Nope:

$ yum info golang
Loaded plugins: langpacks
Installed Packages
Name        : golang
Arch        : x86_64
Version     : 1.4.2
Release     : 2.fc21
Size        : 10 M
Repo        : installed
From repo   : updates
Summary     : The Go Programming Language
URL         : http://golang.org/
License     : BSD
Description : The Go Programming Language.

The build fails with P-224 support:

$ make build
cd cmd/ipfs && go build -i
# github.com/ipfs/go-ipfs/p2p/crypto
../../../gocode/src/github.com/ipfs/go-ipfs/p2p/crypto/key.go:106: undefined: elliptic.P224
Makefile:27: recipe for target 'build' failed
make: *** [build] Error 2

whyrusleeping commented 9 years ago

Can you also run go version for me? I get the following:

[jeromy@localhost ~]$ lsb_release -a
LSB Version:    :core-4.1-amd64:core-4.1-noarch
Distributor ID: Fedora
Description:    Fedora release 20 (Heisenbug)
Release:    20
Codename:   Heisenbug
[jeromy@localhost ~]$ uname -a
Linux localhost.localdomain 3.17.8-200.fc20.x86_64 #1 SMP Thu Jan 8 23:26:57 UTC 2015 x86_64 x86_64 x86_64 GNU/Linux
[jeromy@localhost ~]$ go version
go version go1.4.2 linux/amd64
[jeromy@localhost go-ipfs]$ make build
cd cmd/ipfs && go build -i
[jeromy@localhost go-ipfs]$ 

okket commented 9 years ago

$ go version
go version go1.4.2 linux/amd64

Support for P-224 was removed in 2013 (1); Fedora 20 does not ship with it. See the golang-1.2-remove-ECC-p224.patch here:

http://koji.fedoraproject.org/koji/rpminfo?rpmID=6091693

Are you sure you use the system-installed Go like me? No $GOROOT? Also:

$ which go
/bin/go

(1) https://lists.fedoraproject.org/pipermail/golang/2013-December/000596.html

okket commented 9 years ago

I'll install the official Go binary distribution locally to get this side issue out of the way.

whyrusleeping commented 9 years ago

Ah, I always install from source. It is really weird that Fedora ships a patched Go.

whyrusleeping commented 9 years ago

It appears that the Fedora maintainers are paranoid; Bruce Schneier quote incoming:

"On the crypto bits in your guardian piece, I found especially interesting that you suggest classic discrete log crypto over ecc. I want to ask if you could elaborate more on that."

I no longer trust the constants. I believe the NSA has manipulated them through their relationships with industry.

https://www.schneier.com/blog/archives/2013/09/the_nsa_is_brea.html#c1675929

whyrusleeping commented 9 years ago

https://bugzilla.redhat.com/show_bug.cgi?id=1210447

jbenet commented 9 years ago

@whyrusleeping agreed. This should be a change in Go.

okket commented 9 years ago

Unchanged master, official go binaries, started with IPFS_REUSEPORT=false ipfs daemon, still growing:

[graph: interrupts-week 4]

okket commented 9 years ago

Maybe a hint: If you look at these graphs and numbers, you see that the timer interrupts (green) are growing linearly (on a logarithmic scale!) up to the maximum of 716, which corresponds to 10k context switches. Nothing else on my server uses this many timers, but the really scary fact (and why I keep pressing this) is the linear growth. This is not natural, will most likely not stop until the server crashes, and can (IMHO) only be explained by some kind of 'leak'.

I will look into this with debug ideas from http://elinux.org/Kernel_Timer_Systems after my system has finished zfs scrub (hello ZoL 0.6.4).

[graph: irqstats-week]

whyrusleeping commented 9 years ago

okket, could you run ipfs with IPFS_PROF set to true, then Ctrl+C the daemon when context switches get really high? I want both the cpuprofile and memprofile it outputs.
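
For reference, profile-until-interrupted behavior like this can be built from the standard runtime/pprof package; a generic sketch of the idea (not the actual IPFS_PROF implementation; file names are placeholders):

package main

import (
	"log"
	"os"
	"os/signal"
	"runtime"
	"runtime/pprof"
)

func main() {
	cpu, err := os.Create("cpu.prof")
	if err != nil {
		log.Fatal(err)
	}
	pprof.StartCPUProfile(cpu)

	// Profile until the process receives Ctrl+C.
	sig := make(chan os.Signal, 1)
	signal.Notify(sig, os.Interrupt)
	<-sig

	pprof.StopCPUProfile()
	cpu.Close()

	mem, err := os.Create("mem.prof")
	if err != nil {
		log.Fatal(err)
	}
	runtime.GC() // flush recent allocations into the heap profile
	pprof.WriteHeapProfile(mem)
	mem.Close()
}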

whyrusleeping commented 9 years ago

Also, thanks a ton for helping us debug this issue!!

okket commented 9 years ago

Please ignore my last posting above. The interrupt count on my system is not reliable:

This system BIOS has enabled interrupt remapping on a chipset that contains an erratum
making that feature unstable. To maintain system stability interrupt remapping is being
disabled.  Please contact your BIOS vendor for an update

(Sadly there is no BIOS update available)

I doubt that this causes the high context switch problem. The system is very reliable otherwise. I'll insert a complete SIGQUIT dump when it reaches 10k or so.

whyrusleeping commented 9 years ago

(also, thanks for the heads up on ZoL 0.6.4)

okket commented 9 years ago

~12k context switches per second again, growing linearly as usual:

[graph: interrupts-week 3]

60 second strace:

% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 79.78   18.854636          67    282022     96950 futex
 16.43    3.884003          82     47561         1 epoll_wait
  2.86    0.675084           7     91441        40 select
  0.43    0.101768           2     50400      4397 read
  0.33    0.077317           3     27993           write
  0.13    0.029989           1     20670           sched_yield
  0.04    0.009583         266        36        24 restart_syscall
  0.00    0.000285           1       563        70 rt_sigreturn
------ ----------- ----------- --------- --------- ----------------
100.00   23.632665                520686    101482 total

Now killed with SIGQUIT, here are the debug files:

ipfs.out      QmPh7Srs3W1ChrB9Bwfaz9cfxKDfbk7r2enGHwZ5etKChJ
ipfs.cpuprof  QmePD76ngy2cLJYm4FhzDqY5HXo4WY7Voxzb8nTq8xsn7t
ipfs.memprof  QmS6Qb94DEEwH3USaJdUn1BoSzYvpf74envHGyEpnyNWt1

jbenet commented 9 years ago

Is this with master?

okket commented 9 years ago

It was the master from two days ago. Dang, I should have noted the commit id.

Idea: IPFS should hardcode current git commit id into the binary and output it during startup, along with the official version number.

jbenet commented 9 years ago

Idea: IPFS should hardcode current git commit id into the binary and output it during startup, along with the official version number.

Agreed. see #691 #693

The build process could do this, but it wouldn't be go-gettable, as modifying the binaries doesn't happen on go get.
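
One way a build step can do the embedding is the linker's -X flag; a sketch (the variable name and output string are hypothetical):

package main

import "fmt"

// CurrentCommit is meant to be filled in by the linker at build time, e.g.
//   go build -ldflags "-X main.CurrentCommit abc1234"    (go1.4 syntax)
//   go build -ldflags "-X main.CurrentCommit=abc1234"    (go1.5+ syntax)
// where abc1234 comes from `git rev-parse --short HEAD` in the Makefile.
var CurrentCommit = "unknown"

func main() {
	fmt.Println("commit: " + CurrentCommit)
}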

whyrusleeping commented 9 years ago

The stack trace looks fairly sane: 60 open socket connections, a few streams. Nothing huge by any means... this is bizarre.

whyrusleeping commented 9 years ago

@okket do you still have the binary that you ran for those profiles? pprof really wants the same binary the profiles were generated with.

okket commented 9 years ago

No, the binary was deleted. I'll keep it for the next round(s).

okket commented 9 years ago

Here is the link to the data files from the server restart today:

http://gateway.ipfs.io/ipfs/QmUz7ij4KaM7Z5h9A4o7w1oahQUDwT2CiYpXA43HTGQYy7

okket commented 9 years ago

Another restart:

http://gateway.ipfs.io/ipfs/QmYdXfm5318xNGSaMznYW8iT8HVK8xdDPDhXTUYCXDkDDS

okket commented 9 years ago

This time the daemon crashed with "runtime: out of memory"

http://gateway.ipfs.io/ipfs/QmSSvfWum6QDFxX38m8YT5vHTpT9pxgWywZURHPcVf44yY

whyrusleeping commented 9 years ago

@okket that's what I get on mars every night, lol. Thank you for the logs, I'll let you know if I find anything incriminating in them.

whyrusleeping commented 9 years ago

@okket So, I've seen this exact trace hundreds of times, but for some reason I decided to look at yours a little more closely. And I found the out-of-memory issue:

This line in the panic goroutine:

runtime.mallocgc(0xb1b99a33, 0xa8a320, 0xc200000001, 0xc208fa9c20)

Looks like it's pretty normal, a function taking some pointers to some things... except that the first argument isn't a pointer. It's the allocation size. Looking deeper, it appears that we ask msgio what the next message size is, it tells us this giant number, we say "okay, let's make some memory" and blindly allocate it. I believe we need to set an upper limit on these messages, and/or find out who is sending them.
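
A minimal sketch of the kind of guard being proposed, assuming length-prefixed framing along the lines of msgio (the cap value and names are hypothetical, not the eventual fix):

package msgguard

import (
	"encoding/binary"
	"fmt"
	"io"
)

const maxMessageSize = 8 << 20 // hypothetical 8 MiB cap

// readMsg reads one length-prefixed message but refuses to allocate whatever
// size the peer claims; without the check, a header like 0xb1b99a33 (roughly
// 2.8 GiB) goes straight into make().
func readMsg(r io.Reader) ([]byte, error) {
	var hdr [4]byte
	if _, err := io.ReadFull(r, hdr[:]); err != nil {
		return nil, err
	}
	size := binary.BigEndian.Uint32(hdr[:])
	if size > maxMessageSize {
		return nil, fmt.Errorf("message size %d exceeds limit %d", size, maxMessageSize)
	}
	buf := make([]byte, size)
	if _, err := io.ReadFull(r, buf); err != nil {
		return nil, err
	}
	return buf, nil
}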

whyrusleeping commented 9 years ago

@jbenet o/ PING

jbenet commented 9 years ago

wow, good find!

okket commented 9 years ago

finally :)

whyrusleeping commented 9 years ago

:clap:

jbenet commented 9 years ago

thanks @okket @whyrusleeping -- this is great -- :clap: :clap: