sustrik / dsock

An obsolete project

Main loop errors out consistently after ~32758 connections received. #26

Closed ubergarm closed 7 years ago

ubergarm commented 7 years ago

While playing with libdill and dsock over at ubergarm/binks I've gotten stuck.

The quickest way to reproduce this is to run a slightly modified step3.c from the tutorial that prints how many times the loop has run, e.g.:

    int cnt = 0;
    while(1) {
        int s = tcp_accept(ls, NULL, -1);
        assert(s >= 0);
        s = crlf_start(s);
        assert(s >= 0);
        int cr = go(dialogue(s));
        assert(cr >= 0);
        printf("%d\n", ++cnt);
    }

Then hit it with wrk like:

wrk -t2 -c100 -d30s "http://localhost:5555"

The output looks like:

...
32754
32755
32756
32757
32758
Assertion failed: cr >= 0 (step3.c: main: 72)
Aborted (core dumped)

While I know step3.c isn't speaking proper HTTP at all, I don't understand why this keeps happening across my various attempts to "fake" a simple web server response.

I've fiddled with the following possible solutions with no luck:

  1. Increasing Ephemeral Ports Range e.g. sudo sysctl -w net.ipv4.ip_local_port_range="1024 64000"
  2. Building both alpine:edge and debian:jessie Docker baseimage containers
  3. Binding directly to the host network with docker run --net=host

Possibly related: some tests occasionally fail during make check for both libdill and dsock.

Thanks! I'm looking forward to getting enough stability to benchmark some more!

APPENDIX A

gdb output from a similar run to the one above, in a debian:jessie container

GNU gdb (Debian 7.12-6) 7.12.0.20161007-git
Copyright (C) 2016 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
<http://www.gnu.org/software/gdb/documentation/>.
For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from step3...done.
(gdb) run
Starting program: /app/src/step3
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
step3: step3.c:71: main: Assertion `cr >= 0' failed.

Program received signal SIGABRT, Aborted.
0x00007ffff7850fdf in raise () from /lib/x86_64-linux-gnu/libc.so.6
(gdb) bt full
#0  0x00007ffff7850fdf in raise () from /lib/x86_64-linux-gnu/libc.so.6
No symbol table info available.
#1  0x00007ffff785240a in abort () from /lib/x86_64-linux-gnu/libc.so.6
No symbol table info available.
#2  0x00007ffff7849e47 in ?? () from /lib/x86_64-linux-gnu/libc.so.6
No symbol table info available.
#3  0x00007ffff7849ef2 in __assert_fail ()
   from /lib/x86_64-linux-gnu/libc.so.6
No symbol table info available.
#4  0x0000555555556dc6 in main (argc=1, argv=0x7fffffffed68) at step3.c:71
        s = 65501
        cr = -1
        port = 5555
        addr = {
          data = "\002\000\025\263\000\000\000\000-\vWUUU", '\000' <repeats 17 times>}
        rc = 0
        __PRETTY_FUNCTION__ = "main"
        ls = 0
(gdb) list
38     ssize_t sz = mrecv(s, inbuf, sizeof(inbuf), -1);
39     if(sz < 0) goto cleanup;
40     inbuf[sz] = 0;
41     char outbuf[256];
42     rc = snprintf(outbuf, sizeof(outbuf), "Hello, %s!", inbuf);
43     rc = msend(s, outbuf, rc, -1);
44     if(rc != 0) goto cleanup;
45 cleanup:
46     rc = hclose(s);
47     assert(rc == 0);
(gdb) quit

APPENDIX B

dsock test log in alpine:edge container

===================================================
   dsock 0.5-alpha-23-ge566cc7: ./test-suite.log
===================================================

# TOTAL: 18
# PASS:  17
# SKIP:  0
# XFAIL: 0
# FAIL:  1
# XPASS: 0
# ERROR: 0

.. contents:: :depth: 2

FAIL: tests/bthrottler
======================

Assertion failed: elapsed > 90 && elapsed < 120 (tests/bthrottler.c: main: 77)
FAIL tests/bthrottler (exit status: 134)

APPENDIX C

libdill test log in alpine:edge container

====================================================
   libdill 1.1-2-g697a908-dirty: ./test-suite.log
====================================================

# TOTAL: 12
# PASS:  11
# SKIP:  0
# XFAIL: 0
# FAIL:  1
# XPASS: 0
# ERROR: 0

.. contents:: :depth: 2

FAIL: tests/threads
===================

a
Assert failed: !cr->ready.next (cr.c:402)
FAIL tests/threads (exit status: 134)

sustrik commented 7 years ago

What's the errno from the failure?

ubergarm commented 7 years ago

errno is Out of memory [12]

Here is a simple test "server": dsock-server.c

Testing it with wrk repeatably gives this output on my machine:

...
32620
32621
32622
32623
32624
32625Out of memory [12] (dsock-server.c:78)
Aborted (core dumped)

I'm hoping this is a "userland" error on my side: am I cleaning up or closing sockets, protocols, or coroutines incorrectly?

Thanks!

ubergarm commented 7 years ago

After I decided to RTFM it seems that a coroutine does not clean itself up on return and requires an explicit hclose(h) on its handle.

I believe the same mistake is made in libdill/perf/whisper.c: it works fine when invoked with whispers 32000 but hangs with whispers 34000.

So without digging further into the internals, it seems that if one needs more than ~2^15 (about 32,000) coroutines whose handles are open at the same time, one should allocate the stacks manually and use go_mem().

Some of my confusion comes from hclose() working across both libdill and dsock for sockets, protocols, channels, and coroutines alike. It's convenient, but not immediately obvious imo.

I'm closing this issue and will figure out a way to use a single coroutine to garbage collect the stack space of completed coroutines: they signal completion through a channel, and the collector then calls hclose(handle). I'll post a working example here, hopefully sooner rather than later.

Thanks!

sustrik commented 7 years ago

Yep, that's right. The coroutine handle has to be closed to deallocate the stack.

I think I need to fix the tutorial and some of the examples to take that into account.

ubergarm commented 7 years ago

Cool, the madness is slowly making sense. ;)

While this is more of a libdill specific thing, I made a rough garbage collector demo over here at ubergarm/binks/src/garbage.c

Is there a way for a coroutine to access self.handle or similar through the exposed API? Otherwise one could pass the coroutine's handle to it through a channel after starting it. I can't think of a more elegant way to do this right now.

Cheers!
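
The bookkeeping for such a collector can be modelled in plain C. This is a rough sketch under stated assumptions, not the code in garbage.c: a small ring buffer stands in for the libdill channel, the close callback stands in for hclose(), and each worker is handed its own handle by the parent. All names (done_queue, done_push, reap, demo_close) are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define DONE_CAP 64   /* capacity of the stand-in "channel" */

/* Ring buffer standing in for a channel that carries finished handles. */
struct done_queue {
    int items[DONE_CAP];
    size_t head;    /* index of the next item to read */
    size_t count;   /* number of queued items */
};

/* Called by a worker when it finishes; returns -1 if the queue is full. */
int done_push(struct done_queue *q, int handle) {
    assert(q != NULL);
    if(q->count == DONE_CAP) return -1;
    q->items[(q->head + q->count) % DONE_CAP] = handle;
    q->count++;
    return 0;
}

/* Reaper: close every handle reported so far.
   Returns the number of handles closed successfully. */
int reap(struct done_queue *q, int (*close_handle)(int)) {
    int closed = 0;
    while(q->count > 0) {
        int h = q->items[q->head];
        q->head = (q->head + 1) % DONE_CAP;
        q->count--;
        if(close_handle(h) == 0) closed++;
    }
    return closed;
}

/* Stand-in for hclose() so the sketch runs without libdill. */
int demo_reaped = 0;
int demo_close(int handle) {
    (void)handle;
    demo_reaped++;
    return 0;
}
```

In a real program the worker would send its handle over a channel as its last action, and the collector coroutine would receive it and call hclose() on it.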

sustrik commented 7 years ago

So, the problem is as follows:

Structured concurrency, i.e. systematic management of coroutine lifetimes (see here: http://libdill.org/structured-concurrency.html), is only possible if the parent coroutine owns the child coroutine and kills it before finishing itself.

If a coroutine is left running unattended, it's a possible coroutine leak. It also means that the coroutine will be forcefully shut down in the middle of doing stuff when the process exits.

Now, one can argue that in examples and toy applications it would be nice if a coroutine could close itself so that the code is kept simple. On the other hand, do we want examples to do stuff that would be considered unsafe in a real-world application?

And to answer your question: No, there's no way to get the handle of the current coroutine.

sustrik commented 7 years ago

Hm. Maybe it would be worthwhile to have a helper object that holds a set of handles and closes them when it is itself closed:

int gc = gcmake();
int cr1 = go(worker());
gcadd(gc, cr1);
int cr2 = go(worker());
gcadd(gc, cr2);
...
hclose(gc); // both cr1 and cr2 are closed here
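
A helper like that can be sketched in plain C, without libdill, to show the bookkeeping involved. Everything here is an assumption for illustration: hset_make(), hset_add() and hset_close() are hypothetical names (neither they nor gcmake()/gcadd() are existing libdill or dsock functions), and the close callback stands in for hclose() so the sketch runs on its own.

```c
#include <assert.h>
#include <stdlib.h>

/* Signature of the function used to close one handle (hclose in real code). */
typedef int (*close_fn)(int handle);

/* Hypothetical handle set: owns handles and closes them all at once. */
struct handle_set {
    int *handles;    /* owned handles, in insertion order */
    size_t len;      /* number of handles stored */
    size_t cap;      /* allocated capacity */
    close_fn close;  /* callback used to close each handle */
};

/* Create an empty set; returns NULL on allocation failure. */
struct handle_set *hset_make(close_fn close) {
    assert(close != NULL);
    struct handle_set *hs = malloc(sizeof(*hs));
    if(!hs) return NULL;
    hs->handles = NULL;
    hs->len = 0;
    hs->cap = 0;
    hs->close = close;
    return hs;
}

/* Take ownership of handle h; returns 0 on success, -1 on failure. */
int hset_add(struct handle_set *hs, int h) {
    if(hs->len == hs->cap) {
        size_t cap = hs->cap ? hs->cap * 2 : 8;
        int *p = realloc(hs->handles, cap * sizeof(*p));
        if(!p) return -1;
        hs->handles = p;
        hs->cap = cap;
    }
    hs->handles[hs->len++] = h;
    return 0;
}

/* Close every owned handle and free the set.
   Returns the number of handles closed successfully. */
int hset_close(struct handle_set *hs) {
    int closed = 0;
    for(size_t i = 0; i < hs->len; ++i)
        if(hs->close(hs->handles[i]) == 0) ++closed;
    free(hs->handles);
    free(hs);
    return closed;
}

/* Stand-in for hclose() so the sketch runs without libdill. */
int demo_closed = 0;
int demo_hclose(int handle) {
    (void)handle;
    ++demo_closed;
    return 0;
}
```

With libdill, the callback would be hclose and the handles would be the values returned by go(). In this sketch the set keeps growing until it is closed, so a long-running accept loop would still want to close finished coroutines earlier.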

ubergarm commented 7 years ago

After sleeping on it, there is probably a better way to structure this.

This toy example was trying to:

  1. Get a piece of work
  2. Spawn a new coroutine to handle each piece of work
  3. Repeat

A better approach might be to:

  1. Create N coroutines ahead of time and pass in a channel
  2. Feed work into the channel
  3. The same N coroutines loop until signaled to exit with hclose() from the parent

Being able to gcadd() the handles and then hclose(gc) would make this pattern quite clean.

That feels pretty good. I'll keep playing with it!