janet-lang / janet

A dynamic language and bytecode vm
https://janet-lang.org
MIT License
3.38k stars 217 forks source link

eventloop/networking gets stuck on musl #1431

Closed l63l closed 2 months ago

l63l commented 3 months ago

Following happens only when building janet with musl libc since 5f2e287efdbd0877751c346572e645b02f67b75a . When hitting a simple spork/http/server with a load-testing tool like wrk and about 20 parallel connections, the server stops responding on the second invocation of wrk.

And this only happens when using ev/go (which http/server does), but not with ev/call:

(def a (net/listen "127.0.0.1" "8000"))

(defn handler [con]
  (defer (:close con)
    (ev/read con 1000)))

# gets stuck:
(ev/go (fn [] (net/accept-loop a handler)))

# doesn't:
# (forever
#   (def con (net/accept a))
#   (ev/call handler con))

And a client like:

(loop [i :range [0 20]]
  (ev/spawn-thread
    (forever
      (with [con (net/connect "127.0.0.1" "8000" :stream)]
        (:write con "AAA")
        (def res (:read con 1024))
        (print con res)
        (:close con)))))

To get a musl build of janet for example on debian:

apt-get install musl-tools
make CC=musl-gcc
pepe commented 3 months ago

Hello. I am using musl build on Alpine, and I have yet to notice the behavior you are describing. I will try your reproduction code and report back. Thanks for reporting.

pepe commented 3 months ago

I have tried it with the latest Janet from the master branch on Alpine Linux with musl at the time of this writing and see no issues like the one you are describing. We will need more information about your setup.

l63l commented 3 months ago

hm, thanks for testing. I tested on void-linux with host musl-libc (1.1.24), alpine 3.17 vm (musl 1.2.3) and debian bullseye with musl-gcc. Now I fetched alpine 3.19.2 and build janet from master and could reproduce it, too.

You need to keep the "server snippet" running, start the "client snippet", kill the "client snippet" and the second invocation will get stuck.

pepe commented 3 months ago

Oh. I did not get the process right. I am trying it again.

pepe commented 3 months ago

Yep. I can confirm that the problem manifests in this case on Alpine.

psagers commented 2 months ago

I'm experiencing similar behavior on Arch with glibc. This is Janet version 1.34.0-f92f3eb6, built from the repo. I don't see the issue on FreeBSD with an equivalent build (from the same commit).

Here's my repro setup:

(var n-req 0)

(defn handler
  [stream]
  (defer (net/close stream)
    (ev/sleep (math/random))
    (set n-req (inc n-req))
    (printf "%d: %N" n-req (net/peername stream))
    (net/write stream (string/format "%3d: Accepted\n" n-req))))

(defn main
  [&]
  (let [server (net/listen "127.0.0.1" 9000)]
    (net/accept-loop server handler)))

In another shell:

for i in $(seq 100)
do nc -d -w 2 127.0.0.1 9000 &
done

Once the accept-loop gets "stuck", the symptom I see is that some (but not all) connections are accepted and sent to the handler and some of those (but not all) result in data actually being written back to the client. The issue seems to accumulate, such that successive batches of 100 requests result in fewer and fewer successful responses.

I tried (forever (ev/call handler (net/accept server))) and that does seem to work correctly for me as well.

Updated: I see the same behavior with the official build as well.

bakpakin commented 2 months ago

So took another look at this now that I have repro that works on my system, and the root cause is in how we setup epoll/kqueue in edge trigger mode.

This works well for most cases, but for net/accept-loop, this can cause connections to be missed in these micro benchmarks where we make many connections back to back.

The fix is conceptual fairly simple, but I would like to make server sockets use edge trigger mode unless we call net/accept-loop, in which they switch to level trigger mode and make sure we fix kqueue too, which seems to have the same wrong logic here.

bakpakin commented 2 months ago

POC fix that just make all server sockets run in level trigger mode (so not a "real" solution as it would cause a busy loop in the case with (forever (net/accept ...))

diff --git a/src/core/ev.c b/src/core/ev.c
index 335b9738..e373c3f1 100644
--- a/src/core/ev.c
+++ b/src/core/ev.c
@@ -1543,8 +1543,12 @@ static JanetTimestamp ts_now(void) {
 /* Wait for the next event */
 static void janet_register_stream(JanetStream *stream) {
     struct epoll_event ev;
-    ev.events = EPOLLET;
-    if (stream->flags & (JANET_STREAM_READABLE | JANET_STREAM_ACCEPTABLE)) ev.events |= EPOLLIN;
+    if (stream->flags & JANET_STREAM_ACCEPTABLE) {
+        ev.events = EPOLLIN; /* Do NOT edge trigger server sockets */
+    } else {
+        ev.events = EPOLLET;
+    }
+    if (stream->flags & JANET_STREAM_READABLE) ev.events |= EPOLLIN;
     if (stream->flags & JANET_STREAM_WRITABLE) ev.events |= EPOLLOUT;
     ev.data.ptr = stream;
     int status;
@@ -1695,7 +1699,9 @@ static void timestamp2timespec(struct timespec *t, JanetTimestamp ts) {
 void janet_register_stream(JanetStream *stream) {
     struct kevent kevs[2];
     int length = 0;
-    if (stream->flags & (JANET_STREAM_READABLE | JANET_STREAM_ACCEPTABLE)) {
+    if (stream->flags & JANET_STREAM_ACCEPTABLE) {
+        EV_SETx(&kevs[length++], stream->handle, EVFILT_READ, EV_ADD | EV_ENABLE, 0, 0, stream);
+    } else if (stream->flags & JANET_STREAM_READABLE) {
         EV_SETx(&kevs[length++], stream->handle, EVFILT_READ, EV_ADD | EV_ENABLE | EV_CLEAR, 0, 0, stream);
     }
     if (stream->flags & JANET_STREAM_WRITABLE) {
bakpakin commented 2 months ago

@psagers, @l63l or @pepe, let me know if this fixes the problem (especially the musl builds that I can't easily verify) and we can close this out.

psagers commented 2 months ago

This appears to be working for me. Thanks!

l63l commented 2 months ago

musl builds are also fixed. Thanks.