janet-lang / janet

A dynamic language and bytecode vm
https://janet-lang.org
MIT License
3.38k stars 217 forks source link

The default meson option `epoll=false` causes a busy loop and segmentation fault. #1391

Closed amano-kenji closed 2 months ago

amano-kenji commented 5 months ago

If I build janet with the default meson option epoll=false, then the following code causes a busy loop.

(def proc (os/spawn ["ls"] :p {:out :pipe}))
(print (ev/read (proc :out) :all))

(forever
  (ev/sleep 1))

The workaround is to replace ev/sleep with os/sleep, remove {:out :pipe}, or close proc with os/proc-close which closes pipes and calls os/proc-wait if it hasn't been called already.

If I build janet with epoll=true, then the issue goes away.

This issue was separated from https://github.com/janet-lang/janet/issues/1386

amano-kenji commented 5 months ago

With epoll=false, the following code causes segmentation fault.

(import spork/sh)
(def devnull (sh/devnull))
(ev/spawn-thread (ev/write devnull "ok"))

Welcome to hell.

sogaiu commented 5 months ago

I got the segfault as well.

I edited src/conf/janetconf.h to have:

#define JANET_EV_NO_EPOLL

and rebuilt janet to reproduce with the code in the comment above.

I don't know how reliable the results are but I used rr to record execution and then snooped around a bit using rr replay.

This is the output I got around the report of a segfault:

Thread 2 received signal SIGSEGV, Segmentation fault.
0x00007f05b167366e in janet_stream_close_impl (stream=<optimized out>) at src/core/ev.c:349
349         JanetStream *last = janet_vm.streams[j];

I guess that's this line.

bakpakin commented 4 months ago

The issue is the copying of streams between threads. There is some thread local (interpreter global) state that needs copying. This only exists for the poll implementation, though.

bakpakin commented 4 months ago

This should be fixed for me as of 12630d3e546cf5a1105cfabc2513d91bdfd23031 - the issue was janet_unmarshal_stream didn't re-register the stream with the event loop.

amano-kenji commented 4 months ago

I built the latest commit with epoll=false, and this issue can still be reproduced with the sample code above.

bakpakin commented 4 months ago

Is the segfault still happening? The 100% cpu is not yet addressed

amano-kenji commented 4 months ago

At least, the segmentation fault is gone now. The 100% cpu usage is not gone, yet.

czkz commented 3 months ago

I used "TCP Echo Server" example from README as the server and janet -e '(def c (net/connect "127.0.0.1" 8000)) (:write c "abc") (ev/sleep 1e10)' as client. The server gets stuck in a busy loop after the client sends some data. Janet compiled with meson setup build && ninja -C build.

0ff8f58be87b00132e43f9a86483e5522a0d1c66 is the first bad commit. The commit is quite big unfortunately.

Server strace before commit (fd=5 - connection socket):

<...>
poll([{fd=3, events=POLLIN}, {fd=6, events=POLLIN}, {fd=5, events=POLLIN}], 3, -1
<waits for data>

Server strace after commit (fd=5 - connection socket):

<...>
poll([{fd=3, events=POLLIN}, {fd=6, events=POLLIN}, {fd=5, events=POLLIN|POLLOUT}], 3, -1) = 1 ([{fd=5, revents=POLLOUT}])
poll([{fd=3, events=POLLIN}, {fd=6, events=POLLIN}, {fd=5, events=POLLIN|POLLOUT}], 3, -1) = 1 ([{fd=5, revents=POLLOUT}])
poll([{fd=3, events=POLLIN}, {fd=6, events=POLLIN}, {fd=5, events=POLLIN|POLLOUT}], 3, -1) = 1 ([{fd=5, revents=POLLOUT}])
<...>

From man poll: POLLIN - There is data to read. POLLOUT - Writing is now possible.

Janet doesn't want to write anything to the socket, yet still waits for it to be writable. Poll immediately signals that it is writable, hence the busy loop.

bakpakin commented 2 months ago

Found the issue, the problem is that passing a valid file descriptor to poll with events = 0 immediately returns with POLLHUP.

See https://groups.google.com/g/comp.unix.programmer/c/bNNadBIEpTo/m/G5gs1mqNhbIJ?pli=1 for a conversation and workaround.

bakpakin commented 2 months ago

@czkz Thanks for that secondary repro, that seems to be a closely related issue with the poll backend. This is not fixed by https://github.com/janet-lang/janet/commit/a9b8f8e8a9d75fee10a3d036854d283a9d10600e however