janet-lang / janet

A dynamic language and bytecode vm
https://janet-lang.org
MIT License
3.38k stars 217 forks source link

Bugs when cancelling a waiting fiber #1337

Closed czkz closed 4 months ago

czkz commented 7 months ago

Example 1

A fiber is resumed after it's cancelled.

(ev/spawn
  (ev/deadline 1)
  (os/shell "sleep 2"))

Output:

error: deadline expired
  in os/shell [src/core/os.c] on line 1441
  in _spawn [index.janet] on line 3, column 3
error: cannot resume fiber with status :error
  in os/shell [src/core/os.c] on line 1441
  in _spawn [index.janet] on line 3, column 3

Example 2

A fiber with edefer can be resumed in the error path by the wait in the normal path. Note that sh_edefer is printed after edefer_end, which should not be possible.

(ev/spawn
  (ev/deadline 0.1)
  (edefer
    (do
      (print "edefer start")
      (os/shell "sleep 10 && echo sh_edefer")
      (print "edefer end"))
    (print "main start")
    (os/shell "sleep 5 && echo sh_main")
    (print "main end")))

Output (without stacktraces):

main start
edefer start

<5 seconds pass>

sh_main
edefer end
error: deadline expired

<5 seconds pass>

sh_edefer
error: cannot resume fiber with status :error

Example 3

This one is in a sizable project, and I didn't find the exact cause. It's similar to the second example, but janet error messages look strange, and valgrind finds some kind of undefined behavior. When I remove edefer it works fine.

I've included it just in case the stacktraces are useful.

Output of "valgrind janet index.janet" ```txt ==1391== Memcheck, a memory error detector ==1391== Copyright (C) 2002-2022, and GNU GPL'd, by Julian Seward et al. ==1391== Using Valgrind-3.22.0 and LibVEX; rerun with -h for copyright info ==1391== Command: janet index.janet ==1391== ==1391== Invalid read of size 1 ==1391== at 0x4849F24: strlen (in /usr/libexec/valgrind/vgpreload_memcheck-amd64-linux.so) ==1391== by 0x489AD91: janet_formatbv (in /usr/lib/libjanet.so.1.32.1) ==1391== by 0x48C0D7A: janet_dynprintf (in /usr/lib/libjanet.so.1.32.1) ==1391== by 0x48C100C: janet_stacktrace_ext (in /usr/lib/libjanet.so.1.32.1) ==1391== by 0x48C709B: janet_loop1 (in /usr/lib/libjanet.so.1.32.1) ==1391== by 0x48C71FC: janet_loop (in /usr/lib/libjanet.so.1.32.1) ==1391== by 0x48C77F2: janet_loop_fiber (in /usr/lib/libjanet.so.1.32.1) ==1391== by 0x10A5DF: ??? (in /usr/bin/janet) ==1391== by 0x495418B: (below main) (libc_start_call_main.h:58) ==1391== Address 0x54ac030 is 0 bytes after a block of size 32 alloc'd ==1391== at 0x4841784: malloc (in /usr/libexec/valgrind/vgpreload_memcheck-amd64-linux.so) ==1391== by 0x48889C5: ??? (in /usr/lib/libjanet.so.1.32.1) ==1391== by 0x488BF7E: janet_symbol_gen (in /usr/lib/libjanet.so.1.32.1) ==1391== by 0x489FF2F: ??? (in /usr/lib/libjanet.so.1.32.1) ==1391== by 0x48BEC2F: ??? (in /usr/lib/libjanet.so.1.32.1) ==1391== by 0x48C3ECF: ??? (in /usr/lib/libjanet.so.1.32.1) ==1391== by 0x48BDA16: ??? (in /usr/lib/libjanet.so.1.32.1) ==1391== by 0x48C3ECF: ??? (in /usr/lib/libjanet.so.1.32.1) ==1391== by 0x48C3D0E: ??? (in /usr/lib/libjanet.so.1.32.1) ==1391== by 0x48C6DEA: janet_loop1 (in /usr/lib/libjanet.so.1.32.1) ==1391== by 0x48C71FC: janet_loop (in /usr/lib/libjanet.so.1.32.1) ==1391== by 0x48C77F2: janet_loop_fiber (in /usr/lib/libjanet.so.1.32.1) ==1391== error: _00007z0 in ev/sleep [src/core/ev.c] on line 2936 in [kb.janet] on line 26, column 11 in eval-cmd-queue [kb.janet] (tailcall) on line 14, column 3 in _spawn [kb.janet] on line 64, column 14 error: _00007A0 in os/proc-wait [src/core/os.c] on line 612 in emit [evemu/event.janet] on line 23, column 7 in [kb.janet] on line 25, column 11 in eval-cmd-queue [kb.janet] (tailcall) on line 14, column 3 in _spawn [kb.janet] on line 64, column 14 error: cannot resume fiber with status :error in os/proc-wait [src/core/os.c] on line 612 in emit [evemu/event.janet] on line 23, column 7 in [kb.janet] on line 25, column 11 in eval-cmd-queue [kb.janet] (tailcall) on line 14, column 3 in _spawn [kb.janet] on line 64, column 14 ^C ==1391== ==1391== Process terminating with default action of signal 2 (SIGINT) ==1391== at 0x4A2639A: __libc_read (read.c:26) ==1391== by 0x4A2639A: read (read.c:24) ==1391== by 0x48C59B8: ??? (in /usr/lib/libjanet.so.1.32.1) ==1391== by 0x489EBAE: janet_async_start (in /usr/lib/libjanet.so.1.32.1) ==1391== by 0x489EC19: ??? (in /usr/lib/libjanet.so.1.32.1) ==1391== by 0x489EC32: janet_ev_read (in /usr/lib/libjanet.so.1.32.1) ==1391== by 0x48A5ED2: janet_cfun_stream_read (in /usr/lib/libjanet.so.1.32.1) ==1391== by 0x48BEC2F: ??? (in /usr/lib/libjanet.so.1.32.1) ==1391== by 0x48C3ECF: ??? (in /usr/lib/libjanet.so.1.32.1) ==1391== by 0x48C3D0E: ??? (in /usr/lib/libjanet.so.1.32.1) ==1391== by 0x48C6DEA: janet_loop1 (in /usr/lib/libjanet.so.1.32.1) ==1391== by 0x48C71FC: janet_loop (in /usr/lib/libjanet.so.1.32.1) ==1391== by 0x48C77F2: janet_loop_fiber (in /usr/lib/libjanet.so.1.32.1) ==1391== ==1391== HEAP SUMMARY: ==1391== in use at exit: 4,816,266 bytes in 50,098 blocks ==1391== total heap usage: 230,598 allocs, 180,500 frees, 22,067,016 bytes allocated ==1391== ==1391== LEAK SUMMARY: ==1391== definitely lost: 0 bytes in 0 blocks ==1391== indirectly lost: 0 bytes in 0 blocks ==1391== possibly lost: 912 bytes in 3 blocks ==1391== still reachable: 4,815,354 bytes in 50,095 blocks ==1391== suppressed: 0 bytes in 0 blocks ==1391== Rerun with --leak-check=full to see details of leaked memory ==1391== ==1391== For lists of detected and suppressed errors, rerun with: -s ==1391== ERROR SUMMARY: 2 errors from 1 contexts (suppressed: 0 from 0) ```
bakpakin commented 7 months ago

The issue here seems to be with the os_shell function, and the fact that is implemented to run on a another thread, and then schedule the original fiber when it completes.

The second instance is a bit more complex, as the initial deadline cancellation triggers the second piece of code to run, which should also immediately expire the deadline.

bakpakin commented 7 months ago

Hard to tell from just the valgrind output, but I think the invalid read is from calling strlen on a symbol created by (gensym). Most function internal to Janet don't require a trailing 0 byte for strings, but all created strings should have such a byte just in case.

bakpakin commented 7 months ago

@czkz Check now for examples 1 and 3. Example 2 I think should still have some strange behavior, but that is because the construction is strange. I have not yet decided on why the correct behavior here should be - my current thought is that "edefer end" and "main end" should not print, but everything else should.

czkz commented 7 months ago

The issue here seems to be with the os_shell function

Note: replacing os/shell with os/spawn resulted in the same behavior (but both are now fixed).

Check now for examples 1 and 3.

Example 1 is fixed for both os/shell and os/spawn. Example 3 now produces less errors :)

the invalid read is from calling strlen on a symbol created by (gensym)

Indeed! This example triggers the valgrind error (even on the new version):

(ev/cancel
  (ev/spawn (ev/sleep 1e9))
  (gensym))

It also spits random garbage in the error, e.g.

error: _00000r0Mw\
                  A
  in _spawn [tst2.janet] on line 2, column 13
czkz commented 7 months ago

I have not yet decided on why the correct behavior here should be - my current thought is that "edefer end" and "main end" should not print, but everything else should.

It could work as if edefer body was offloaded to a new fiber. Not necessarily one-to-one, but something similar to this:

(ev/spawn
  (ev/deadline 0.1)
  (edefer
    (ev/spawn # the only line different from example 2
      (print "edefer start")
      (os/shell "sleep 10 && echo sh_edefer")
      (print "edefer end"))
    (print "main start")
    (os/shell "sleep 5 && echo sh_main")
    (print "main end")))

output:

main start
error: deadline expired
edefer start
sh_main
sh_edefer
edefer end