pharo-project / pharo-vm

This is the VM used by Pharo

Adding an implementation of the aio.c using EPOLL in Linux. #805

Closed by tesonep 3 months ago

tesonep commented 4 months ago

This avoids the limit on the number of file descriptors.
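
For context, a minimal sketch of why an epoll-based aio can avoid the limit: select() tracks descriptors in a fixed-size fd_set bitmap (FD_SETSIZE, typically 1024 on Linux), whereas epoll registers each descriptor with the kernel individually, so the numeric value of the fd does not matter. The helper names watch_fd and wait_ready are hypothetical and are not taken from the aio.c in this PR.

```c
#include <assert.h>
#include <sys/epoll.h>
#include <unistd.h>

/* Hypothetical helper: register fd on an epoll instance for the given
   events (e.g. EPOLLIN | EPOLLOUT). Returns 0 on success, -1 on error. */
static int watch_fd(int epfd, int fd, unsigned int events)
{
    struct epoll_event ev = {0};

    assert(fd >= 0);
    ev.events = events;
    ev.data.fd = fd;             /* handed back to us by epoll_wait() */
    return epoll_ctl(epfd, EPOLL_CTL_ADD, fd, &ev);
}

/* Hypothetical helper: wait up to timeout_ms milliseconds and return one
   ready fd, or -1 if nothing became ready (or on error). */
static int wait_ready(int epfd, int timeout_ms)
{
    struct epoll_event ev;
    int n = epoll_wait(epfd, &ev, 1, timeout_ms);

    return n == 1 ? ev.data.fd : -1;
}
```

An epoll instance would be created once with epoll_create1(0); each socket is then registered with watch_fd() when the VM starts caring about it.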

akgrant43 commented 3 months ago

Hi Pablo,

Thanks very much for this!

I've been running a GT VM with this patch for a couple of weeks, and under light load (and a small number of sockets) it works fine. Unfortunately it is crashing the VM in what appears to be two separate situations:

1) Under load

This isn't very reproducible, but here are a couple of symptoms I've seen:

Socket count: 998
Socket count: 999
Socket count: 1000

Thread 2 "PharoVM" received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7ffff7b0c6c0 (LWP 51339)]
0x0000000320001354 in ?? ()
(gdb) info threads
  Id   Target Id                                           Frame 
  1    Thread 0x7ffff7b108c0 (LWP 51336) "GlamorousToolki" 0x00007ffff7c3663d in syscall ()
   from /nix/store/qn3ggz5sf3hkjs2c797xf7nan3amdxmp-glibc-2.38-27/lib/
* 2    Thread 0x7ffff7b0c6c0 (LWP 51339) "PharoVM"         0x0000000320001354 in ?? ()
  3    Thread 0x7ffff58476c0 (LWP 51340) "PharoVM"         0x00007ffff7bfdaf5 in clock_nanosleep@GLIBC_2.2.5 ()
   from /nix/store/qn3ggz5sf3hkjs2c797xf7nan3amdxmp-glibc-2.38-27/lib/
(gdb) thread 1
[Switching to thread 1 (Thread 0x7ffff7b108c0 (LWP 51336))]
#0  0x00007ffff7c3663d in syscall () from /nix/store/qn3ggz5sf3hkjs2c797xf7nan3amdxmp-glibc-2.38-27/lib/
(gdb) bt
#0  0x00007ffff7c3663d in syscall () from /nix/store/qn3ggz5sf3hkjs2c797xf7nan3amdxmp-glibc-2.38-27/lib/
#1  0x000055555572be94 in std::sys::unix::futex::futex_wait () at library/std/src/sys/unix/
#2  std::sys_common::thread_parking::futex::Parker::park () at library/std/src/sys_common/thread_parking/
#3  std::thread::park () at library/std/src/thread/
#4  0x000055555567c9d2 in std::sync::mpmc::list::Channel<T>::recv::{{closure}}::h834c02460d2fa055 ()
#5  0x000055555567c72b in std::sync::mpmc::list::Channel<T>::recv::h1aef3010e00e4ed6 ()
#6  0x000055555567b4fd in vm_runtime::event_loop::EventLoop::run::h542c8ac55af408e0 ()
#7  0x0000555555675943 in vm_runtime::constellation::Constellation::run::hf8ae69d06de4ab09 ()
#8  0x000055555558f9e7 in vm_client_cli::main::h1f12db8a14167363 ()
#9  0x00005555555949d3 in std::sys_common::backtrace::__rust_begin_short_backtrace::h8d0b389f33313915 ()
#10 0x0000555555594a09 in std::rt::lang_start::{{closure}}::hc8dabe638680900a ()
#11 0x000055555572b827 in core::ops::function::impls::{impl#2}::call_once<(), (dyn core::ops::function::Fn<(), Output=i32> + core::marker::Sync + core::panic::unwind_safe::RefUnwindSafe)> () at library/core/src/ops/
#12 std::panicking::try::do_call<&(dyn core::ops::function::Fn<(), Output=i32> + core::marker::Sync + core::panic::unwind_safe::RefUnwindSafe), i32> () at library/std/src/
#13 std::panicking::try<i32, &(dyn core::ops::function::Fn<(), Output=i32> + core::marker::Sync + core::panic::unwind_safe::RefUnwindSafe)> () at library/std/src/
#14 std::panic::catch_unwind<&(dyn core::ops::function::Fn<(), Output=i32> + core::marker::Sync + core::panic::unwind_safe::RefUnwindSafe), i32> () at library/std/src/
#15 std::rt::lang_start_internal::{closure#2} () at library/std/src/
#16 std::panicking::try::do_call<std::rt::lang_start_internal::{closure_env#2}, isize> () at library/std/src/
#17 std::panicking::try<isize, std::rt::lang_start_internal::{closure_env#2}> () at library/std/src/
#18 std::panic::catch_unwind<std::rt::lang_start_internal::{closure_env#2}, isize> () at library/std/src/
#19 std::rt::lang_start_internal () at library/std/src/
#20 0x00005555555949fe in std::rt::lang_start::h594095b100ee11f9 ()
#21 0x00007ffff7b52fce in __libc_start_call_main () from /nix/store/qn3ggz5sf3hkjs2c797xf7nan3amdxmp-glibc-2.38-27/lib/
#22 0x00007ffff7b53089 in __libc_start_main_impl () from /nix/store/qn3ggz5sf3hkjs2c797xf7nan3amdxmp-glibc-2.38-27/lib/
#23 0x0000555555587635 in _start ()

In these cases fewer than 1023 sockets were open, so the next issue was not triggered.

2) socketWritable() in SocketPluginImpl.c still calls select(), so it will fail when an fd >= 1024 is used:

Socket count: 1498
Socket count: 1499
Socket count: 1500
*** stack smashing detected ***: terminated

Thread 2 "PharoVM" received signal SIGABRT, Aborted.
[Switching to Thread 0x7ffff7ad36c0 (LWP 14241)]
0x00007ffff7b7ed7c in __pthread_kill_implementation () from /nix/store/qn3ggz5sf3hkjs2c797xf7nan3amdxmp-glibc-2.38-27/lib/
(gdb) bt
#0  0x00007ffff7b7ed7c in __pthread_kill_implementation ()
   from /nix/store/qn3ggz5sf3hkjs2c797xf7nan3amdxmp-glibc-2.38-27/lib/
#1  0x00007ffff7b2f9c6 in raise () from /nix/store/qn3ggz5sf3hkjs2c797xf7nan3amdxmp-glibc-2.38-27/lib/
#2  0x00007ffff7b188fa in abort () from /nix/store/qn3ggz5sf3hkjs2c797xf7nan3amdxmp-glibc-2.38-27/lib/
#3  0x00007ffff7b19767 in __libc_message.cold () from /nix/store/qn3ggz5sf3hkjs2c797xf7nan3amdxmp-glibc-2.38-27/lib/
#4  0x00007ffff7c0d7f9 in __fortify_fail () from /nix/store/qn3ggz5sf3hkjs2c797xf7nan3amdxmp-glibc-2.38-27/lib/
#5  0x00007ffff7c0eaa4 in __stack_chk_fail () from /nix/store/qn3ggz5sf3hkjs2c797xf7nan3amdxmp-glibc-2.38-27/lib/
#6  0x00007fff9eaf1c08 in socketWritable (s=1088)
    at /home/alistair/gtvm/gtoolkit-vm/target/debug/build/vm-bindings-d83ce2bf61e9bb61/out/extracted/plugins/SocketPlugin/src/common/SocketPluginImpl.c:456
#7  0x00007fff9eaf3d83 in sqSocketSendDone (s=0x36017a1d0)
    at /home/alistair/gtvm/gtoolkit-vm/target/debug/build/vm-bindings-d83ce2bf61e9bb61/out/extracted/plugins/SocketPlugin/src/common/SocketPluginImpl.c:1227
#8  0x00007fff9eafa8f5 in primitiveSocketSendDone ()
    at /home/alistair/gtvm/gtoolkit-vm/target/debug/build/vm-bindings-d83ce2bf61e9bb61/out/extracted/plugins/SocketPlugin/src/common/SocketPlugin.c:2095
#9  0x00000003200017c8 in ?? ()
#10 0x0000010019378348 in ?? ()
#11 0x0000010000756218 in ?? ()
#12 0x00007ffff7ad2560 in ?? ()
#13 0x00007ffff7e42594 in interpret ()
    at /home/alistair/gtvm/gtoolkit-vm/target/debug/build/vm-bindings-d83ce2bf61e9bb61/out/generated/64/vm/src/cointerp.c:3030
Backtrace stopped: frame did not save the PC

(The socket count reaches 1500 because the test program opens all the sockets first and only then starts writing.)
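
The abort in socketWritable (s=1088) is consistent with FD_SET() being used on a descriptor at or above FD_SETSIZE, which writes outside the fd_set on the stack. One way to avoid this, sketched here on the assumption that only a non-blocking check of a single descriptor is needed, is poll(), which takes the descriptor as a plain int. The name fd_is_writable is hypothetical, and this is not necessarily the fix that was applied in the PR.

```c
#include <assert.h>
#include <poll.h>
#include <unistd.h>

/* Hypothetical stand-in for a select()-based check: returns 1 if fd can be
   written without blocking, 0 if not, -1 if poll() itself fails.
   Works for any fd value, since no fd_set bitmap is involved. */
static int fd_is_writable(int fd)
{
    struct pollfd pfd;
    int n;

    assert(fd >= 0);
    pfd.fd = fd;
    pfd.events = POLLOUT;
    pfd.revents = 0;

    n = poll(&pfd, 1, 0);        /* timeout 0: return immediately */
    if (n < 0)
        return -1;
    /* Nothing ready, or an error/hangup condition: report "not writable". */
    if (n == 0 || (pfd.revents & (POLLERR | POLLHUP | POLLNVAL)))
        return 0;
    return (pfd.revents & POLLOUT) ? 1 : 0;
}
```

How error and hangup conditions should be reported to the caller is a design choice; here they are simply folded into "not writable".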

Once the socketWritable() issue is addressed, if you can supply me with a debug Pharo VM I can try to reproduce the crash in a vanilla Pharo VM.

The test harness I've been using is:

  1. Load the test code. This overwrites withTCPEchoServer: to keep track of all the connections and repeat the read and write cycle.
  2. Start the server:
       Smalltalk vm maxExternalSemaphoresSilently: 8192.
       TCPSocketEchoTest new runServer; yourself.
  3. Run 10 times simultaneously.

To reproduce the second issue just run 15 times simultaneously.

tesonep commented 3 months ago

Hi @akgrant43, I have fixed the implementation.

Cheers, Pablo

akgrant43 commented 3 months ago

Hi @tesonep,

Thanks! I'll be able to test it next week.

Cheers, Alistair

akgrant43 commented 3 months ago

Hi Pablo,

This is looking much better!

I'll continue to use it on my personal machine and will report if anything comes up, but from my perspective it's ready to release.

Do you have any idea of when it is likely to be released?

Thanks! Alistair

guillep commented 3 months ago

Thanks for the feedback! Very much appreciated!

tesonep commented 3 months ago

Thanks so much for checking. I was waiting for your OK to start pushing the release.

akgrant43 commented 3 months ago

Great, thanks!