ocsigen / lwt

OCaml promises and concurrent I/O
https://ocsigen.org/lwt
MIT License

EAGAIN when a new system thread can't be created #493

Open koen-struyve opened 6 years ago

koen-struyve commented 6 years ago

If lwt_unix_launch_thread (https://github.com/ocsigen/lwt/blob/master/src/unix/lwt_unix_stubs.c#L195) fails to spawn a new thread (for example when hitting the system-wide limit pid_max), thread_count has already been incremented (https://github.com/ocsigen/lwt/blob/master/src/unix/lwt_unix_stubs.c#L1286) but never gets decremented.

This causes a discrepancy between thread_count and the actual number of worker threads. In particular, we saw a process whose thread_count was 1003 while the number of worker threads actually in use was only 736.
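The invariant being violated can be sketched in a few lines of OCaml. This is a hypothetical illustration, not Lwt's actual code (the real bookkeeping lives in the C stubs, and `launch_worker` here is an invented name): the counter must be rolled back whenever thread creation fails, otherwise it drifts away from the number of live workers.

```ocaml
(* Compile with the threads library, e.g.:
   ocamlfind opt -linkpkg -package threads.posix sketch.ml *)

let thread_count = ref 0

(* Stand-in for lwt_unix_launch_thread. [Thread.create] raises when the
   system refuses to create another thread. *)
let launch_worker (worker : unit -> unit) =
  incr thread_count;
  try
    ignore (Thread.create worker ())
  with exn ->
    (* Without this rollback, [thread_count] overcounts forever, which
       is exactly the discrepancy described above. *)
    decr thread_count;
    raise exn
```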

aantron commented 6 years ago

I tried reproducing this on my macOS system, with what may be success. The program was:

(* ocamlfind opt -linkpkg -package lwt.unix thread_count.ml && ./a.out *)

open Lwt.Infix

let () =
  Lwt_unix.set_pool_size 10000;
  Printf.printf "'Lwt_unix.pool_size ()' is %i\n" (Lwt_unix.pool_size ());

  let old_hook = !Lwt.async_exception_hook in
  Lwt.async_exception_hook := (function
    | Unix.(Unix_error (EAGAIN, _, _)) -> ()
    | exn -> old_hook exn);

  let blocking_stdin =
    Lwt_unix.of_unix_file_descr ~blocking:true ~set_flags:true Unix.stdin in
  let dummy_buffer = Bytes.create 1 in

  let rec start_blocking_io_calls () =
    (* The [Lwt.async] only makes sure that exceptions from [read] get
       reported. *)
    Lwt.async (fun () ->
      Lwt_unix.read blocking_stdin dummy_buffer 0 1 >|= fun count ->
      Printf.printf "%i %c" count (Bytes.get dummy_buffer 0));

    Printf.printf
      "'Lwt_unix.thread_count ()' is now %i\n%!" (Lwt_unix.thread_count ());

    start_blocking_io_calls ()
  in
  start_blocking_io_calls ()

and I modified Lwt_unix.read to always create a worker thread:

diff --git a/src/unix/lwt_unix.cppo.ml b/src/unix/lwt_unix.cppo.ml
index 6e28451..e343869 100644
--- a/src/unix/lwt_unix.cppo.ml
+++ b/src/unix/lwt_unix.cppo.ml
@@ -633,7 +633,7 @@ let read ch buf pos len =
   else
     Lazy.force ch.blocking >>= function
     | true ->
-      wait_read ch >>= fun () ->
+      (* wait_read ch >>= fun () -> *)
       run_job (read_job ch.fd buf pos len)
     | false ->
       wrap_syscall Read ch (fun () -> stub_read ch.fd buf pos len)

This shows a thread count of 10000 according to Lwt, while top shows only 2048 system threads actually started. Note that I had to handle the EAGAIN exception from launch_thread in order to get this to work; are you handling it by any chance?

Without the handler, the result was

[snip]
'Lwt_unix.thread_count ()' is now 2045
'Lwt_unix.thread_count ()' is now 2046
'Lwt_unix.thread_count ()' is now 2047
Fatal error: exception Unix.Unix_error(Unix.EAGAIN, "launch_thread", "")
Raised by primitive operation at file "src/unix/lwt_unix.cppo.ml", line 192, characters 5-31
Called from file "thread_count.ml", line 17, characters 6-51
Called from file "src/core/lwt.ml", line 2210, characters 16-20

Either way, it seems undesirable. I think Lwt should catch this EAGAIN and behave as if the pool limit had been reached, i.e. run the call synchronously. This is all the more true given that this source of EAGAIN applies to almost every system call yet is specific to Lwt, so nobody will be warned about it in any man page. After catching the exception, the thread count should indeed be decremented.
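The proposed fallback could look something like this. It is only a sketch of the idea, assuming hypothetical `run_async` and `run_sync` stand-ins for Lwt's internal job-dispatch paths (these are not real Lwt functions), and matching the `"launch_thread"` function name seen in the backtrace above:

```ocaml
(* Hypothetical sketch: on EAGAIN from thread creation, act as if the
   pool limit was reached and run the job on the main thread instead. *)
let run_job_with_fallback ~run_async ~run_sync job =
  try run_async job
  with Unix.Unix_error (Unix.EAGAIN, "launch_thread", _) ->
    (* Thread creation failed; degrade to a synchronous call rather
       than surfacing an EAGAIN the caller never asked for. *)
    run_sync job
```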

aantron commented 6 years ago

Okay, the attached commit keeps the worker thread count synchronized with reality (at least more so than before). Making Lwt "handle" the failure internally, act as if the pool limit was reached, and run the job synchronously will take some local refactoring of lwt_unix_start_job. That function looks like it has several problems.

koen-struyve commented 6 years ago

We did encounter EAGAIN failures. While investigating the root cause in our code via gdb we noticed the discrepancy between thread_count and the actual number of threads spawned.

Thanks for the quick input.

aantron commented 6 years ago

Thanks for the response. Is it correct to assume that you want the whole EAGAIN issue fixed, not just a correct thread count? :)

koen-struyve commented 6 years ago

Fixing the EAGAIN would be nice, but there is still the problem that (Lwt worker) threads are not released, so other processes on the same system will keep hitting the same thread-count boundary. From a practical point of view, fixing this through configuration (by the library user) is the better option.

aantron commented 6 years ago

Agreed, I remember worrying about thread releasing as well. I think a workaround that may be practical for now is to call Lwt_unix.set_pool_size and set the size to something relatively small. 1000 does seem like a pretty large default, given the above problems.
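For anyone applying the workaround, the call is a one-liner at program startup, before any blocking I/O runs. The value 128 below is an arbitrary example; pick something comfortably below your system's thread limit:

```ocaml
(* Workaround sketch: cap Lwt's worker pool well below the system's
   thread limit, as early as possible in program startup. *)
let () =
  Lwt_unix.set_pool_size 128;
  Printf.printf "pool size is now %i\n" (Lwt_unix.pool_size ())
```

Once the pool is full, further blocking calls queue for an existing worker instead of trying to spawn a new thread, which sidesteps the EAGAIN entirely.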

Can you give an idea of how important it is for your usage to address this fully in the near term? We are also planning to do some major engineering on Lwt_unix that may make all of these problems go away as a side-effect, but the ETA for that might be six months or more, and it would be helpful to get an idea of the relative priorities and time scales :)

koen-struyve commented 6 years ago

The set_pool_size approach suffices for our purposes, so from our usage point of view there is no urgency in this.

aantron commented 6 years ago

Thanks for the info. I'm renaming the issue to reflect the work remaining. Thanks for reporting!