Notice processor and async operation

paurkedal commented 4 years ago

I found an issue when registering a notice processor when using asynchronous request which triggers a notice. The issue can be reproduced by the following code

let statements = [
  "CREATE TEMPORARY TABLE notice_processor_and_async_1 \
    (id serial PRIMARY KEY, name text NOT NULL)";
  "CREATE TEMPORARY TABLE notice_processor_and_async_2 \
    (id integer PRIMARY KEY REFERENCES notice_processor_and_async_1)";
  "DROP TABLE notice_processor_and_async_1 CASCADE";
]

let wait_for_result c =
  c#consume_input;
  while c#is_busy do
    ignore (Unix.select [Obj.magic c#socket] [] [] (-.1.0));
    c#consume_input
  done

let fetch_result c = wait_for_result c; c#get_result

let fetch_single_result c =
  match fetch_result c with
  | None -> assert false
  | Some r -> assert (fetch_result c = None); r

let () = Printexc.register_printer @@ function
 | Postgresql.Error err -> Some (Postgresql.string_of_error err)
 | _ -> None

let () =
  let conn = new Postgresql.connection ~conninfo:Sys.argv.(1) () in
  conn#set_notice_processor print_endline;
  statements |> List.iter begin fun stmt ->
    conn#send_query stmt;
    let r = fetch_single_result conn in
    assert (r#status = Postgresql.Command_ok)
  end

The following backtrace shows that it gets stuck in caml_leave_blocking_section in the notice_ml wrapper trying to acquire a caml_master_lock.

(gdb) run postgresql://
Starting program: /home/urkedal/proj/postgresql-ocaml/_build/default/examples/notice_processor_and_async.exe postgresql://
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
^C
Program received signal SIGINT, Interrupt.
0x00007ffff76739f3 in futex_wait_cancelable (private=<optimized out>, expected=0, 
    futex_word=0x5555558ac458 <caml_master_lock+88>) at ../sysdeps/unix/sysv/linux/futex-internal.h:88
88      ../sysdeps/unix/sysv/linux/futex-internal.h: No such file or directory.
(gdb) bt
#0  0x00007ffff76739f3 in futex_wait_cancelable (private=<optimized out>, expected=0, 
    futex_word=0x5555558ac458 <caml_master_lock+88>) at ../sysdeps/unix/sysv/linux/futex-internal.h:88
#1  __pthread_cond_wait_common (abstime=0x0, mutex=0x5555558ac400 <caml_master_lock>, 
    cond=0x5555558ac430 <caml_master_lock+48>) at pthread_cond_wait.c:502
#2  __pthread_cond_wait (cond=0x5555558ac430 <caml_master_lock+48>, mutex=0x5555558ac400 <caml_master_lock>)
    at pthread_cond_wait.c:655
#3  0x00005555556225be in st_masterlock_acquire.constprop ()
#4  0x00005555556225f9 in caml_thread_leave_blocking_section ()
#5  0x000055555562bdb6 in caml_leave_blocking_section () at signals.c:167
#6  0x000055555561db21 in notice_ml (cb=0x5555558c3750, 
    msg=0x555555923880 "NOTICE:  drop cascades to constraint notice_processor_and_async_2_id_fkey on table notice_processor_and_async_2\n") at postgresql_stubs.c:1404
#7  0x00007ffff78a335a in ?? () from /usr/lib/x86_64-linux-gnu/libpq.so.5
#8  0x00007ffff78a365a in ?? () from /usr/lib/x86_64-linux-gnu/libpq.so.5
#9  0x00007ffff789a296 in PQisBusy () from /usr/lib/x86_64-linux-gnu/libpq.so.5
#10 0x000055555562043c in PQisBusy_stub (v_conn=<optimized out>) at postgresql_stubs.c:1107
#11 0x00005555555dcc2c in camlPostgresql__fun_2904 () at src/postgresql.ml:1121
#12 0x00005555555dac15 in camlPostgresql__protectx_963 () at src/postgresql.ml:755
#13 0x00005555555d1f6e in camlNotice_processor_and_async__wait_for_result_81 () at examples/notice_processor_and_async.ml:11
#14 0x00005555555d203d in camlNotice_processor_and_async__fetch_result_191 () at examples/notice_processor_and_async.ml:16
#15 0x00005555555d206d in camlNotice_processor_and_async__fetch_single_result_193 ()
    at examples/notice_processor_and_async.ml:19
#16 0x00005555555d21e5 in camlNotice_processor_and_async__fun_649 () at examples/notice_processor_and_async.ml:32
#17 0x00005555555ea1a1 in camlStdlib__list__iter_236 () at list.ml:110
#18 0x00005555555d23f6 in camlNotice_processor_and_async__entry () at examples/notice_processor_and_async.ml:30
#19 0x00005555555ceaa9 in caml_program ()
#20 0x0000555555647c10 in caml_start_program ()
#21 0x000055555562a995 in caml_startup_common (argv=0x7fffffffd458, pooling=<optimized out>, pooling@entry=0)
    at startup_nat.c:160
#22 0x000055555562a9fb in caml_startup_exn (argv=<optimized out>) at startup_nat.c:165
#23 caml_startup (argv=<optimized out>) at startup_nat.c:170
#24 0x00005555555ce6ec in main (argc=<optimized out>, argv=<optimized out>) at main.c:44

I'm not sure it it's the async code or the notice processor mechanism which is the issue, but I noticed the comment in notice_ml: It mentions segfault as potential issue, but if the lock is held, I guess a deadlock could also be a plausible symptom? And could the later contribution of async binding have affected the assumptions?

paurkedal commented 4 years ago

A (maybe unrelated) problem may be that notice_ml is called from is_busy, so I presume

diff --git a/src/postgresql_stubs.c b/src/postgresql_stubs.c
index 0a3fadf..fb95740 100644
--- a/src/postgresql_stubs.c
+++ b/src/postgresql_stubs.c
@@ -1104,7 +1104,7 @@ CAMLprim value PQgetResult_stub(value v_conn)
 }

 noalloc_conn_info_intnat(PQconsumeInput)
-noalloc_conn_info(PQisBusy, Val_bool)
+conn_info(PQisBusy, Val_bool)
 noalloc_conn_info_intnat(PQflush)
 noalloc_conn_info_intnat(PQsocket)

will be needed? This could be the case for some other functions, as well. This did not fix the current issue though.

mmottl commented 4 years ago

Thanks for the bug report! Yes, it seems that the stub for PQisBusy needed fixing. It's not just about registering the runtime value, but the call to PQisBusy also needs to happen outside of the runtime lock for correctness.

If you see other functions that could potentially trigger a call to notice_ml (PQisBusy documents that), then please submit this here or as a PR.

Could you please test the latest development branch? If this fix works for you, I'll make a new release.

paurkedal commented 4 years ago

That fixes the version above, but an async version of it now segfaults instead of deadlocking:

let statements = [
  "CREATE TEMPORARY TABLE notice_processor_and_async_1 \
    (id serial PRIMARY KEY, name text NOT NULL)";
  "CREATE TEMPORARY TABLE notice_processor_and_async_2 \
    (id integer PRIMARY KEY REFERENCES notice_processor_and_async_1)";
  "DROP TABLE notice_processor_and_async_1 CASCADE";
]

let wait_for_result c =
  c#consume_input;
  while c#is_busy do
    ignore (Unix.select [Obj.magic c#socket] [] [] (-.1.0));
    c#consume_input
  done

let fetch_result c = wait_for_result c; c#get_result

let fetch_single_result c =
  match fetch_result c with
  | None -> assert false
  | Some r -> assert (fetch_result c = None); r

let () = Printexc.register_printer @@ function
 | Postgresql.Error err -> Some (Postgresql.string_of_error err)
 | _ -> None

let () =
  let conn = new Postgresql.connection ~conninfo:Sys.argv.(1) () in
  conn#set_notice_processor print_endline;
  statements |> List.iter begin fun stmt ->
    conn#send_query stmt;
    let r = fetch_single_result conn in
    assert (r#status = Postgresql.Command_ok)
  end

I'll have a look though the stubs tomorrow to see if I find similar issues.

paurkedal commented 4 years ago

Oh, that was already the version I submitted. Of course, otherwise there would have been no need for PQisBusy. It's getting a bit late here, will have a closer look tomorrow. Your fix was a good hint.

paurkedal commented 4 years ago

I had a look at the ways libpq may call the notice processor, and it seems it is quite pervasive, esp. due to pqInternalNotice. But I'm wondering, what do you mean by the comment that "a runtime feature is needed to fully support this" in notice_ml? Would that be a shortcut to deal with it, or would it be something needed even if we spot and release the runtime system wherever notice_ml might be called?

mmottl commented 4 years ago

If I remember correctly, there was and probably still is no feature that would allow me to detect whether the currently executing thread is already holding the runtime lock or to only acquire it if it doesn't. If that feature existed, only notice_ml would need to deal with the runtime lock. But we'd still need to make sure that all functions that could potentially call notice_ml at least protect needed values from being reclaimed.

paurkedal commented 4 years ago

Right, so we need to make sure to use CAMLparam1 etc. wherever the runtime might be released.

This will make a lot of the today trivial functions, more heavy weight, but the only options I can see would be to buffer the messages (up to a limit) and pop them off at strategic points, which has its own disadvantages.

mmottl commented 4 years ago

CAMLparam1 is pretty cheap, but releasing and acquiring the runtime lock is very expensive. We should certainly not be doing that on each simple call.

Another solution would be to submit a PR to the OCaml team that would allow a thread to determine whether it is holding the runtime lock. This could probably be easily achieved by replacing the thread id counter in the runtime with the result of a call to pthread_self and providing a function to access the id of the currently executing thread. Sadly, getting PRs into production would probably take a long time. We might as well wait for multicore support in OCaml.

paurkedal commented 4 years ago

I agree we shouldn't do excessive locking, and I would worry about overlooking a notice invocation or that a future libpq introduced new notices. Could we instead add something like

method set_notice_processing : [< `Stderr | `Quiet] -> unit

for now? Using a variant rather than a boolean allows more options if we should need before a on-demand locking is on the table, like copying messages to a ring-buffer.

At least this should suffice for paurkedal/ocaml-caqti#33 as the main motivation is to avoid messages on stderr. And I agree, while the notices and warnings from PostgreSQL are educational when using the command line, they seem less relevant once the SQL statements are embedded in software.

vog commented 4 years ago

I'm the author of paurkedal/ocaml-caqti#33 and would like to confirm that for my use case, the proposed solution would be a huge step in the right direction.

mmottl commented 4 years ago

I've merged #38 to address this issue and will make a new release now. Sorry for the delay, a global pandemic got in my way...

mmottl / postgresql-ocaml

Notice processor and async operation #37