Closed: nh2 closed this 1 year ago.
I also have an uneasy feeling, as any of the functions that are still unsafe may start to notify in the future. Whether they call the notice processor is not documented, and it would be hard to test.
Yes, an implementation change is a risk. Even more likely than starting to notify, they may start calling libpq_gettext().
I guess that struct accessor functions like PQuser are relatively certain to stay struct accessor functions, but we don't have guarantees.
Personally I'd really like to have a flag to the GHC RTS that (with some overhead costs) tracks whether any unsafe call calls back into Haskell, and terminates the program if one does. That flag could then be used in test suites.
How bad would it be to mark all PGConn/PGResult-related functions as safe?
safe calls release the runtime's capability for the duration of the call (other Haskell threads keep running on a different OS thread), so that they cannot block the runtime. That bookkeeping is costly compared to a struct access or cached IO, but not compared to real IO like network or disk access.
In https://gitlab.haskell.org/ghc/ghc/-/issues/13296#note_132110 Simon Marlow estimates the overhead of a safe call at about 100 ns.
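A rough way to check that figure on one's own machine is a micro-benchmark sketch like the one below. It is not taken from this thread. It uses libm's `cos` as a hypothetical stand-in for a cheap libpq accessor such as PQuser, and it uses `GHC.Clock.getMonotonicTimeNSec` from `base`. The names `cosUnsafe`, `cosSafe` and `nsPerCall` are made up for illustration. Compile it with `-threaded`, since that is the RTS where the safe/unsafe difference matters most. The numbers vary with machine and GHC version, so no expected output is given.

```haskell
{-# LANGUAGE ForeignFunctionInterface #-}
module Main (main) where

import Foreign.C.Types (CDouble (..))
import GHC.Clock (getMonotonicTimeNSec)

-- The same C function imported twice; only the safety annotation differs.
-- Typed in IO so GHC cannot float or share the calls out of the loop.
foreign import ccall unsafe "math.h cos"
  cosUnsafe :: CDouble -> IO CDouble

foreign import ccall safe "math.h cos"
  cosSafe :: CDouble -> IO CDouble

-- Call f n times and return the average wall-clock time per call in ns.
nsPerCall :: (CDouble -> IO CDouble) -> Int -> IO Double
nsPerCall f n = do
  start <- getMonotonicTimeNSec
  let go :: Int -> IO ()
      go 0 = pure ()
      go i = f (fromIntegral i) >>= \r -> r `seq` go (i - 1)
  go n
  end <- getMonotonicTimeNSec
  pure (fromIntegral (end - start) / fromIntegral (max 1 n))

main :: IO ()
main = do
  let n = 1000000
  u <- nsPerCall cosUnsafe n
  s <- nsPerCall cosSafe n
  putStrLn ("unsafe: " ++ show u ++ " ns/call")
  putStrLn ("safe:   " ++ show s ++ " ns/call")
```

The difference between the two lines approximates the per-call cost of releasing and reacquiring the capability.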
So the answer would depend on whether there are use cases that call one of these functions very quickly in a loop, e.g. to parse a multi-GB query by parts, or something like that. I have not yet investigated how likely that is to occur.
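A back-of-the-envelope calculation shows when that loop case starts to matter. The sketch multiplies the number of safe calls by the ~100 ns estimate. The row, column and calls-per-cell counts are hypothetical inputs, not measurements from this thread, and `safeOverheadSeconds` is an illustrative name.

```haskell
module Main (main) where

-- Extra seconds spent on FFI-call overhead when every cell access goes
-- through a 'safe' call. All inputs are hypothetical; 100 ns is the
-- estimate quoted in the GHC issue linked above.
safeOverheadSeconds
  :: Int     -- ^ rows in the result
  -> Int     -- ^ columns per row
  -> Int     -- ^ safe libpq calls per cell
  -> Double  -- ^ overhead per safe call, in nanoseconds
  -> Double
safeOverheadSeconds rows cols callsPerCell overheadNs =
  fromIntegral (rows * cols * callsPerCell) * overheadNs / 1e9

main :: IO ()
main = do
  -- 1,000,000 rows x 10 columns x 1 call x 100 ns = 1e9 ns
  print (safeOverheadSeconds 1000000 10 1 100)  -- prints 1.0
  -- Hypothetically, two calls per cell (value and length) double it.
  print (safeOverheadSeconds 1000000 10 2 100)  -- prints 2.0
```

Even for ten million cells, the added overhead stays in the range of seconds. That is small next to the network transfer of such a result, but not negligible for tight in-memory loops.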
In any case, I propose that we first merge this fix for the hangs, and then launch a separate investigation into whether a safer general default would be sensible.
For some googleability, and in case others run into similar problems, this is how I found the cause of the hang:
I added ghc-options: -debug to enable the debug RTS, then ran dist/build/tests/tests +RTS -N16 -v (-v is the flag that turns on event logging, thus printing stuff like GC events). This prints output like the following (only the last lines are included):
cap 1: thread 3 stopped (yielding)
cap 2: thread 397 stopped (yielding)
cap 3: thread 319 stopped (yielding)
cap 4: thread 136 stopped (yielding)
cap 5: thread 7 stopped (suspended while making a foreign call)
cap 6: waking up thread 370 on cap 6
cap 7: requesting sequential GC
cap 8: thread 315 stopped (yielding)
cap 9: thread 11 stopped (yielding)
cap 10: thread 333 stopped (yielding)
cap 11: thread 13 stopped (yielding)
cap 12: thread 408 stopped (yielding)
cap 13: running thread 379 (ThreadRunGHC)
cap 14: created thread 416
cap 15: thread 241 stopped (yielding)
This suggested that foreign calls were involved.
Attaching to the process with gdb -p $PID and running thread apply all backtrace to print all threads' backtraces, I found one with this trace:
Thread 32 (Thread 0x7fd34affd000 (LWP 1714036) "benaco-tests:w"):
#0 0x00007fd3fad11cf5 in __futex_abstimed_wait_common64 () from /nix/store/vjq3q7dq8vmc13c3py97v27qwizvq7fd-glibc-2.33-59/lib/libpthread.so.0
#1 0x00007fd3fad0bc22 in pthread_cond_wait@@GLIBC_2.3.2 () from /nix/store/vjq3q7dq8vmc13c3py97v27qwizvq7fd-glibc-2.33-59/lib/libpthread.so.0
#2 0x00000000039cd0a9 in waitCondition ()
#3 0x0000000003991234 in waitForWorkerCapability ()
#4 0x0000000003991c10 in yieldCapability ()
#5 0x000000000399da16 in scheduleYield ()
#6 0x000000000399ce24 in schedule ()
#7 0x000000000399fe72 in scheduleWaitThread ()
#8 0x0000000003994675 in rts_evalIO ()
#9 0x0000000001ba9e60 in zdbenacozm0zi1zi0zi0zmCSvtJjDyD2JEU6ocAylRawzdBenacoziStorezdbenacozzm0zzi1zzi0zzi0zzmCSvtJjDyD2JEU6ocAylRawzuBenacozziStorezuinlinezzuczzuffizzu6989586621681244124 () <- this is my notice processor, generated by inline-c
#10 0x00007fd3fc69209c in pqGetErrorNotice3 () from /nix/store/ay1gs2am2ani1kyyfjpgbsvl5ynm2vpw-postgresql-9.6.24-lib/lib/libpq.so.5
#11 0x00007fd3fc6923b2 in pqParseInput3 () from /nix/store/ay1gs2am2ani1kyyfjpgbsvl5ynm2vpw-postgresql-9.6.24-lib/lib/libpq.so.5
#12 0x00007fd3fc689415 in PQisBusy () from /nix/store/ay1gs2am2ani1kyyfjpgbsvl5ynm2vpw-postgresql-9.6.24-lib/lib/libpq.so.5
#13 0x0000000002af1a42 in postgresqlzmsimplezm0zi6zi4zmJcdIv2KqvAC7dE8DaaZZEkW_DatabaseziPostgreSQLziSimpleziInternal_zdwgetResult_info () <- Haskell function calling postgresql-simple
#14 0x0000004200e14248 in ?? ()
#15 0x0000000000000000 in ?? ()
The key thing here is that PQisBusy() (via pqParseInput3 and pqGetErrorNotice3) invoked my notice processor, which called back into the Haskell runtime (rts_evalIO). So I checked whether the wrapper for that function was safe, as it needs to be. But it was unsafe.
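The underlying rule comes from the Haskell 2010 FFI: a foreign import may only be declared `unsafe` if the C function is guaranteed not to call back into Haskell. Below is a minimal self-contained sketch of the same situation. It uses C's `qsort` with a Haskell comparator as a stand-in for PQisBusy invoking a Haskell notice processor. The names `mkCompare`, `c_qsort`, `compareCInts` and `sortViaC` are hypothetical, and this is not the postgresql-libpq code.

```haskell
{-# LANGUAGE ForeignFunctionInterface #-}
module Main (main) where

import Control.Exception (bracket)
import Foreign.C.Types (CInt (..), CSize (..))
import Foreign.Marshal.Array (peekArray, withArrayLen)
import Foreign.Ptr (FunPtr, Ptr, freeHaskellFunPtr)
import Foreign.Storable (peek, sizeOf)

-- C comparator type int (*)(const void *, const void *), specialised to int.
type Compare = Ptr CInt -> Ptr CInt -> IO CInt

-- "wrapper" turns a Haskell closure into a C function pointer.
foreign import ccall "wrapper"
  mkCompare :: Compare -> IO (FunPtr Compare)

-- qsort invokes the comparator, i.e. it calls back into Haskell while the
-- foreign call is in progress, so this import must be 'safe'. Declaring it
-- 'unsafe' would be the same class of bug as the unsafe PQisBusy import.
foreign import ccall safe "stdlib.h qsort"
  c_qsort :: Ptr CInt -> CSize -> CSize -> FunPtr Compare -> IO ()

-- Haskell comparator returning -1, 0 or 1, like a C comparator.
compareCInts :: Compare
compareCInts pa pb = do
  a <- peek pa
  b <- peek pb
  pure $ case compare a b of
    LT -> -1
    EQ -> 0
    GT -> 1

-- Sort a list by round-tripping it through C's qsort.
-- bracket frees the function pointer even if the sort throws.
sortViaC :: [CInt] -> IO [CInt]
sortViaC xs =
  bracket (mkCompare compareCInts) freeHaskellFunPtr $ \cmp ->
    withArrayLen xs $ \n ptr -> do
      c_qsort ptr (fromIntegral n) (fromIntegral (sizeOf (0 :: CInt))) cmp
      peekArray n ptr

main :: IO ()
main = sortViaC [3, 1, 2] >>= print  -- prints [1,2,3]
```

Whether an import is safe is invisible at the call site, which is why a mis-annotated callback-invoking function like PQisBusy() only shows up as an occasional RTS hang under load.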
I'd prefer to default to marking all PGConn and PGResult functions as safe. For example, PQgetlength is called on each table cell (e.g. in the FromRow instances of postgresql-simple), so the price goes up anyway.
So I'd rather err on the safe side, and then mark functions unsafe when there is motivation and evidence that it's fine to do so.
I made all functions safe in https://github.com/haskellari/postgresql-libpq/pull/48
Very good!
Those functions can be reentrant or do IO.
This fixes a case I encountered on our CI machine where the unsafe import of PQisBusy() resulted in GHC RTS hangs when a PostgreSQL notice processor was set to call back into Haskell.