jfischoff / tmp-postgres

Create temporary postgres instances
BSD 3-Clause "New" or "Revised" License

Test stuck indefinitely when postgres crash does not propagate up #273

Open nh2 opened 3 years ago

nh2 commented 3 years ago

Hi, for static-haskell-nix I run a lot of packages in its CI.

For tmp-postgres, that CI seems to have discovered a way to get the test suite stuck indefinitely after a postgres crash (unclear what nature that crash exactly is; coredumpctl suggests signals 3 (SIGQUIT) and 6 (SIGABRT) were involved).

The failure doesn't seem to be deterministic, because sometimes the tests ran through fine.

With the test hanging for hours, this is the process tree as shown in htop:

```
 905433 root       20   0 1120M 52492 10864 S  0.0  0.3  1:10.64 │  ├─ nix-daemon 905424
1086601 nixbld16   20   0  2788  2460  1312 S  0.0  0.0  0:02.16 │  │  └─ bash -e /nix/store/9krlzvny65gdc8s7kpb6lkx8cd02c25b-default-builder.sh
1110494 nixbld16   20   0  5188  2372  1260 S  0.0  0.0  0:00.07 │  │     ├─ initdb --no-sync --pgdata=/tmp/tmp-postgres-data-c1955f83d7f7fd80
1110552 nixbld16   20   0  196M 36532 30912 S  0.0  0.2  0:00.91 │  │     │  └─ postgres --single -F -O -j -c search_path=pg_catalog -c exit_on_error=true template1
1107329 nixbld16   20   0  1.0T 30608 10392 S  0.0  0.2  0:00.47 │  │     └─ Setup test
```

This is how I observe the postgres crash/shutdown:

```
# journalctl -e -n 100000 | grep postgres
Jul 12 08:24:37 hetzner systemd-coredump[1110377]: Resource limits disable core dumping for process 1110370 (postgres).
Jul 12 08:24:37 hetzner systemd-coredump[1110377]: Process 1110370 (postgres) of user 30016 dumped core.

TIME                            PID   UID   GID SIG COREFILE  EXE
Mon 2021-07-12 04:37:52 UTC  628495 30013 30000   3 none      /nix/store/kapbll03fh361xxkrhafv48zymawzm6z-postgresql-11.12/bin/postgres
Mon 2021-07-12 04:37:55 UTC  628649 30013 30000   6 none      /nix/store/kapbll03fh361xxkrhafv48zymawzm6z-postgresql-11.12/bin/postgres
Mon 2021-07-12 08:24:37 UTC 1110370 30016 30000   3 none      /nix/store/kapbll03fh361xxkrhafv48zymawzm6z-postgresql-11.12/bin/postgres
```

Unfortunately I didn't have core dump files enabled so I can't gdb into the postgres process to get more info.

To provide additional data, here are strace invocations of the 3 involved processes, in case that helps:

```
# strace -fyp "$(pidof initdb)"
strace: Process 1110494 attached
writev(4, [{iov_base="COMMENT ON FUNCTION iso8859_1_to"..., iov_len=138}, {iov_base=NULL, iov_len=0}], 2

# strace -fyp "$(pidof postgres)"
strace: Process 1110552 attached
futex(0x7fffecd311b8, FUTEX_WAIT, 4294967295, NULL

# ps aux | grep Setup
nixbld16 1107329 0.0 0.1 1073757344 30608 ? Sl 08:23 0:00 ./Setup test

# strace -fyp "$(pidof 'Setup')"
strace: Process 1107329 attached with 5 threads
[pid 1107334] restart_syscall(<... resuming interrupted read ...>
[pid 1107344] futex(0x7ffff7e0793c, FUTEX_WAIT_PRIVATE, 2, NULL
[pid 1107329] futex(0x7fffffff9dbc, FUTEX_WAIT_PRIVATE, 2, NULL
[pid 1107331] futex(0x7ffff7e98a0c, FUTEX_WAIT_PRIVATE, 2, NULL
[pid 1107333] futex(0x7ffff7e98a0c, FUTEX_WAKE_PRIVATE, 1
[pid 1107331] <... futex resumed>) = -1 EAGAIN (Resource temporarily unavailable)
[pid 1107333] <... futex resumed>) = 0
[pid 1107333] epoll_pwait(4,
[pid 1107331] read(3,
[pid 1107333] <... epoll_pwait resumed>[], 64, 0, NULL, 8) = 0
[pid 1107331] <... read resumed>"\256\1\0\0\0\0\0\0", 8) = 8
[pid 1107333] epoll_pwait(4,
[pid 1107331] read(3,
[pid 1107333] <... epoll_pwait resumed>[], 64, 0, NULL, 8) = 0
[pid 1107333] epoll_pwait(4,
[pid 1107331] <... read resumed>"\1\0\0\0\0\0\0\0", 8) = 8
[pid 1107331] read(3, "\1\0\0\0\0\0\0\0", 8) = 8
[pid 1107331] read(3, "\1\0\0\0\0\0\0\0", 8) = 8
... (the read(3, "\1\0\0\0\0\0\0\0", 8) = 8 line repeats ~30 times) ...
[pid 1107331] write(10, "\377\0\0\0\0\0\0\0", 8) = 8
[pid 1107334] <... restart_syscall resumed>) = 1
[pid 1107331] read(3,
[pid 1107334] read(10, "\377\0\0\0\0\0\0\0", 8) = 8
[pid 1107334] futex(0x7ffff7e0793c, FUTEX_WAKE_PRIVATE, 1) = 1
[pid 1107344] <... futex resumed>) = 0
[pid 1107334] poll([{fd=8, events=POLLIN}, {fd=10, events=POLLIN}], 2, -1
[pid 1107344] rt_sigprocmask(SIG_BLOCK, [INT], [], 8) = 0
[pid 1107331] <... read resumed>"\1\0\0\0\0\0\0\0", 8) = 8
[pid 1107331] read(3,
[pid 1107344] rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
[pid 1107344] futex(0x7ffff7e0793c, FUTEX_WAIT_PRIVATE, 2, NULL
[pid 1107331] <... read resumed>"\1\0\0\0\0\0\0\0", 8) = 8
[pid 1107331] futex(0x7ffff7e98a0c, FUTEX_WAIT_PRIVATE, 2, NULL
```

The CI run in which this occurred is here.

If I were to make a super rough guess at what's happening, I'd say there's a code path that allows some postgres process to die (for whatever reason), but that information isn't propagated up (or is blocked on some IO pipe), so the other processes are never shut down and no error message is displayed.
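To illustrate the idea (this is a hedged sketch, not tmp-postgres's actual code): one common way to make a child crash propagate is to race the blocking action against the child's exit, so a dead process turns into an exception instead of a read that blocks forever. `withChildWatched` is a hypothetical helper; `race` comes from the `async` package and `waitForProcess` from `process`.

```haskell
import Control.Concurrent.Async (race)
import System.Process (ProcessHandle, waitForProcess)

-- Hypothetical helper: run `action`, but abort with an error if the
-- watched child process (e.g. the bootstrap postgres) exits first.
withChildWatched :: ProcessHandle -> IO a -> IO a
withChildWatched ph action = do
  result <- race (waitForProcess ph) action
  case result of
    -- The child died before the action finished: surface it instead of hanging.
    Left code -> fail ("child process exited unexpectedly: " ++ show code)
    Right a   -> pure a
```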


This is really just a drive-by issue report, as I'm not currently a user of tmp-postgres, nor do I have a good understanding of its workings.

But maybe it can help you catch an unhandled async exception or some other race condition, thus improving the package.

Cheers!

jfischoff commented 3 years ago

So just a few quick thoughts. The htop output shows a postgres process that is a child of initdb. This situation occurs early on, during creation of the initial db cluster: that postgres instance exists only for bootstrapping and is not the postgres process that the tests will communicate with.

Here is where the initdb is called: https://github.com/jfischoff/tmp-postgres/blob/593e3ebcb7643afd6095f6269de3552d01c7ff40/src/Database/Postgres/Temp/Internal/Core.hs#L317

I would run tmp-postgres with full verbosity to get all of the output from initdb.

You can do that with this config: https://github.com/jfischoff/tmp-postgres/blob/593e3ebcb7643afd6095f6269de3552d01c7ff40/src/Database/Postgres/Temp/Internal.hs#L257
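For reference, a minimal sketch of what that could look like, assuming the linked config is the `verboseConfig` exported from `Database.Postgres.Temp` (check your version's haddocks for the exact name and signature):

```haskell
{- Hedged sketch: start tmp-postgres with the all-output-enabled config so
   initdb's stdout/stderr are surfaced. `verboseConfig`, `withConfig`, and
   `toConnectionString` are assumed exports of Database.Postgres.Temp in
   the version under test. -}
import Database.Postgres.Temp (toConnectionString, verboseConfig, withConfig)

main :: IO ()
main = do
  -- withConfig returns Left on a StartError, Right on success.
  result <- withConfig verboseConfig $ \db ->
    -- The test body goes here; `db` is the running temporary cluster.
    print (toConnectionString db)
  either (fail . show) pure result
```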

Although I would assume the output would already be visible in the strace.