OrioleDB – building a modern cloud-native storage engine (... and solving some PostgreSQL wicked problems)  🇺🇦
https://orioledb.com

Checkpointer gets stuck in case of error on S3 workers #373

Open za-arthur opened 1 week ago

za-arthur commented 1 week ago

Tested on the main branch at commit https://github.com/orioledb/orioledb/commit/bd8e32d0ebaafd0ea3ec3074233b65167f3b6fb7. Possibly related: https://github.com/orioledb/orioledb/issues/326

If an error occurs within an S3 worker, the checkpointer process gets stuck, and killall postgres doesn't terminate the instance.

To reproduce the issue easily, one can apply the following patch:

diff --git a/src/s3/worker.c b/src/s3/worker.c
index 8d843c3..355befd 100644
--- a/src/s3/worker.c
+++ b/src/s3/worker.c
@@ -106,6 +106,8 @@ s3process_task(uint64 taskLocation)
                if (filename[0] == '.' && filename[1] == '/')
                        filename += 2;

+               elog(ERROR, "test error");
+
                objectname = psprintf("data/%u/%s",
                                                          task->typeSpecific.writeFile.chkpNum,
                                                          filename);

and run the test t.s3_test.S3Test.test_s3_check_control. The test will get stuck; after executing killall postgres, only the main postmaster and checkpointer processes remain. The backtrace of the checkpointer process:

  * frame #0: 0x000000018f39befc libsystem_kernel.dylib`kevent + 8
    frame #1: 0x0000000104ced220 postgres`WaitEventSetWait [inlined] WaitEventSetWaitBlock(set=0x0000000144809888, cur_timeout=-1, occurred_events=0x000000016b451878, nevents=1) at latch.c:1695:7 [opt]
    frame #2: 0x0000000104ced1d0 postgres`WaitEventSetWait(set=0x0000000144809888, timeout=-1, occurred_events=<unavailable>, nevents=1, wait_event_info=<unavailable>) at latch.c:1481:8 [opt]
    frame #3: 0x0000000104cecffc postgres`WaitLatch(latch=<unavailable>, wakeEvents=33, timeout=-1, wait_event_info=134217760) at latch.c:513:6 [opt]
    frame #4: 0x0000000104cfa944 postgres`ConditionVariableTimedSleep(cv=0x000000011889afa8, timeout=-1, wait_event_info=134217760) at condition_variable.c:163:10 [opt]
    frame #5: 0x0000000104cfa82c postgres`ConditionVariableSleep(cv=<unavailable>, wait_event_info=<unavailable>) at condition_variable.c:100:9 [opt] [artificial]
    frame #6: 0x000000010624addc orioledb.dylib`s3_queue_wait_for_location(location=4620) at queue.c:341:3 [opt]
    frame #7: 0x0000000106246d04 orioledb.dylib`s3_perform_backup(flags=108, maxLocation=4620) at checkpoint.c:284:2 [opt]
    frame #8: 0x000000010622f578 orioledb.dylib`o_perform_checkpoint(redo_pos=22068768, flags=108) at checkpoint.c:1345:3 [opt]
    frame #9: 0x0000000104a58a08 postgres`CheckPointGuts(checkPointRedo=22068768, flags=108) at xlog.c:7061:3 [opt]
    frame #10: 0x0000000104a58290 postgres`CreateCheckPoint(flags=108) at xlog.c:6731:2 [opt]
    frame #11: 0x0000000104c7ee00 postgres`CheckpointerMain at checkpointer.c:479:5 [opt]
    frame #12: 0x0000000104c7d0a4 postgres`AuxiliaryProcessMain(auxtype=CheckpointerProcess) at auxprocess.c:153:4 [opt]
    frame #13: 0x0000000104c82138 postgres`StartChildProcess(type=CheckpointerProcess) at postmaster.c:5343:3 [opt]
    frame #14: 0x0000000104c81bb0 postgres`PostmasterMain(argc=<unavailable>, argv=<unavailable>) at postmaster.c:1436:21 [opt]
    frame #15: 0x0000000104bab224 postgres`main(argc=3, argv=0x00006000031e5460) at main.c:198:3 [opt]
    frame #16: 0x000000018f054274 dyld`start + 2840
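
What the backtrace suggests (a simplified sketch, not the actual orioledb source; the shared-state names below are assumptions for illustration): the checkpointer sleeps on a condition variable in s3_queue_wait_for_location() until the workers' processed location reaches the requested one. If a worker dies on ERROR before completing its task, that location never advances and the condition variable is never broadcast, so the checkpointer waits forever:

#include "postgres.h"
#include "pgstat.h"
#include "port/atomics.h"
#include "storage/condition_variable.h"

/* hypothetical shared state, for illustration only */
typedef struct S3QueueShared
{
	pg_atomic_uint64 processedLocation; /* advanced by S3 workers after each task */
	ConditionVariable cv;               /* broadcast whenever the location advances */
} S3QueueShared;

static S3QueueShared *s3_queue_shared;

/* wait until workers have processed everything up to 'location' */
static void
s3_queue_wait_for_location_sketch(uint64 location)
{
	ConditionVariablePrepareToSleep(&s3_queue_shared->cv);
	while (pg_atomic_read_u64(&s3_queue_shared->processedLocation) < location)
		ConditionVariableSleep(&s3_queue_shared->cv, PG_WAIT_EXTENSION);
	ConditionVariableCancelSleep();
}
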
pashkinelfe commented 2 hours ago

I think we need to catch possible non-fatal errors in the subroutines called from s3process_task (e.g. s3_put_file, s3_put_empty_dir, s3_get_file, s3_get_file_part, etc.) and convert them to FATAL via PG_TRY() { ... } PG_CATCH() { elog(FATAL, ...); } PG_END_TRY().
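
A minimal sketch of that idea (not the actual orioledb code; the wrapper name and the function-pointer indirection are assumptions for illustration, standing in for calls like s3_put_file() whose real signatures aren't shown here):

#include "postgres.h"
#include "utils/memutils.h"

static void
s3_task_call_or_die(void (*subroutine) (void *arg), void *arg)
{
	MemoryContext oldcontext = CurrentMemoryContext;

	PG_TRY();
	{
		subroutine(arg);		/* may throw a non-fatal ERROR */
	}
	PG_CATCH();
	{
		ErrorData  *edata;

		/* Copy the pending error before cleaning up the error state */
		MemoryContextSwitchTo(oldcontext);
		edata = CopyErrorData();
		FlushErrorState();

		/*
		 * Re-report at FATAL so the S3 worker exits visibly instead of
		 * leaving the checkpointer waiting on the task queue forever.
		 */
		elog(FATAL, "error while processing S3 task: %s", edata->message);
	}
	PG_END_TRY();
}
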

I see there are only a few places where we can get a non-fatal error, i.e. in read_file_part() and write_file_part().

@akorotkov do you like this approach?