tcdi / plrust

A Rust procedural language handler for PostgreSQL
PostgreSQL License
1.1k stars 32 forks source link

Statement timeout during `cargo build` leads to crash #413

Open jscheid opened 1 month ago

jscheid commented 1 month ago

Hi, thanks for all your hard work on plrust. So this is an interesting one. I've been hitting an "abnormal server process exit" and tracked it down to a statement timeout setting:

I had recently configured a default statement_timeout of 45000 (ms) and when I recompiled one of my plrust functions, it started downloading and recompiling external crates which took longer than that. Exactly after 45 seconds the build failed -- presumably because it received some signal from Postgres, which is expected of course -- but shouldn't it exit more cleanly, rather than taking down ~the whole worker process~ all worker processes?

Logs ``` WARNING: terminating connection because of crash of another server process DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory. HINT: In a moment you should be able to reconnect to the database and repeat your command. [...] PQconsumeInput() ERROR: 0: `cargo build` failed Location: /rustc/5680fa18feaa87f3ff04063800aec256c3d4b4be/library/core/src/convert/mod.rs:716 `cargo build` stderr: Updating crates.io index Downloading crates ... Downloaded thiserror-impl v1.0.62 Downloaded thiserror v1.0.62 Downloaded syn v2.0.71 Compiling thiserror v1.0.62 Compiling syn v2.0.71 Compiling getrandom v0.2.15 Compiling uuid v1.10.0 Compiling bindgen v0.68.1 Compiling serde_derive v1.0.204 Compiling thiserror-impl v1.0.62 Compiling enum-map-derive v0.17.0 Compiling enum-map v2.7.3 Compiling pest v2.7.11 Compiling semver-parser v0.10.2 Compiling serde v1.0.204 Compiling semver v0.11.0 Compiling rustc_version v0.3.3 Compiling atomic-traits v0.3.0 Compiling toml_datetime v0.6.6 Compiling serde_spanned v0.6.6 Compiling serde_json v1.0.120 Compiling toml_edit v0.22.15 Compiling serde_cbor v0.11.2 Compiling toml v0.8.14 Compiling cargo_toml v0.16.3 Compiling pgrx-pg-config v0.11.0 Compiling pgrx-pg-sys v0.11.0 Compiling pgrx v0.11.0 Compiling plrust-trusted-pgrx v1.2.7 Source Code: // SNIP // (I have the full error message here and can make it available if needed, // but I don't think it matters. The function doesn't declare any additional dependencies.) ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ BACKTRACE ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ ⋮ 2 frames hidden ⋮ 3: plrust::user_crate::build::FnBuild::build_internal::hcdeaf7e52dfecfae at : 4: plrust::plrust::compile_function::h90a18482c5d32d1b at : 5: pgrx_pg_sys::submodules::panic::run_guarded::hbd55bed297e7b964 at : 6: plrust_validator_wrapper at : 7: FunctionCall1Coll at : 8: OidFunctionCall1Coll at : 9: ProcedureCreate at : 10: CreateFunction at : 11: standard_ProcessUtility at : 12: pgrx_pg_sys::include::pg14::standard_ProcessUtility::he0425f28cedd16c0 at : 13: pgrx_pg_sys::submodules::panic::run_guarded::h2575082d6cd79498 at : 14: plrust::hooks::plrust_process_utility_hook::h95d150ca0faf3908 at : 15: PortalRun at : 16: PostgresMain at : 17: PostmasterMain at : 18: main at : 19: __libc_start_main at : Run with COLORBT_SHOW_HIDDEN=1 environment variable to disable frame filtering. Run with RUST_BACKTRACE=full to include source snippets. 2024-07-16 12:40:58.612 GMT [217] DETAIL: 0: pgrx_pg_sys::submodules::panic::register_pg_guard_panic_hook::{{closure}} 1: std::panicking::rust_panic_with_hook 2: std::panicking::begin_panic_handler::{{closure}} 3: std::sys_common::backtrace::__rust_end_short_backtrace 4: rust_begin_unwind 5: core::panicking::panic_fmt 6: pgrx_pg_sys::submodules::panic::run_guarded 7: plrust_validator_wrapper 8: FunctionCall1Coll 9: OidFunctionCall1Coll 10: ProcedureCreate 11: CreateFunction 12: 13: standard_ProcessUtility 14: pgrx_pg_sys::include::pg14::standard_ProcessUtility 15: pgrx_pg_sys::submodules::panic::run_guarded 16: plrust::hooks::plrust_process_utility_hook 17: 18: 19: PortalRun 20: 21: PostgresMain 22: 23: PostmasterMain 24: main 25: __libc_start_main 26: server closed the connection unexpectedly This probably means the server terminated abnormally before or while processing the request. ``` ``` psql# SELECT version(); ┌─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┐ │ version │ ├─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤ │ PostgreSQL 14.12 (Debian 14.12-1.pgdg110+1) on aarch64-unknown-linux-gnu, compiled by gcc (Debian 10.2.1-6) 10.2.1 20210110, 64-bit │ └─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘ ```

~Also, I'm not sure if I understand the error message fully, is it just the one worker process exiting or is this the equivalent of a full postmaster crash? The mention about possibly corrupted shared memory makes me think it's the latter. Unfortunately I can't test this as reliably anymore because the dependencies are now already downloaded and built (because I tested successfully with the statement timeout removed.)~

(I realized that I can test this easily by removing our prewarming code, and I can confirm that all workers are crashing.)

~Now, the most pressing question I'm facing is how worried I should be about the unclean exit. Could you shed some light on this? I can't think of a foolproof way to prevent this scenario from ever occurring in production so it would be good to know what the impact might be in that case.~

Now, the most pressing question I'm facing is whether this is to do with the specific way plrust is set up in our dev env or whether the same could happen in our RDS production environment. I'm going to be able to test this more later today, but perhaps you can shed any light on this?


This incident also brought to mind a whole other question: you'll notice in the logs that plrust is downloading and building external crates, this is of course the reason I ran into the timeout in the first place. However, we're priming the workdir (along these lines) and so this wouldn't be reflecting initial setup but rather it must be an implicit (automatic) upgrade of existing dependencies, if I'm not missing anything. (The plrust function that the logs are for isn't specifying any additional crates by the way.)

The fact that it's downloading a brand-new version of thiserror only released a few days ago (long after I've built the Docker image in question and even longer after plrust 1.27 was released) also speaks to this theory.

If this is caused by automatic upgrades, and not due to a misconfiguration (of the plrust installation) in our dev env, I can see a range of potential problems, from supply chain security, to server instability due to bugs in newer crate versions, to fluctuating duration of migrations, and potentially subtle differences in plrust function behavior depending on when exactly they were compiled.

Would this (automatic upgrading of crate versions) have something to do with the way our Dockerfile is installing plrust in the dev env, or would this also be expected behavior on RDS, say? Is there any configuration we could apply to pin crate versions so that they can only be upgraded manually?

eeeebbbbrrrr commented 1 month ago

(There's a lot going on in this issue -- in the future, please consider making one issue per topic/idea)

Crashing during at statement_timeout event

This is definitely not by design. :/ We'll have to look into this. Doesn't sound like it'll be hard to recreate

plrust is downloading and building external crates

This is what cargo does. You could adjust the plrust.path_override GUC to get your own "cargo" wrapper script on the PATH that maybe adds --offline or any other arguments you may want. https://plrust.io/config-pg.html#plrustpath_override-string

I don't have a "best practices" suggestion for you here. All of this really depends on what your requirements are. I do think, however, pl/rust provides enough knobs for you to direct the way you need.

@workingjubilee might have some ideas here...

jscheid commented 1 month ago

in the future, please consider making one issue per topic/idea

You're right, my apologies. Would you like me to open a separate issue for discussing the auto-upgrading question?

You could adjust the plrust.path_override GUC

Something tells me that's not possible on RDS, but I'll investigate this later today with our ops team, thanks for the pointers so far.