scsibug / nostr-rs-relay

Mirror of https://sr.ht/~gheartsfield/nostr-rs-relay/
MIT License

Relay dies and doesn't come back up (PostgreSQL) #196

Closed: tfreedman closed this issue 2 months ago

tfreedman commented 6 months ago

Just testing nostr-rs-relay, and I threw some traffic at it. I managed to get it into a state where the relay is now unusable: it never resumes listening for new traffic, even when restarted. Here's the error message:

thread 'tokio-ws-4' panicked at src/repo/postgres.rs:200:34:
called `Result::unwrap()` on an `Err` value: Database(PgDatabaseError { severity: Error, code: "54000", message: "index row size 4544 exceeds btree version 4 maximum 2704 for index \"tag_value_idx\"", detail: Some("Index row references tuple (72445,9) in relation \"tag\"."), hint: Some("Values larger than 1/3 of a buffer page cannot be indexed.\nConsider a function index of an MD5 hash of the value, or use full text indexing."), position: None, where: None, schema: Some("public"), table: Some("tag"), column: None, data_type: None, constraint: Some("tag_value_idx"), file: Some("nbtutils.c"), line: Some(2678), routine: Some("_bt_check_third_page") })
stack backtrace:
   0: rust_begin_unwind
             at /build/rustc-kAv1jW/rustc-1.75.0+dfsg0ubuntu1~bpo0/library/std/src/panicking.rs:645:5
   1: core::panicking::panic_fmt
             at /build/rustc-kAv1jW/rustc-1.75.0+dfsg0ubuntu1~bpo0/library/core/src/panicking.rs:72:14
   2: core::result::unwrap_failed
             at /build/rustc-kAv1jW/rustc-1.75.0+dfsg0ubuntu1~bpo0/library/core/src/result.rs:1653:5
   3: <nostr_rs_relay::repo::postgres::PostgresRepo as nostr_rs_relay::repo::NostrRepo>::write_event::{{closure}}
   4: nostr_rs_relay::db::db_writer::{{closure}}
   5: tokio::runtime::task::core::Core<T,S>::poll
   6: tokio::runtime::task::harness::Harness<T,S>::poll
   7: tokio::runtime::scheduler::multi_thread::worker::Context::run_task
   8: tokio::runtime::scheduler::multi_thread::worker::Context::run
   9: tokio::runtime::context::runtime::enter_runtime
  10: tokio::runtime::scheduler::multi_thread::worker::run
  11: <tokio::runtime::blocking::task::BlockingTask<T> as core::future::future::Future>::poll
  12: tokio::runtime::task::core::Core<T,S>::poll
  13: tokio::runtime::task::harness::Harness<T,S>::poll
  14: tokio::runtime::blocking::pool::Inner::run
scsibug commented 5 months ago

Thanks for the report. It looks like we should never have put an index directly on the tag values on the PG side, since there is a maximum index row size that large values can easily hit.

scsibug commented 5 months ago

I don't run the postgres backend myself, but I'll try to get a schema change in that drops this index.

In the meantime, @tfreedman, if you want, you could try dropping the "tag_value_idx" index and see if that resolves the problem and allows the relay to start up.
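
For concreteness, here is a minimal sketch of that one-off step, assuming sqlx (which the backend appears to use, given the `PgDatabaseError` above) and a placeholder connection URL. Running `DROP INDEX IF EXISTS tag_value_idx;` directly in psql is equivalent:

```rust
// Hypothetical one-off maintenance task (not part of the relay itself):
// drop the oversized index so event writes stop panicking.
// Assumes sqlx with the "postgres" and a tokio runtime feature enabled,
// plus tokio; the connection URL is a placeholder.
use sqlx::postgres::PgPoolOptions;

#[tokio::main]
async fn main() -> Result<(), sqlx::Error> {
    let pool = PgPoolOptions::new()
        .max_connections(1)
        .connect("postgres://nostr:nostr@localhost/nostr") // placeholder URL
        .await?;

    // Equivalent to running `DROP INDEX IF EXISTS tag_value_idx;` in psql.
    sqlx::query("DROP INDEX IF EXISTS tag_value_idx")
        .execute(&pool)
        .await?;

    println!("tag_value_idx dropped (if it existed)");
    Ok(())
}
```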

We do need some kind of index here; at first I (wrongly) thought it was covered by a different one. We may want to limit tag value size, or hash large values for the index.
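
One possible shape for the hashing idea, following the hint in the Postgres error above ("Consider a function index of an MD5 hash of the value"). This is only a sketch: the migration, the new index name, and the assumption that the indexed column is `value` are all illustrative, not a committed change:

```rust
// Illustrative only: replace the direct btree on the raw tag value with one
// on its MD5 hash, which has a fixed small size no matter how long the value
// is. The column name `value` and the new index name are assumptions.
use sqlx::PgPool;

async fn reindex_tag_value(pool: &PgPool) -> Result<(), sqlx::Error> {
    sqlx::query("DROP INDEX IF EXISTS tag_value_idx")
        .execute(pool)
        .await?;
    sqlx::query("CREATE INDEX tag_value_md5_idx ON tag (md5(value))")
        .execute(pool)
        .await?;
    Ok(())
}
```

Note that equality lookups would then have to compare on `md5(value) = md5($1)` for the planner to use a functional index like this, so it's a real schema and query change rather than a drop-in fix.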

scsibug commented 5 months ago

A better short-term fix will be to simply avoid unwrapping the transaction result, and just log an error for events with these very long tag values.
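
Roughly, that change would look like the following. This is a sketch only: the real code at src/repo/postgres.rs:200 isn't quoted in this thread, so the function signature, table columns, and logging are stand-ins, assuming sqlx and tracing:

```rust
// Sketch of turning the panic into a logged, recoverable error. Function
// signature, table columns, and logging are stand-ins; the actual code at
// src/repo/postgres.rs:200 is not quoted here.
use sqlx::{PgPool, Postgres, Transaction};
use tracing::warn;

async fn write_event_tags(
    pool: &PgPool,
    event_id: &[u8],
    tags: &[(String, String)],
) -> Result<(), sqlx::Error> {
    let mut tx: Transaction<'_, Postgres> = pool.begin().await?;
    for (name, value) in tags {
        // Previously this result was unwrap()ed, so an oversized index entry
        // panicked the tokio writer task. Logging and returning the error
        // rejects just this event instead of killing the writer.
        if let Err(e) = sqlx::query("INSERT INTO tag (event_id, \"name\", value) VALUES ($1, $2, $3)")
            .bind(event_id)
            .bind(name)
            .bind(value)
            .execute(&mut *tx)
            .await
        {
            warn!("tag insert failed, rejecting event: {:?}", e);
            tx.rollback().await?;
            return Err(e);
        }
    }
    tx.commit().await?;
    Ok(())
}
```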

v0l commented 5 months ago

@scsibug I think I had dropped this index too.

scsibug commented 5 months ago

I'm just starting to play with the postgres backend more. I was able to duplicate this issue, and then resolve it by preventing the unwrap. Now it returns a generic error to the client, which is much better than panicking the writing thread.

Is this index critical for performance? I would have thought so. But I think there is some opportunity for optimizing the indexes (especially the unique_constraint_name index, which is quite large) on the PG side.
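
One way to answer that empirically is to look at Postgres's standard statistics views. A hypothetical helper (not part of the relay) that lists index sizes and scan counts, assuming sqlx:

```rust
// Hypothetical helper (not part of the relay): list index sizes and scan
// counts from Postgres's standard statistics views, to see which indexes
// are actually being used before deciding what to keep or drop.
use sqlx::{PgPool, Row};

async fn report_index_usage(pool: &PgPool) -> Result<(), sqlx::Error> {
    let rows = sqlx::query(
        "SELECT indexrelname,
                pg_size_pretty(pg_relation_size(indexrelid)) AS size,
                idx_scan
           FROM pg_stat_user_indexes
          ORDER BY pg_relation_size(indexrelid) DESC",
    )
    .fetch_all(pool)
    .await?;

    for row in rows {
        let name: String = row.get("indexrelname");
        let size: String = row.get("size");
        // Treat a missing scan count as zero scans.
        let scans: i64 = row.try_get("idx_scan").unwrap_or(0);
        println!("{name}: {size}, {scans} scans");
    }
    Ok(())
}
```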

scsibug commented 2 months ago

Closing since the panic should be fixed.