input-output-hk / mithril

Stake-based threshold multi-signatures protocol
https://mithril.network
Apache License 2.0

`testing-preview` and `testing-sanchonet` aggregators panic with `FOREIGN KEY constraint failed` error #2120

jpraynaud closed this issue 3 days ago

jpraynaud commented 4 days ago

Why

The testing-preview and testing-sanchonet networks are down with the following panic:

{"msg":">> open_signer_registration_round","v":0,"name":"mithril-aggregator","level":20,"time":"2024-11-17T15:02:28.980533893Z","hostname":"7e7d34d1d6cc","pid":1,"src":"AggregatorRunner","time_point":"TimePoint {\n    epoch: Epoch(\n        521,\n    ),\n    immutable_file_number: 10429,\n    chain_point: ChainPoint {\n        slot_number: SlotNumber(\n            45066688,\n        ),\n        block_number: BlockNumber(\n            2242439,\n        ),\n        block_hash: \"9dd84673cb69436be7db81ec55499f012d3a0d339f727ee8b558dccc4777cfea\",\n    },\n}"}
thread 'tokio-runtime-worker' panicked at /home/runner/work/mithril/mithril/internal/mithril-persistence/src/sqlite/cursor.rs:35:51:
FOREIGN KEY constraint failed (code 19)
stack backtrace:
   0: rust_begin_unwind
   1: core::panicking::panic_fmt
   2: <mithril_persistence::sqlite::cursor::EntityCursor<T> as core::iter::traits::iterator::Iterator>::next
   3: mithril_persistence::sqlite::connection_extensions::ConnectionExtensions::fetch_first
   4: <mithril_aggregator::database::repository::signer_registration_store::SignerRegistrationStore as mithril_aggregator::store::verification_key_store::VerificationKeyStorer>::prune_verification_keys::{{closure}}
   5: <mithril_aggregator::signer_registerer::MithrilSignerRegisterer as mithril_aggregator::signer_registerer::SignerRegistrationRoundOpener>::open_registration_round::{{closure}}
   6: <mithril_aggregator::runtime::runner::AggregatorRunner as mithril_aggregator::runtime::runner::AggregatorRunnerTrait>::open_signer_registration_round::{{closure}}
   7: mithril_aggregator::runtime::state_machine::AggregatorRuntime::cycle::{{closure}}
   8: mithril_aggregator::commands::serve_command::ServeCommand::execute::{{closure}}::{{closure}}
   9: tokio::runtime::task::core::Core<T,S>::poll
  10: tokio::runtime::task::harness::Harness<T,S>::poll
  11: tokio::runtime::scheduler::multi_thread::worker::Context::run_task
  12: tokio::runtime::scheduler::multi_thread::worker::Context::run
  13: tokio::runtime::context::runtime::enter_runtime
  14: tokio::runtime::scheduler::multi_thread::worker::run
  15: <tokio::runtime::blocking::task::BlockingTask<T> as core::future::future::Future>::poll
  16: tokio::runtime::task::core::Core<T,S>::poll
  17: tokio::runtime::task::harness::Harness<T,S>::poll
  18: tokio::runtime::blocking::pool::Inner::run
note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.
{"msg":"shutting down HTTP server after receiving signal","v":0,"name":"mithril-aggregator","level":40,"time":"2024-11-17T15:02:28.983355811Z","hostname":"7e7d34d1d6cc","pid":1,"src":"MetricsServer"}
Error: task 14 panicked with message "FOREIGN KEY constraint failed (code 19)"

Stack backtrace:
   0: anyhow::error::<impl core::convert::From<E> for anyhow::Error>::from
   1: mithril_aggregator::commands::serve_command::ServeCommand::execute::{{closure}}
   2: tokio::runtime::park::CachedParkThread::block_on
   3: mithril_aggregator::main
   4: std::sys::backtrace::__rust_begin_short_backtrace
   5: std::rt::lang_start::{{closure}}
   6: std::rt::lang_start_internal
   7: main
   8: __libc_start_main
   9: _start

What

Investigate and fix the problem that prevents the aggregator from starting in testing-preview and testing-sanchonet.

How

jpraynaud commented 4 days ago

Following a problem with #1957, we performed a manual operation on the database with sqlite3. We deleted some open messages, but the delete cascade was not executed: sqlite3 does not enforce foreign keys by default, so `ON DELETE CASCADE` actions are skipped unless `PRAGMA foreign_keys = ON` is set for the session.
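For reference, the skipped cascade can be reproduced with a minimal sketch using Python's built-in `sqlite3` module. The two-table schema below is a hypothetical simplification for illustration, not the aggregator's actual schema, and `remaining_signatures` is an illustrative helper name.

```python
import sqlite3

# Hypothetical, simplified schema (an assumption, not the real aggregator schema).
SCHEMA = """
CREATE TABLE open_message (open_message_id TEXT PRIMARY KEY);
CREATE TABLE single_signature (
    open_message_id TEXT NOT NULL
        REFERENCES open_message(open_message_id) ON DELETE CASCADE,
    signer_id TEXT NOT NULL
);
"""


def remaining_signatures(enforce_foreign_keys: bool) -> int:
    """Delete an open message and count the single signatures left behind."""
    conn = sqlite3.connect(":memory:")
    # Enforcement is a per-connection setting; set it explicitly either way.
    setting = "ON" if enforce_foreign_keys else "OFF"
    conn.execute(f"PRAGMA foreign_keys = {setting}")
    conn.executescript(SCHEMA)
    conn.execute("INSERT INTO open_message VALUES ('om-1')")
    conn.execute("INSERT INTO single_signature VALUES ('om-1', 'signer-a')")
    conn.execute("DELETE FROM open_message WHERE open_message_id = 'om-1'")
    (count,) = conn.execute("SELECT COUNT(*) FROM single_signature").fetchone()
    conn.close()
    return count


if __name__ == "__main__":
    print("without enforcement:", remaining_signatures(False))
    print("with enforcement:", remaining_signatures(True))
```

Without enforcement the signature row survives the delete of its open message; with enforcement the cascade removes it.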

The pruning of the verification keys then failed: some single signatures associated with the deleted open messages were still in the database, and they caused the constraint error that made the node panic.
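A minimal sketch of how leftover rows can make a later prune fail, again with Python's `sqlite3`. It assumes a hypothetical schema in which single signatures also reference signer registrations without a cascade, and it assumes the pruning connection enforces foreign keys (as the panic suggests). Table, column, and function names are illustrative only.

```python
import sqlite3

# Hypothetical schema (an assumption): signatures reference both tables,
# but only the open_message reference cascades on delete.
SCHEMA = """
CREATE TABLE open_message (open_message_id TEXT PRIMARY KEY);
CREATE TABLE signer_registration (
    signer_id TEXT PRIMARY KEY,
    epoch INTEGER NOT NULL
);
CREATE TABLE single_signature (
    open_message_id TEXT NOT NULL
        REFERENCES open_message(open_message_id) ON DELETE CASCADE,
    signer_id TEXT NOT NULL
        REFERENCES signer_registration(signer_id)
);
"""


def prune_after_manual_delete() -> str:
    """Return the outcome of pruning registrations after an unenforced delete."""
    # isolation_level=None puts the connection in autocommit mode, so the
    # PRAGMA statements below are never issued inside an open transaction.
    conn = sqlite3.connect(":memory:", isolation_level=None)
    conn.executescript(SCHEMA)
    conn.execute("INSERT INTO open_message VALUES ('om-1')")
    conn.execute("INSERT INTO signer_registration VALUES ('signer-a', 5)")
    conn.execute("INSERT INTO single_signature VALUES ('om-1', 'signer-a')")

    # Manual session: no enforcement, so the cascade is skipped.
    conn.execute("PRAGMA foreign_keys = OFF")
    conn.execute("DELETE FROM open_message")

    # Later, on a connection that enforces foreign keys, the leftover
    # signature still references the registration being pruned.
    conn.execute("PRAGMA foreign_keys = ON")
    try:
        conn.execute("DELETE FROM signer_registration WHERE epoch < 10")
    except sqlite3.IntegrityError as error:
        return str(error)
    finally:
        conn.close()
    return "pruned"


if __name__ == "__main__":
    print(prune_after_manual_delete())
```

In this sketch the prune is rejected with the same SQLite message as in the backtrace; whether the real schema fails on this exact relation is not confirmed here.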

We manually deleted the remaining single signatures, and the aggregators of testing-preview and testing-sanchonet resumed successfully.
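One possible way to locate and remove such leftovers is sketched below. It is not the exact statements we ran: the schema and the `delete_orphaned_signatures` helper are hypothetical. `PRAGMA foreign_key_check` is a standard SQLite pragma that reports rows violating a foreign key, even when enforcement is off.

```python
import sqlite3

# Hypothetical, simplified schema (an assumption, not the real aggregator schema).
SCHEMA = """
CREATE TABLE open_message (open_message_id TEXT PRIMARY KEY);
CREATE TABLE single_signature (
    open_message_id TEXT NOT NULL
        REFERENCES open_message(open_message_id) ON DELETE CASCADE,
    signer_id TEXT NOT NULL
);
"""


def count_violations(conn: sqlite3.Connection) -> int:
    """Count rows reported by SQLite's foreign key consistency check."""
    return len(conn.execute("PRAGMA foreign_key_check").fetchall())


def delete_orphaned_signatures(conn: sqlite3.Connection) -> int:
    """Delete signatures whose open message no longer exists."""
    cursor = conn.execute(
        "DELETE FROM single_signature "
        "WHERE open_message_id NOT IN "
        "(SELECT open_message_id FROM open_message)"
    )
    return cursor.rowcount


def demo() -> tuple:
    conn = sqlite3.connect(":memory:", isolation_level=None)  # autocommit
    conn.executescript(SCHEMA)
    conn.execute("PRAGMA foreign_keys = OFF")  # mimic the manual session
    conn.execute("INSERT INTO open_message VALUES ('om-1'), ('om-2')")
    conn.execute(
        "INSERT INTO single_signature VALUES "
        "('om-1', 'signer-a'), ('om-2', 'signer-a')"
    )
    conn.execute("DELETE FROM open_message WHERE open_message_id = 'om-1'")

    before = count_violations(conn)
    deleted = delete_orphaned_signatures(conn)
    after = count_violations(conn)
    conn.close()
    return before, deleted, after


if __name__ == "__main__":
    print(demo())
```

Running the check before and after the cleanup gives a quick confirmation that no dangling references remain. For future manual operations, setting `PRAGMA foreign_keys = ON` at the start of the sqlite3 session should let the cascade run as intended.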