input-output-hk / mithril

Stake-based threshold multi-signatures protocol
https://mithril.network
Apache License 2.0
128 stars 41 forks source link

Upkeep service for aggregator & signer #1791

Closed Alenar closed 4 months ago

Alenar commented 4 months ago

Content

This PR add a new service to mithril-aggregator and mithril-signer, the UpkeepService:

Note: the vacuum is not run on the cardano transaction database as it can be so large that it takes more than an hour on the mainnet.

Pre-submit checklist

Issue(s)

Closes #1707

github-actions[bot] commented 4 months ago

Test Results

    4 files  ± 0     51 suites  ±0   8m 51s :stopwatch: -3s 1 114 tests +13  1 114 :white_check_mark: +13  0 :zzz: ±0  0 :x: ±0  1 262 runs  +13  1 262 :white_check_mark: +13  0 :zzz: ±0  0 :x: ±0 

Results for commit 35138ade. ± Comparison against base commit 5816ed46.

This pull request removes 1 and adds 14 tests. Note that renamed tests count towards both. ``` mithril-persistence ‑ sqlite::test::calling_vacuum_on_an_empty_in_memory_db_should_not_fail ``` ``` mithril-aggregator ‑ http_server::routes::reply::tests::test_server_error_convert_signer_registration_round_not_yet_opened_to_503 mithril-aggregator ‑ http_server::routes::reply::tests::test_server_error_convert_std_error_to_500_by_default mithril-aggregator ‑ http_server::routes::reply::tests::test_server_error_convert_wrapped_sqlite_busy_error_to_503 mithril-aggregator ‑ runtime::runner::tests::test_upkeep mithril-aggregator ‑ services::upkeep::tests::test_cleanup_database mithril-aggregator ‑ services::upkeep::tests::test_doesnt_cleanup_db_if_any_entity_is_locked mithril-common ‑ signed_entity_type_lock::tests::has_locked_entities mithril-persistence ‑ sqlite::cleaner::tests::cleanup_empty_file_without_wal_db_should_not_crash mithril-persistence ‑ sqlite::cleaner::tests::cleanup_empty_in_memory_db_should_not_crash mithril-persistence ‑ sqlite::cleaner::tests::test_truncate_wal … ```

:recycle: This comment has been updated with latest results.

Alenar commented 4 months ago

I'm wondering about what will happen during the cleanup for the REST API of the aggregator: will it be locked? Can we handle that situation with a 503 response?

@jpraynaud Currently if the repository layer fails because of an error (most likely a SQLITE_BUSY error in this case) the REST api returns a 500 response most of the time.

Returning another error code is possible:

To do so we need to rework the pub fn internal_server_error<T: Into<InternalServerError>>(message: T) -> Box<dyn warp::Reply> from mithril-aggregator/src/http_server/routes/reply.rs. It would only take StdError as parameter so it could downcast ref it to a sqlite crate error, check the error code, see if it correspond to an SQLITE_BUSY and if yes changes the response from a 500 to a 503.

What do you think ?

jpraynaud commented 4 months ago

I'm wondering about what will happen during the cleanup for the REST API of the aggregator: will it be locked? Can we handle that situation with a 503 response?

@jpraynaud Currently if the repository layer fails because of an error (most likely a [SQLITE_BUSY](https://www.sqlite.org/rescode.html#busy) error in this case) the REST api returns a 500 response most of the time.

Returning another error code is possible:

To do so we need to rework the pub fn internal_server_error<T: Into<InternalServerError>>(message: T) -> Box<dyn warp::Reply> from mithril-aggregator/src/http_server/routes/reply.rs. It would only take StdError as parameter so it could downcast ref it to a sqlite crate error, check the error code, see if it correspond to an SQLITE_BUSY and if yes changes the response from a 500 to a 503.

What do you think ?

I was thinking about a warp middleware that would return a 503 when the UpkeepService is running the cleanup (and that would be used on the routes which are impacted by the cleanup). But your solution can work as well :+1: