Upkeep service for aggregator & signer

Alenar commented 4 months ago

Content

This PR add a new service to mithril-aggregator and mithril-signer, the UpkeepService:

Responsible for upkeep tasks that must be done periodically
Run at epoch change:
- Aggregator: at the start of the Idle → Ready state machine transition, while the registration round is close.
- Signer: at the start of the Unregistered → Registered state machine transition.
It run the following tasks (but more can be added as it should not be limited to database upkeep):
- Vacuum the main database to compact it and reduce fragmentation.
- Execute a pragma wal_checkpoint(TRUNCATE) on both the main and the cardano transaction database to avoid infinite grow of the wal files.
Database upkeep tasks are skipped if a signed entity type is locked since there would be a high chance that a database is busy (like when the cardano transation preloader is running).
This supersede the startup vacuum that has been added by #1771.

Note: the vacuum is not run on the cardano transaction database as it can be so large that it takes more than an hour on the mainnet.

Pre-submit checklist

Branch
- [x] Tests are provided (if possible)
- [x] Crates versions are updated (if relevant)
- [x] CHANGELOG file is updated (if relevant)
- [x] Commit sequence broadly makes sense
- [x] Key commits have useful messages
PR
- [x] No clippy warnings in the CI
- [x] Self-reviewed the diff
- [x] Useful pull request description
- [x] Reviewer requested
Documentation
- [ ] Update README file (if relevant)
- [ ] Update documentation website (if relevant)
- [ ] Add dev blog post (if relevant)

Issue(s)

Closes #1707

github-actions[bot] commented 4 months ago

Test Results

4 files ± 0 51 suites ±0 8m 51s :stopwatch: -3s 1 114 tests +13 1 114 :white_check_mark: +13 0 :zzz: ±0 0 :x: ±0 1 262 runs +13 1 262 :white_check_mark: +13 0 :zzz: ±0 0 :x: ±0

Results for commit 35138ade. ± Comparison against base commit 5816ed46.

This pull request removes 1 and adds 14 tests. Note that renamed tests count towards both.

``` mithril-persistence ‑ sqlite::test::calling_vacuum_on_an_empty_in_memory_db_should_not_fail ``` ``` mithril-aggregator ‑ http_server::routes::reply::tests::test_server_error_convert_signer_registration_round_not_yet_opened_to_503 mithril-aggregator ‑ http_server::routes::reply::tests::test_server_error_convert_std_error_to_500_by_default mithril-aggregator ‑ http_server::routes::reply::tests::test_server_error_convert_wrapped_sqlite_busy_error_to_503 mithril-aggregator ‑ runtime::runner::tests::test_upkeep mithril-aggregator ‑ services::upkeep::tests::test_cleanup_database mithril-aggregator ‑ services::upkeep::tests::test_doesnt_cleanup_db_if_any_entity_is_locked mithril-common ‑ signed_entity_type_lock::tests::has_locked_entities mithril-persistence ‑ sqlite::cleaner::tests::cleanup_empty_file_without_wal_db_should_not_crash mithril-persistence ‑ sqlite::cleaner::tests::cleanup_empty_in_memory_db_should_not_crash mithril-persistence ‑ sqlite::cleaner::tests::test_truncate_wal … ```

:recycle: This comment has been updated with latest results.

Alenar commented 4 months ago

I'm wondering about what will happen during the cleanup for the REST API of the aggregator: will it be locked? Can we handle that situation with a 503 response?

@jpraynaud Currently if the repository layer fails because of an error (most likely a SQLITE_BUSY error in this case) the REST api returns a 500 response most of the time.

Returning another error code is possible:

To do so we need to rework the pub fn internal_server_error<T: Into<InternalServerError>>(message: T) -> Box<dyn warp::Reply> from mithril-aggregator/src/http_server/routes/reply.rs. It would only take StdError as parameter so it could downcast ref it to a sqlite crate error, check the error code, see if it correspond to an SQLITE_BUSY and if yes changes the response from a 500 to a 503.

What do you think ?

jpraynaud commented 4 months ago

I'm wondering about what will happen during the cleanup for the REST API of the aggregator: will it be locked? Can we handle that situation with a 503 response?

@jpraynaud Currently if the repository layer fails because of an error (most likely a [SQLITE_BUSY](https://www.sqlite.org/rescode.html#busy) error in this case) the REST api returns a 500 response most of the time.

Returning another error code is possible:

To do so we need to rework the pub fn internal_server_error<T: Into<InternalServerError>>(message: T) -> Box<dyn warp::Reply> from mithril-aggregator/src/http_server/routes/reply.rs. It would only take StdError as parameter so it could downcast ref it to a sqlite crate error, check the error code, see if it correspond to an SQLITE_BUSY and if yes changes the response from a 500 to a 503.

What do you think ?

I was thinking about a warp middleware that would return a 503 when the UpkeepService is running the cleanup (and that would be used on the routes which are impacted by the cleanup). But your solution can work as well :+1:

input-output-hk / mithril