quickwit-oss / quickwit

Cloud-native search engine for observability. An open-source alternative to Datadog, Elasticsearch, Loki, and Tempo.
https://quickwit.io
Other
7.75k stars 310 forks source link

Janitor gets in a bad state after getting rate limited error from S3. #5242

Open fulmicoton opened 1 month ago

fulmicoton commented 1 month ago

as reported on airmail.

Noticed the janitor process has been having some trouble
2024-07-21T01:02:10.782Z ERROR delete_splits_marked_for_deletion{index_uid=IndexUid { index_id: "logline.f0c15c0a85.2024-07-20.11", incarnation_id: Ulid(2081085527663609606974335987258375563) } updated_before_timestamp=1721521809}: quickwit_index_management::garbage_collection: Failed to delete split file(s) ["01J37X0RFJXJKESYER33K401Q2.split", "01J37Z3CPQ49FXTRWH0GZ20X4F.split", "01J37ZN4JH74KJGYDBFFC17C03.split", "01J37ZHV2EJPVCRXGWHZZDF9AX.split", "01J37YER0XZAKD4Y40F4TR0FG8.split", and 995 more] from storage. error=Some(StorageError { kind: Service, source: service error: unhandled error (SlowDown): Error { code: "SlowDown", message: "Please reduce your request rate.", aws_request_id: "AG2P4X4YBFXFX3PQ", s3_extended_request_id: "RW+r7PFei4MowizS/PIrBRxbCUOTGbz/XX8RCi00sik28bFzl3uQNzSJ+56g2n05oyCN+R+hQ2Y=" } (ServiceError(ServiceError { source: Unhandled(Unhandled { source: ErrorMetadata { code: Some("SlowDown"), message: Some("Please reduce your request rate."), extras: Some({"aws_request_id": "AG2P4X4YBFXFX3PQ", "s3_extended_request_id": "RW+r7PFei4MowizS/PIrBRxbCUOTGbz/XX8RCi00sik28bFzl3uQNzSJ+56g2n05oyCN+R+hQ2Y="}) }, meta: ErrorMetadata { code: Some("SlowDown"), message: Some("Please reduce your request rate."), extras: Some({"aws_request_id": "AG2P4X4YBFXFX3PQ", "s3_extended_request_id": "RW+r7PFei4MowizS/PIrBRxbCUOTGbz/XX8RCi00sik28bFzl3uQNzSJ+56g2n05oyCN+R+hQ2Y="}) } }), raw: Response { status: StatusCode(503), headers: Headers { headers: {"x-amz-request-id": HeaderValue { _private: H0("AG2P4X4YBFXFX3PQ") }, "x-amz-id-2": HeaderValue { _private: H0("RW+r7PFei4MowizS/PIrBRxbCUOTGbz/XX8RCi00sik28bFzl3uQNzSJ+56g2n05oyCN+R+hQ2Y=") }, "content-type": HeaderValue { _private: H0("application/xml") }, "transfer-encoding": HeaderValue { _private: H0("chunked") }, "date": HeaderValue { _private: H0("Sun, 21 Jul 2024 01:02:10 GMT") }, "server": HeaderValue { _private: H0("AmazonS3") }, "connection": HeaderValue { _private: H0("close") }} }, body: SdkBody { inner: Once(Some(b"<Error><Code>SlowDown</Code><Message>Please reduce your request rate.</Message><RequestId>AG2P4X4YBFXFX3PQ</RequestId><HostId>RW+r7PFei4MowizS/PIrBRxbCUOTGbz/XX8RCi00sik28bFzl3uQNzSJ+56g2n05oyCN+R+hQ2Y=</HostId></Error>")), retryable: true }, extensions: Extensions { extensions_02x: Extensions, extensions_1x: Extensions } } })) }) index_id="logline.f0c15c0a85.2024-07-20.11"
PSeitz commented 1 month ago

I did an experiment where I created stale splits and then started quickwit where delete requests to the S3 client would fail with 90% probability.

It did successfully retry until the splits were deleted and then returned in a idle state.


2024-07-23T08:08:00.462Z  INFO quickwit_janitor::actors::garbage_collector: loaded 3 indexes from the metastore
2024-07-23T08:08:00.462Z  INFO quickwit_storage::object_storage::s3_compatible_storage: using S3 credentials defined in storage config
2024-07-23T08:08:00.462Z  INFO quickwit_storage::object_storage::s3_compatible_storage: using S3 region defined in storage config region=us-west
2024-07-23T08:08:00.462Z  INFO quickwit_common: using environment variable `QW_S3_FORCE_PATH_STYLE_ACCESS` default value value=true
2024-07-23T08:08:00.462Z  INFO quickwit_storage::object_storage::s3_compatible_storage: using S3 endpoint defined in storage config or environment variable endpoint=http://127.0.0.1:9000
2024-07-23T08:08:00.462Z  INFO quickwit_storage::object_storage::s3_compatible_storage: using S3 credentials defined in storage config
2024-07-23T08:08:00.462Z  INFO quickwit_storage::object_storage::s3_compatible_storage: using S3 region defined in storage config region=us-west
2024-07-23T08:08:00.462Z  INFO quickwit_common: using environment variable `QW_S3_FORCE_PATH_STYLE_ACCESS` default value value=true
2024-07-23T08:08:00.462Z  INFO quickwit_storage::object_storage::s3_compatible_storage: using S3 endpoint defined in storage config or environment variable endpoint=http://127.0.0.1:9000
2024-07-23T08:08:00.462Z  INFO quickwit_storage::object_storage::s3_compatible_storage: using S3 credentials defined in storage config
2024-07-23T08:08:00.462Z  INFO quickwit_storage::object_storage::s3_compatible_storage: using S3 region defined in storage config region=us-west
2024-07-23T08:08:00.462Z  INFO quickwit_common: using environment variable `QW_S3_FORCE_PATH_STYLE_ACCESS` default value value=true
2024-07-23T08:08:00.462Z  INFO quickwit_storage::object_storage::s3_compatible_storage: using S3 endpoint defined in storage config or environment variable endpoint=http://127.0.0.1:9000
[quickwit-storage/src/object_storage/s3_compatible_storage.rs:804:16] rand::random::<f64>() = 0.41916115486052985
[quickwit-storage/src/object_storage/s3_compatible_storage.rs:804:46] val = 0.9
2024-07-23T08:08:00.464Z  WARN delete_splits_marked_for_deletion{index_uid=IndexUid { index_id: "simian_chico", incarnation_id: Ulid(2081418938362966106885799749050468484) } updated_before_timestamp=1721722020}: quickwit_sto
rage::object_storage::s3_compatible_storage: Bulk delete FAKE ERR
2024-07-23T08:08:00.464Z ERROR delete_splits_marked_for_deletion{index_uid=IndexUid { index_id: "simian_chico", incarnation_id: Ulid(2081418938362966106885799749050468484) } updated_before_timestamp=1721722020}: quickwit_ind
ex_management::garbage_collection: Failed to delete split file(s) ["01J3F9GHTMMDJE28ZMQYNHJ3YK.split", "01J3F9FFENDSQ4MHVA6702PKDE.split", "01J3F93GRHZEJ26D5RPXFYFR2P.split", "01J3F8ZC8PS82YTE2NBQW7V752.split", "01J3F97TS9PY
E4ZYT0M7V524CV.split", and 205 more] from storage. error=None index_id="simian_chico"

2024-07-23T08:08:10.468Z  INFO quickwit_janitor::actors::garbage_collector: loaded 3 indexes from the metastore
2024-07-23T08:08:10.468Z  INFO quickwit_storage::object_storage::s3_compatible_storage: using S3 credentials defined in storage config
2024-07-23T08:08:10.468Z  INFO quickwit_storage::object_storage::s3_compatible_storage: using S3 region defined in storage config region=us-west
2024-07-23T08:08:10.468Z  INFO quickwit_common: using environment variable `QW_S3_FORCE_PATH_STYLE_ACCESS` default value value=true
2024-07-23T08:08:10.468Z  INFO quickwit_storage::object_storage::s3_compatible_storage: using S3 endpoint defined in storage config or environment variable endpoint=http://127.0.0.1:9000
2024-07-23T08:08:10.469Z  INFO quickwit_storage::object_storage::s3_compatible_storage: using S3 credentials defined in storage config
2024-07-23T08:08:10.469Z  INFO quickwit_storage::object_storage::s3_compatible_storage: using S3 region defined in storage config region=us-west
2024-07-23T08:08:10.469Z  INFO quickwit_common: using environment variable `QW_S3_FORCE_PATH_STYLE_ACCESS` default value value=true
2024-07-23T08:08:10.469Z  INFO quickwit_storage::object_storage::s3_compatible_storage: using S3 endpoint defined in storage config or environment variable endpoint=http://127.0.0.1:9000
2024-07-23T08:08:10.469Z  INFO quickwit_storage::object_storage::s3_compatible_storage: using S3 credentials defined in storage config
2024-07-23T08:08:10.469Z  INFO quickwit_storage::object_storage::s3_compatible_storage: using S3 region defined in storage config region=us-west
2024-07-23T08:08:10.469Z  INFO quickwit_common: using environment variable `QW_S3_FORCE_PATH_STYLE_ACCESS` default value value=true
2024-07-23T08:08:10.469Z  INFO quickwit_storage::object_storage::s3_compatible_storage: using S3 endpoint defined in storage config or environment variable endpoint=http://127.0.0.1:9000
[quickwit-storage/src/object_storage/s3_compatible_storage.rs:804:16] rand::random::<f64>() = 0.9919412888981723
[quickwit-storage/src/object_storage/s3_compatible_storage.rs:804:46] val = 0.9
2024-07-23T08:08:10.471Z  WARN delete_splits_marked_for_deletion{index_uid=IndexUid { index_id: "simian_chico", incarnation_id: Ulid(2081418938362966106885799749050468484) } updated_before_timestamp=1721722030}: quickwit_sto
rage::object_storage::s3_compatible_storage: Bulk delete PROCEED
2024-07-23T08:08:11.663Z  INFO delete_splits_marked_for_deletion{index_uid=IndexUid { index_id: "simian_chico", incarnation_id: Ulid(2081418938362966106885799749050468484) } updated_before_timestamp=1721722030}: quickwit_met
astore::metastore::file_backed::file_backed_index: deleted 0 splits from index index_id=simian_chico
2024-07-23T08:08:11.674Z  INFO quickwit_janitor::actors::garbage_collector: Janitor deleted {"01J3F96QRSV6PZTJY2Z1TXY5SB.split", "01J3F99B5HVHVX97H23KQJJEJQ.split", "01J3F96455T16RZKB7KW8QM2ND.split", "01J3F97N6MVAHEWDKJFM81
JE4T.split", "01J3F9127GHCMAD734T1GMNABF.split"} and 210 other splits. index_id=simian_chico num_deleted_splits=210
fulmicoton commented 1 month ago

@PSeitz Can you discuss with @esatterwhite to get details about the observed issue? Do you have access to airmail logs?

Once you get more info about the problem, can you edit the description of this ticket. Ping me if you get blocked on the fix.

PSeitz commented 1 month ago

Info from eric was that it stopped logging for several hours and he tries to reproduce. I don't have access to the logs