quickwit-oss / quickwit

Cloud-native search engine for observability. An open-source alternative to Datadog, Elasticsearch, Loki, and Tempo.
https://quickwit.io

request AWS AssumeRoleWithWebIdentity API #4799

Open tuziben opened 7 months ago

tuziben commented 7 months ago

Describe the bug

Version: v0.8.0

In the last 24 hours, the searcher cluster called the AWS AssumeRoleWithWebIdentity API 34,934,189 times. From what I understand, Quickwit is supposed to cache credentials for an hour. However, we see a huge volume of requests arriving nearly every second. Below is the error log from a Quickwit pod.

2024-03-26T08:16:36.417Z ERROR fetch_docs: quickwit_search::fetch_docs: error when fetching docs in splits split_ids=["01HSWWX3VSY1K4C2AAYY6WG898", "01HSWWTHNAMAKHV7ZHH3M9PYZJ"] error=open-index-for-split

Caused by:
    0: failed to fetch hotcache and footer from s3://path for split `01HSWWTHNAMAKHV7ZHH3M9PYZJ`
    1: storage error(kind=Internal, source=failed to construct request: failed to load credentials from the credentials cache: an error occurred while loading credentials: service error: unhandled error: unhandled error: Error { code: "Throttling", message: "Rate exceeded", aws_request_id: "e08a1aa3-6d6c-487e-a6f2-" } (ConstructionFailure(ConstructionFailure { source: CredentialsStageError { source: ProviderError(ProviderError { source: ServiceError(ServiceError { source: Unhandled(Unhandled { source: ErrorMetadata { code: Some("Throttling"), message: Some("Rate exceeded"), extras: Some({"aws_request_id": "e08a1aa3-6d6c-487e-a6f2-"}) }, meta: ErrorMetadata { code: Some("Throttling"), message: Some("Rate exceeded"), extras: Some({"aws_request_id": "e08a1aa3-6d6c-487e-a6f2-"}) } }), raw: Response { inner: Response { status: 400, version: HTTP/1.1, headers: {"x-amzn-requestid": "e08a1aa3-6d6c-487e-a6f2-", "content-type": "text/xml", "content-length": "255", "date": "Tue, 26 Mar 2024 08:16:36 GMT", "connection": "close"}, body: SdkBody { inner: Once(Some(b"<ErrorResponse xmlns=\"https://sts.amazonaws.com/doc/2011-06-15/\">\n  <Error>\n    <Type>Sender</Type>\n    <Code>Throttling</Code>\n    <Message>Rate exceeded</Message>\n  </Error>\n  <RequestId>e08a1aa3-6d6c-487e-a6f2-</RequestId>\n</ErrorResponse>\n")), retryable: true } }, properties: SharedPropertyBag(Mutex { data: PropertyBag { contents: ["aws_types::SigningService", "aws_smithy_http::operation::Metadata", "aws_smithy_http::connection::CaptureSmithyConnection", "aws_http::user_agent::AwsUserAgent", "aws_sig_auth::signer::OperationSigningConfig", "aws_sdk_sts::endpoint::Params", "aws_types::region::Region", "aws_smithy_types::endpoint::Endpoint", "aws_credential_types::cache::SharedCredentialsCache", "alloc::vec::Vec<http::version::Version>", "aws_types::region::SigningRegion"] }, poisoned: false, .. }) } }) }) } })))
    2: failed to construct request: failed to load credentials from the credentials cache: an error occurred while loading credentials: service error: unhandled error: unhandled error: Error { code: "Throttling", message: "Rate exceeded", aws_request_id: "e08a1aa3-6d6c-487e-a6f2-" } (ConstructionFailure(ConstructionFailure { source: CredentialsStageError { source: ProviderError(ProviderError { source: ServiceError(ServiceError { source: Unhandled(Unhandled { source: ErrorMetadata { code: Some("Throttling"), message: Some("Rate exceeded"), extras: Some({"aws_request_id": "e08a1aa3-6d6c-487e-a6f2-"}) }, meta: ErrorMetadata { code: Some("Throttling"), message: Some("Rate exceeded"), extras: Some({"aws_request_id": "e08a1aa3-6d6c-487e-a6f2-"}) } }), raw: Response { inner: Response { status: 400, version: HTTP/1.1, headers: {"x-amzn-requestid": "e08a1aa3-6d6c-487e-a6f2-", "content-type": "text/xml", "content-length": "255", "date": "Tue, 26 Mar 2024 08:16:36 GMT", "connection": "close"}, body: SdkBody { inner: Once(Some(b"<ErrorResponse xmlns=\"https://sts.amazonaws.com/doc/2011-06-15/\">\n  <Error>\n    <Type>Sender</Type>\n    <Code>Throttling</Code>\n    <Message>Rate exceeded</Message>\n  </Error>\n  <RequestId>e08a1aa3-6d6c-487e-a6f2-</RequestId>\n</ErrorResponse>\n")), retryable: true } }, properties: SharedPropertyBag(Mutex { data: PropertyBag { contents: ["aws_types::SigningService", "aws_smithy_http::operation::Metadata", "aws_smithy_http::connection::CaptureSmithyConnection", "aws_http::user_agent::AwsUserAgent", "aws_sig_auth::signer::OperationSigningConfig", "aws_sdk_sts::endpoint::Params", "aws_types::region::Region", "aws_smithy_types::endpoint::Endpoint", "aws_credential_types::cache::SharedCredentialsCache", "alloc::vec::Vec<http::version::Version>", "aws_types::region::SigningRegion"] }, poisoned: false, .. }) } }) }) } }))
    3: failed to fetch slice 8194735750..8199046294 for object: s3://path/01HSWWTHNAMAKHV7ZHH3M9PYZJ.split
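
For scale, the request count reported above can be turned into an average rate and compared with what a one-hour credentials cache would imply. This is a minimal sketch: the function names are illustrative, the 24-hour window is assumed to be exactly 86,400 seconds, and the searcher count is hypothetical because the report does not state the cluster size.

```rust
/// Average request rate over a time window, in requests per second.
fn requests_per_second(total_requests: u64, window_secs: u64) -> f64 {
    total_requests as f64 / window_secs as f64
}

/// Credential refreshes expected per day if every searcher refreshes
/// exactly once per cache lifetime (an idealized assumption).
fn expected_refreshes_per_day(searchers: u64, cache_ttl_secs: u64) -> u64 {
    searchers * (86_400 / cache_ttl_secs)
}

fn main() {
    // Figure taken from the report: calls observed in 24 hours.
    let observed: u64 = 34_934_189;
    let rate = requests_per_second(observed, 86_400);
    println!("observed: {rate:.1} requests/s");

    // Hypothetical cluster size; the report does not give one.
    let searchers: u64 = 10;
    let expected = expected_refreshes_per_day(searchers, 3_600);
    println!("expected with a one-hour cache: {expected} requests/day");
}
```

The observed figure averages roughly 404 requests per second, several orders of magnitude above what hourly refreshes would produce for any plausible cluster size.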
guilload commented 7 months ago

We're using the default AWS Rust SDK credentials cache, which should cache credentials for 15 minutes. Can you set the log level to debug for aws_credential_types::cache and track the following logging statements for a single searcher:

tuziben commented 7 months ago

Thanks for your advice, but I don't know how to enable the debug log. I don't find it in the doc or in the helm chart.

fmassot commented 7 months ago

You can:

fulmicoton commented 7 months ago

Debug like this will generate a looooot of logs. Enough to disrupt your server. You can apply debug to a specific module only:

info,aws_credential_types::cache=debug

You can POST that string to the URL François shared or put it in the RUST_LOG environment variable.
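
To illustrate how a filter string such as the one above resolves to a level per module, here is a simplified sketch. It is not the parser Quickwit or the tracing crates actually use: the `Level` enum and `effective_level` function are hypothetical, and the rule modeled (the longest matching target prefix wins, a bare level acts as the default) is an assumption that covers only the `level` and `target=level` forms.

```rust
/// Log levels recognized by this simplified model.
#[derive(Debug, Clone, Copy, PartialEq)]
enum Level {
    Error,
    Warn,
    Info,
    Debug,
    Trace,
}

fn parse_level(s: &str) -> Option<Level> {
    match s.trim().to_ascii_lowercase().as_str() {
        "error" => Some(Level::Error),
        "warn" => Some(Level::Warn),
        "info" => Some(Level::Info),
        "debug" => Some(Level::Debug),
        "trace" => Some(Level::Trace),
        _ => None,
    }
}

/// Returns the level that applies to `target` under `filter`.
/// Returns `None` if a directive has an unknown level or nothing matches.
fn effective_level(filter: &str, target: &str) -> Option<Level> {
    // Tracks (length of matched prefix, level) for the best match so far.
    let mut best: Option<(usize, Level)> = None;
    for directive in filter.split(',').map(str::trim).filter(|d| !d.is_empty()) {
        let (prefix, level) = match directive.split_once('=') {
            // `target=level`: applies to targets starting with `target`.
            Some((p, l)) => (p.trim(), parse_level(l)?),
            // Bare `level`: the default, modeled as an empty prefix.
            None => ("", parse_level(directive)?),
        };
        let more_specific = best.map_or(true, |(len, _)| prefix.len() >= len);
        if target.starts_with(prefix) && more_specific {
            best = Some((prefix.len(), level));
        }
    }
    best.map(|(_, level)| level)
}

fn main() {
    let filter = "info,aws_credential_types::cache=debug";
    for target in ["aws_credential_types::cache", "quickwit_search::fetch_docs"] {
        println!("{target}: {:?}", effective_level(filter, target));
    }
}
```

Under this model only the credentials cache module is raised to debug, while every other module stays at info, which is why the scoped filter avoids flooding the server with logs.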

fulmicoton commented 7 months ago

After discussion with people from AWS, they suggest the bug is coming from the AWS SDK we use.