vectordotdev / vector

A high-performance observability data pipeline.
https://vector.dev
Mozilla Public License 2.0

S3/SQS source events fail sporadically #20314

Open Brightside56 opened 2 months ago

Brightside56 commented 2 months ago

Problem

I have configured the s3 source and I am getting the following error from time to time, roughly once every 30-60 minutes:

ERROR source{component_kind="source" component_id=s3 component_type=aws_s3}: vector::internal_events::aws_sqs::s3: Failed to process SQS message. message_id=8983e08a-46b8-48b8-a7f8-b06120861783 error=Failed to fetch s3://xxx/2024/04/16/1713311288-2e57bf4b-6612-4ab8-960e-517f01126830.log.gz: dispatch failure error_code="failed_processing_sqs_message" error_type="parser_failed" stage="processing" internal_log_rate_limit=true

However, after being in flight for 10 minutes or so, these messages (and their corresponding S3 objects) seem to be reprocessed from SQS successfully.

Configuration

Below is the s3 source config:

          sources:
            s3:
              type: aws_s3
              compression: gzip
              region: eu-west-1
              sqs:
                queue_url: https://sqs.eu-west-1.amazonaws.com/xxx/xxx-xxx
                delete_message: true
                client_concurrency: 5
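
The ~10 minute delay before successful reprocessing looks like the SQS visibility timeout expiring and the messages being redelivered. For reference, here is the same source with those SQS knobs spelled out (values below are illustrative only, and I'm assuming poll_secs and visibility_timeout_secs are available in this version):

          sources:
            s3:
              type: aws_s3
              compression: gzip
              region: eu-west-1
              sqs:
                queue_url: https://sqs.eu-west-1.amazonaws.com/xxx/xxx-xxx
                delete_message: true
                client_concurrency: 5
                # Illustrative values: how often SQS is polled and how long a
                # received message stays invisible before SQS redelivers it.
                poll_secs: 15
                visibility_timeout_secs: 300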

I also have logLevel: aws_smithy_http=info,vector=info set, but apart from the above error message and the periodic vector: Beep heartbeat messages, there are no other informative error messages that could explain the root cause of the issue.

Version

0.37.1

jszwedko commented 2 months ago

Hmm, yeah, that is odd. I think that sort of error would be coming from the AWS SDK (similar issue, but not for S3: https://github.com/awslabs/aws-sdk-rust/issues/844). You could try increasing the log level of aws_smithy_http to, say, debug or trace but I realize it'd be pretty chatty and so might not be feasible.
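
For example, something along these lines in your Helm values (assuming the logLevel value you mentioned is passed straight through to Vector's log filter; the aws_smithy_runtime target is an extra guess at where the dispatch-failure details might surface):

    # Chatty, but may reveal what the "dispatch failure" actually is.
    logLevel: "aws_smithy_http=trace,aws_smithy_runtime=debug,vector=info"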

hoazgazh commented 2 months ago

Same here, has anyone resolved this?

2024-05-04T17:07:15.594521Z INFO vector::topology::running: Running healthchecks.
2024-05-04T17:07:15.595648Z INFO vector: Vector has started. debug="false" version="0.37.1" arch="x86_64" revision="cb6635a 2024-04-09 13:45:06.561412437"
2024-05-04T17:07:15.595990Z INFO vector::app: API is disabled, enable by setting `api.enabled` to `true` and use commands like `vector top`.
2024-05-04T17:07:15.609041Z INFO source{component_kind="source" component_id=my_source_id component_type=aws_s3}:lazy_load_identity: aws_smithy_runtime::client::identity::cache::lazy: identity cache miss occurred; added new identity (took 185µs) new_expiration=2024-05-04T17:22:15.607636Z valid_for=899.998602s partition=IdentityCachePartition(2)
2024-05-04T17:07:15.826752Z INFO vector::topology::builder: Healthcheck passed.
2024-05-04T17:07:15.860011Z ERROR source{component_kind="source" component_id=my_source_id component_type=aws_s3}: vector::internal_events::aws_sqs: Failed to fetch SQS events. error=service error error_code="failed_fetching_sqs_events" error_type="request_failed" stage="receiving" internal_log_rate_limit=true
2024-05-04T17:07:15.907364Z ERROR source{component_kind="source" component_id=my_source_id component_type=aws_s3}: vector::internal_events::aws_sqs: Internal log [Failed to fetch SQS events.] is being suppressed to avoid flooding.

fpytloun commented 1 month ago

Also happens with S3 sink:

2024-05-22T09:40:24.026717Z  WARN sink{component_kind="sink" component_id=out_kafka_access_s3 component_type=aws_s3}:request{request_id=23796}: vector::sinks::util::retries: Retrying after error. error=dispatch failure internal_log_rate_limit=true
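
In case it is useful, the sink-side behaviour around these dispatch failures can at least be tuned via the generic request options. A rough sketch (illustrative values; I'm assuming the usual request.* retry settings apply to the aws_s3 sink):

    sinks:
      out_kafka_access_s3:
        type: aws_s3
        # ...bucket, key_prefix, encoding, etc. unchanged...
        request:
          timeout_secs: 60               # per-request timeout (illustrative)
          retry_initial_backoff_secs: 1  # wait before the first retry of a failed request
          retry_max_duration_secs: 30    # cap on the backoff between retries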

seluard commented 3 weeks ago

I think all of these are related to how the AWS SDK handles retries and failures for AWS IAM.

https://github.com/vectordotdev/vector/issues/20266
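
If credential refresh is the trigger (note the identity cache miss in the log above), one workaround to try is making the credential source explicit on the source instead of relying on the default provider chain. A sketch, assuming the standard auth options on aws_s3; the role ARN is a placeholder:

    sources:
      s3:
        type: aws_s3
        region: eu-west-1
        auth:
          # Hypothetical role ARN: pin the credential source explicitly.
          assume_role: arn:aws:iam::123456789012:role/vector-s3-reader
          # Give credential loading more headroom (assumption that this helps here).
          load_timeout_secs: 30
        sqs:
          queue_url: https://sqs.eu-west-1.amazonaws.com/xxx/xxx-xxx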

Brightside56 commented 2 weeks ago

> Hmm, yeah, that is odd. I think that sort of error would be coming from the AWS SDK (similar issue, but not for S3: https://github.com/awslabs/aws-sdk-rust/issues/844). You could try increasing the log level of aws_smithy_http to, say, debug or trace but I realize it'd be pretty chatty and so might not be feasible.

I've tried, but the messages weren't any more verbose in this specific part, even with aws_smithy_http=trace.