vectordotdev / vector

A high-performance observability data pipeline.
https://vector.dev
Mozilla Public License 2.0
17.51k stars 1.53k forks source link

Vector acks messages sent by S3 source even when delivery failed #19711

Closed tanushri-sundar closed 7 months ago

tanushri-sundar commented 8 months ago

A note for the community

Problem

I am sending messages from S3 to GCP PubSub using Vector.

I have observed that messages rejected by the PubSub sink for non-retryable errors (ex: 400, 403) are dropped.

2024-01-18T21:35:42.649078Z ERROR vector_common::internal_event::component_events_dropped: Internal log [Events dropped] is being suppressed to avoid flooding.

The dropped messages are deleted from SQS, even with a DLQ configured on my SQS.

I would expect that the SQS message not be deleted per Vector’s end-to-end-acknowledgement guarantee.


Expected behavior: Vector should delete messages in S3 only if delivery to the sink is successful. I have traced this issue back to the S3 source code here:

match result {
    BatchStatus::Delivered => Ok(()),
    BatchStatus::Errored => Err(ProcessingError::ErrorAcknowledgement),
    BatchStatus::Rejected => {
        // Sinks are responsible for emitting ComponentEventsDropped.
        // Failed events cannot be retried, so continue to delete the SQS source message.
        Ok(())
    }

The BatchStatus::Rejected case should call Err(ProcessingError::ErrorAcknowledgement).

This would allow failed messages to be DLQed, as the user can configure a DLQ on their SQS and set custom values for maximum retries and visibility timeout.

This fix could potentially be a configuration based approach, with a parameter such as strict_ack = true preventing Vector's overeager deletion.

Configuration

No response

Version

0.33.0

Debug Output

No response

Example Data

No response

Additional Context

No response

References

14899

14708

10870

jszwedko commented 8 months ago

This is one of the issues intended to be addressed by https://github.com/vectordotdev/vector/pull/14708.

tanushri-sundar commented 8 months ago

@jszwedko I agree, this could definitely be solved by the RFC #14708.

A smaller scope fix for this particular source would be very helpful, and it could potentially be a low-touch code change. Since there isn't a timeline for implementing the RFC, perhaps starting with this fix can help unblock some users more quickly.

jszwedko commented 7 months ago

Yeah, after reflecting a bit I agree that this seems like a safe incremental change to make even in the context of that RFC. I'd be happy to see a PR for it if you like.

tanushri-sundar commented 7 months ago

@jszwedko That sounds great! I've opened PR #19748

jszwedko commented 7 months ago

Closed by https://github.com/vectordotdev/vector/pull/19748