anil-db opened this issue 10 months ago
Thanks for the report! This is weird, because we actually validated this exact case recently as part of https://github.com/vectordotdev/vector/issues/11405.
Is it possible the batches are being flushed but not being sent successfully? Running Vector with debug logs should make it more clear.
Yeah, the batches are being flushed if we wait for the batch timeout_secs; if the grace period finishes before that, they won't be flushed. The ask is to interrupt that wait at termination. We can't just reduce the batch timeout_secs, because that would generate a lot of small files, and we can't use a very large grace period, because that affects how long rollouts take to upgrade the whole Vector cluster.
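For reference, a setup shaped roughly like this reproduces it. The values below are illustrative, filled in from the logs and discussion further down rather than the exact config:

```toml
# Illustrative sketch, not the exact config from this report.
[sources.dummy_logs]
type = "http_server"
address = "0.0.0.0:443"

[sinks.s3]
type = "aws_s3"
inputs = ["dummy_logs"]
bucket = "my-bucket"   # placeholder
region = "us-east-1"   # placeholder
encoding.codec = "json"

# Long batch timeout so the sink does not produce lots of tiny objects.
batch.timeout_secs = 180

# Disk buffer: the combination where shutdown waits out the batch timer.
buffer.type = "disk"
buffer.max_size = 268435488
```

The 3-minute batch.timeout_secs combined with a shorter shutdown grace period is exactly the window where an open batch can be lost.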
I added some of my own print statements to test. Here the batch timeout_secs is 3 minutes and there is just 1 event sent. The upload to S3 is unsuccessful, but that is not the cause, since there are no S3 retries and shutdown happens very quickly after the S3 request (https://github.com/vectordotdev/vector/issues/19601).
SIGINT happened at 2024-01-12T00:09:41.804809Z (INFO vector::signal: Signal received. signal="SIGINT"), and the batch was handed over to the request builder at 2024-01-12T00:12:35.405263Z (INFO sink{component_kind="sink" component_id=s3 component_type=aws_s3}: vector_stream::partitioned_batcher: handing over closed batch).
After that, shutdown happened very quickly, irrespective of the S3 request failing.
2024-01-12T00:09:35.400610Z INFO sink{component_kind="sink" component_id=s3 component_type=aws_s3}: vector_stream::partitioned_batcher: batch an item
2024-01-12T00:09:35.400819Z INFO vector_buffers::topology::channel::limited_queue: LimitedSender droping
^C2024-01-12T00:09:41.804809Z INFO vector::signal: Signal received. signal="SIGINT"
2024-01-12T00:09:41.805043Z INFO vector: Vector has stopped.
2024-01-12T00:09:41.805424Z INFO source{component_kind="source" component_id=dummy_logs component_type=http_server}: vector_buffers::topology::channel::limited_queue: LimitedSender droping
2024-01-12T00:09:41.806014Z INFO vector::topology::running: Shutting down... Waiting on running components. remaining_components="s3, dummy_logs" time_remaining="no time limit"
2024-01-12T00:09:46.806849Z INFO vector::topology::running: Shutting down... Waiting on running components. remaining_components="s3" time_remaining="no time limit"
2024-01-12T00:09:51.806873Z INFO vector::topology::running: Shutting down... Waiting on running components. remaining_components="s3" time_remaining="no time limit"
2024-01-12T00:09:56.806922Z INFO vector::topology::running: Shutting down... Waiting on running components. remaining_components="s3" time_remaining="no time limit"
2024-01-12T00:10:01.808060Z INFO vector::topology::running: Shutting down... Waiting on running components. remaining_components="s3" time_remaining="no time limit"
2024-01-12T00:10:06.807093Z INFO vector::topology::running: Shutting down... Waiting on running components. remaining_components="s3" time_remaining="no time limit"
2024-01-12T00:10:11.807035Z INFO vector::topology::running: Shutting down... Waiting on running components. remaining_components="s3" time_remaining="no time limit"
2024-01-12T00:10:16.807043Z INFO vector::topology::running: Shutting down... Waiting on running components. remaining_components="s3" time_remaining="no time limit"
2024-01-12T00:10:21.807173Z INFO vector::topology::running: Shutting down... Waiting on running components. remaining_components="s3" time_remaining="no time limit"
2024-01-12T00:10:26.807283Z INFO vector::topology::running: Shutting down... Waiting on running components. remaining_components="s3" time_remaining="no time limit"
2024-01-12T00:10:31.807373Z INFO vector::topology::running: Shutting down... Waiting on running components. remaining_components="s3" time_remaining="no time limit"
2024-01-12T00:10:36.806724Z INFO vector::topology::running: Shutting down... Waiting on running components. remaining_components="s3" time_remaining="no time limit"
2024-01-12T00:10:41.807554Z INFO vector::topology::running: Shutting down... Waiting on running components. remaining_components="s3" time_remaining="no time limit"
2024-01-12T00:10:46.807464Z INFO vector::topology::running: Shutting down... Waiting on running components. remaining_components="s3" time_remaining="no time limit"
2024-01-12T00:10:51.807714Z INFO vector::topology::running: Shutting down... Waiting on running components. remaining_components="s3" time_remaining="no time limit"
2024-01-12T00:10:56.807798Z INFO vector::topology::running: Shutting down... Waiting on running components. remaining_components="s3" time_remaining="no time limit"
2024-01-12T00:11:01.807787Z INFO vector::topology::running: Shutting down... Waiting on running components. remaining_components="s3" time_remaining="no time limit"
2024-01-12T00:11:06.807061Z INFO vector::topology::running: Shutting down... Waiting on running components. remaining_components="s3" time_remaining="no time limit"
2024-01-12T00:11:11.808624Z INFO vector::topology::running: Shutting down... Waiting on running components. remaining_components="s3" time_remaining="no time limit"
2024-01-12T00:11:16.807667Z INFO vector::topology::running: Shutting down... Waiting on running components. remaining_components="s3" time_remaining="no time limit"
2024-01-12T00:11:21.808804Z INFO vector::topology::running: Shutting down... Waiting on running components. remaining_components="s3" time_remaining="no time limit"
2024-01-12T00:11:26.808178Z INFO vector::topology::running: Shutting down... Waiting on running components. remaining_components="s3" time_remaining="no time limit"
2024-01-12T00:11:31.807523Z INFO vector::topology::running: Shutting down... Waiting on running components. remaining_components="s3" time_remaining="no time limit"
2024-01-12T00:11:36.807014Z INFO vector::topology::running: Shutting down... Waiting on running components. remaining_components="s3" time_remaining="no time limit"
2024-01-12T00:11:41.807271Z INFO vector::topology::running: Shutting down... Waiting on running components. remaining_components="s3" time_remaining="no time limit"
2024-01-12T00:11:46.807199Z INFO vector::topology::running: Shutting down... Waiting on running components. remaining_components="s3" time_remaining="no time limit"
2024-01-12T00:11:51.808635Z INFO vector::topology::running: Shutting down... Waiting on running components. remaining_components="s3" time_remaining="no time limit"
2024-01-12T00:11:56.808766Z INFO vector::topology::running: Shutting down... Waiting on running components. remaining_components="s3" time_remaining="no time limit"
2024-01-12T00:12:01.807392Z INFO vector::topology::running: Shutting down... Waiting on running components. remaining_components="s3" time_remaining="no time limit"
2024-01-12T00:12:06.808010Z INFO vector::topology::running: Shutting down... Waiting on running components. remaining_components="s3" time_remaining="no time limit"
2024-01-12T00:12:11.808101Z INFO vector::topology::running: Shutting down... Waiting on running components. remaining_components="s3" time_remaining="no time limit"
2024-01-12T00:12:16.808366Z INFO vector::topology::running: Shutting down... Waiting on running components. remaining_components="s3" time_remaining="no time limit"
2024-01-12T00:12:21.808242Z INFO vector::topology::running: Shutting down... Waiting on running components. remaining_components="s3" time_remaining="no time limit"
2024-01-12T00:12:26.808299Z INFO vector::topology::running: Shutting down... Waiting on running components. remaining_components="s3" time_remaining="no time limit"
2024-01-12T00:12:31.807982Z INFO vector::topology::running: Shutting down... Waiting on running components. remaining_components="s3" time_remaining="no time limit"
2024-01-12T00:12:35.404961Z INFO sink{component_kind="sink" component_id=s3 component_type=aws_s3}: vector_stream::partitioned_batcher: timer expired for batch
2024-01-12T00:12:35.405263Z INFO sink{component_kind="sink" component_id=s3 component_type=aws_s3}: vector_stream::partitioned_batcher: handing over closed batch
2024-01-12T00:12:35.653302Z ERROR sink{component_kind="sink" component_id=s3 component_type=aws_s3}:request{request_id=1}: vector::sinks::util::retries: Non-retriable error; dropping the request. error=service error internal_log_rate_limit=true
2024-01-12T00:12:35.653924Z ERROR sink{component_kind="sink" component_id=s3 component_type=aws_s3}:request{request_id=1}: vector_common::internal_event::service: Service call failed. No retries or retries exhausted. error=Some(ServiceError(ServiceError { source: PutObjectError { kind: Unhandled(Unhandled { source: Error { code: Some("PermanentRedirect"), message: Some("The bucket you are attempting to access must be addressed using the specified endpoint. Please send all future requests to this endpoint."), request_id: Some("XKEWR97TNAT9TW59"), extras: {"s3_extended_request_id": "Ujv/gt/b7NTrIsMxcUk5JbklpDMSdCebsT55q59xlqC36uY2h1F7T04naarIrJLz0fe61E+h0Ck="} } }), meta: Error { code: Some("PermanentRedirect"), message: Some("The bucket you are attempting to access must be addressed using the specified endpoint. Please send all future requests to this endpoint."), request_id: Some("XKEWR97TNAT9TW59"), extras: {"s3_extended_request_id": "Ujv/gt/b7NTrIsMxcUk5JbklpDMSdCebsT55q59xlqC36uY2h1F7T04naarIrJLz0fe61E+h0Ck="} } }, raw: Response { inner: Response { status: 301, version: HTTP/1.1, headers: {"x-amz-bucket-region": "us-east-1", "x-amz-request-id": "XKEWR97TNAT9TW59", "x-amz-id-2": "Ujv/gt/b7NTrIsMxcUk5JbklpDMSdCebsT55q59xlqC36uY2h1F7T04naarIrJLz0fe61E+h0Ck=", "content-type": "application/xml", "transfer-encoding": "chunked", "date": "Fri, 12 Jan 2024 00:12:35 GMT", "server": "AmazonS3", "connection": "close"}, body: SdkBody { inner: Once(Some(b"<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n<Error><Code>PermanentRedirect</Code><Message>The bucket you are attempting to access must be addressed using the specified endpoint. Please send all future requests to this endpoint.</Message><Endpoint>s3.amazonaws.com</Endpoint><Bucket>my-bucket</Bucket><RequestId>XKEWR97TNAT9TW59</RequestId><HostId>Ujv/gt/b7NTrIsMxcUk5JbklpDMSdCebsT55q59xlqC36uY2h1F7T04naarIrJLz0fe61E+h0Ck=</HostId></Error>")), retryable: true } }, properties: SharedPropertyBag(Mutex { data: PropertyBag, poisoned: false, .. }) } })) request_id=1 error_type="request_failed" stage="sending" internal_log_rate_limit=true
2024-01-12T00:12:35.654299Z ERROR sink{component_kind="sink" component_id=s3 component_type=aws_s3}:request{request_id=1}: vector_common::internal_event::component_events_dropped: Events dropped intentional=false count=1 reason="Service call failed. No retries or retries exhausted." internal_log_rate_limit=true
2024-01-12T00:12:35.654414Z INFO sink{component_kind="sink" component_id=s3 component_type=aws_s3}:request{request_id=1}: vector_common::finalization: droping event finalizer
2024-01-12T00:12:35.654451Z INFO sink{component_kind="sink" component_id=s3 component_type=aws_s3}:request{request_id=1}: vector_common::finalization: updating status AtomicCell { value: Delivered } Rejected
2024-01-12T00:12:35.654500Z INFO sink{component_kind="sink" component_id=s3 component_type=aws_s3}:request{request_id=1}: vector_common::finalization: droping notifier
2024-01-12T00:12:35.654528Z INFO sink{component_kind="sink" component_id=s3 component_type=aws_s3}:request{request_id=1}: vector_common::finalization: sending drop status Rejected
2024-01-12T00:12:35.654847Z INFO sink{component_kind="sink" component_id=s3 component_type=aws_s3}: vector_stream::partitioned_batcher: stream close see if there unfinished batch
2024-01-12T00:12:36.808033Z INFO vector::topology::running: Shutting down... Waiting on running components. remaining_components="s3" time_remaining="no time limit"
Thanks for the additional detail. This seems like a bug then since it was working when we verified https://github.com/vectordotdev/vector/issues/11405
It seems to only happen when we use a disk buffer with the s3 sink. When using a memory buffer, the flush triggers immediately at shutdown.
I was gonna ask about that actually. When Vector is shutting down it will attempt to flush the buffers, including disk buffers. In the future, we'd like to have the disk buffers just close during shutdown rather than trying to flush. Do you think, in your case, that it could be waiting to try to flush all of the data out of the disk buffer? Or do you think the disk buffer is empty but it is still waiting to shut down until the batch timeout triggers?
We had just 1 event, so the reader side of the disk buffer was already up to date. It seems the reader side of the disk buffer does not get the signal, while in the case of the memory buffer, when the LimitedSender object is dropped, the LimitedReceiver side's poll finishes. That signaling is not happening with the disk buffer, as per my understanding.
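As a rough analogy (using tokio::sync::mpsc here, not Vector's actual LimitedSender/LimitedReceiver types), this is the close signal the memory path gets and the disk-buffer reader apparently never sees:

```rust
use tokio::sync::mpsc;

#[tokio::main]
async fn main() {
    // Analogy only: a plain tokio channel standing in for the memory buffer.
    let (tx, mut rx) = mpsc::channel::<&'static str>(8);
    tx.send("event 1").await.unwrap();

    // Shutdown: dropping the sender closes the channel.
    drop(tx);

    // The receiver drains whatever is already buffered, then immediately
    // gets `None` instead of sitting on a batch timer.
    while let Some(event) = rx.recv().await {
        println!("flush partial batch containing: {event}");
    }
    println!("sender gone; the sink can close its open batch right away");

    // With a disk buffer the reader polls the buffer files rather than a
    // channel, so (as far as I can tell) it never observes this close and
    // the open batch waits for the full timeout_secs.
}
```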
Interesting, thanks for the context. I updated the title since I think it affects more than just the AWS S3 sink.
Did this get clarified? Kind of worrying that flushing logs is a potential pain point here.
I'm also curious on the status of this. This potentially affects a use case for Vector that I have.
I've also observed this behavior in the GCS sink when using the disk buffer:
2024-10-08T11:23:17.587387Z INFO vector::topology::running: Shutting down... Waiting on running components. remaining_components="gcs" time_remaining="24 seconds left"
2024-10-08T11:23:22.587394Z INFO vector::topology::running: Shutting down... Waiting on running components. remaining_components="gcs" time_remaining="19 seconds left"
2024-10-08T11:23:27.587294Z INFO vector::topology::running: Shutting down... Waiting on running components. remaining_components="gcs" time_remaining="14 seconds left"
2024-10-08T11:23:32.586562Z INFO vector::topology::running: Shutting down... Waiting on running components. remaining_components="gcs" time_remaining="9 seconds left"
2024-10-08T11:23:37.587402Z INFO vector::topology::running: Shutting down... Waiting on running components. remaining_components="gcs" time_remaining="4 seconds left"
ERROR vector::topology::running: Failed to gracefully shut down in time. Killing components. components="gcs"
Problem
If the s3 sink is batching and, during shutdown, the batch is not yet full, it keeps waiting for the batch to fill or complete. The grace period may expire before that, causing those events not to be delivered. Depending on the type of buffer used, this results in either data loss or delayed delivery, and an ungraceful shutdown.
Configuration
Version
0.35.0
Debug Output
Example Data
No response
Additional Context
With this config, send 1 HTTP request:
curl -v http://localhost:443 -d "hi"
and SIGTERM the Vector process. Graceful shutdown gets stuck closing the sink.
References
No response