vectordotdev / vector

A high-performance observability data pipeline.
https://vector.dev
Mozilla Public License 2.0

Disk buffers should close at shut-down #19600

Open anil-db opened 10 months ago

anil-db commented 10 months ago


Problem

If the S3 sink is batching and the batch is not full at shutdown, the sink keeps waiting for the batch to fill up or for the batch timeout to expire. The shutdown grace period may expire earlier, causing those events not to be delivered. Depending on the type of buffer used, this results in data loss or delayed delivery and an ungraceful shutdown.

Configuration

data_dir = "/tmp/vector/data"
schema.log_namespace = false
acknowledgements.enabled = true
[api]
enabled = true
[sources.dummy_logs]
type = "http_server"
address = "0.0.0.0:443"
decoding.codec = "bytes"
tls.enabled = false
[sinks.s3]
type = "aws_s3"
bucket = "my-bucket"
batch.max_events = 100000
batch.timeout_secs = 600
inputs = ["dummy_logs"]
encoding.codec = "raw_message"
buffer.max_size = 1000000000
buffer.type = "disk"

Version

0.35.0

Debug Output

2024-01-11T20:41:58.799877Z  INFO vector::app: Log level is enabled. level="vector=info,codec=info,vrl=info,file_source=info,tower_limit=info,rdkafka=info,buffers=info,lapin=info,kube=info"
2024-01-11T20:41:58.800877Z  INFO vector::app: Loading configs. paths=["vector.toml"]
2024-01-11T20:41:59.365666Z  INFO vector::topology::running: Running healthchecks.
2024-01-11T20:41:59.365865Z  INFO vector: Vector has started. debug="true" version="0.35.0" arch="x86_64" revision=""
2024-01-11T20:41:59.366167Z  INFO source{component_kind="source" component_id=dummy_logs component_type=http_server}: vector::sources::util::http::prelude: Building HTTP server. address=0.0.0.0:443
2024-01-11T20:41:59.367863Z  INFO vector::internal_events::api: API server running. address=127.0.0.1:8686 playground=http://127.0.0.1:8686/playground
2024-01-11T20:41:59.639459Z ERROR vector::topology::builder: msg="Healthcheck failed." error=Unknown status code: 301 Moved Permanently component_kind="sink" component_type="aws_s3" component_id=s3
^C2024-01-11T20:42:09.619090Z  INFO vector::signal: Signal received. signal="SIGINT"
2024-01-11T20:42:09.619180Z  INFO vector: Vector has stopped.
2024-01-11T20:42:09.620726Z  INFO vector::topology::running: Shutting down... Waiting on running components. remaining_components="s3" time_remaining="59 seconds left"
2024-01-11T20:42:14.620626Z  INFO vector::topology::running: Shutting down... Waiting on running components. remaining_components="s3" time_remaining="54 seconds left"
2024-01-11T20:42:19.621906Z  INFO vector::topology::running: Shutting down... Waiting on running components. remaining_components="s3" time_remaining="49 seconds left"
2024-01-11T20:42:24.621188Z  INFO vector::topology::running: Shutting down... Waiting on running components. remaining_components="s3" time_remaining="44 seconds left"
2024-01-11T20:42:29.621132Z  INFO vector::topology::running: Shutting down... Waiting on running components. remaining_components="s3" time_remaining="39 seconds left"
2024-01-11T20:42:34.621181Z  INFO vector::topology::running: Shutting down... Waiting on running components. remaining_components="s3" time_remaining="34 seconds left"
2024-01-11T20:42:39.621160Z  INFO vector::topology::running: Shutting down... Waiting on running components. remaining_components="s3" time_remaining="29 seconds left"
2024-01-11T20:42:44.620830Z  INFO vector::topology::running: Shutting down... Waiting on running components. remaining_components="s3" time_remaining="24 seconds left"
2024-01-11T20:42:49.620156Z  INFO vector::topology::running: Shutting down... Waiting on running components. remaining_components="s3" time_remaining="19 seconds left"
2024-01-11T20:42:54.621015Z  INFO vector::topology::running: Shutting down... Waiting on running components. remaining_components="s3" time_remaining="14 seconds left"
2024-01-11T20:42:59.620651Z  INFO vector::topology::running: Shutting down... Waiting on running components. remaining_components="s3" time_remaining="9 seconds left"
2024-01-11T20:43:04.620839Z  INFO vector::topology::running: Shutting down... Waiting on running components. remaining_components="s3" time_remaining="4 seconds left"
2024-01-11T20:43:09.620738Z ERROR vector::topology::running: Failed to gracefully shut down in time. Killing components. components="s3"

Example Data

No response

Additional Context

With this config, send one HTTP request (curl -v http://localhost:443 -d "hi") and then SIGTERM Vector. The graceful shutdown gets stuck closing the sink.

References

No response

jszwedko commented 10 months ago

Thanks for the report! This is weird, because we actually validated this exact case recently as part of https://github.com/vectordotdev/vector/issues/11405.

jszwedko commented 10 months ago

Is it possible the batches are being flushed but not being sent successfully? Running Vector with debug logs should make it more clear.

anil-db commented 10 months ago

Is it possible the batches are being flushed but not being sent successfully? Running Vector with debug logs should make it more clear.

Yeah, the batches are flushed if we wait for batch.timeout_secs. If the grace period finishes before that, they won't flush. The ask is to interrupt that wait at termination. We cannot reduce batch.timeout_secs because that may generate a lot of small files, and we cannot use a very large grace period because it would slow down rollouts when upgrading the whole Vector cluster.
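To make the ask concrete, here is a minimal sketch (not Vector's actual code; the batcher, the flush function, and the watch channel are hypothetical stand-ins) of a batch wait that races against a shutdown signal instead of only against batch.timeout_secs:

```rust
use std::time::Duration;
use tokio::{select, sync::watch, time::sleep};

async fn run_batcher(mut shutdown: watch::Receiver<bool>) {
    let batch_timeout = Duration::from_secs(600); // batch.timeout_secs
    let mut batch: Vec<String> = vec!["hi".to_string()]; // partially filled batch

    select! {
        // Normal path: the batch timer expires and the batch is flushed.
        _ = sleep(batch_timeout) => flush(&mut batch),
        // Desired path: shutdown interrupts the wait and flushes immediately,
        // instead of letting the grace period expire first.
        _ = shutdown.changed() => flush(&mut batch),
    }
}

fn flush(batch: &mut Vec<String>) {
    println!("flushing {} event(s)", batch.len());
    batch.clear();
}

#[tokio::main]
async fn main() {
    let (tx, rx) = watch::channel(false);
    let batcher = tokio::spawn(run_batcher(rx));
    // Simulate SIGTERM arriving long before the 10-minute batch timeout.
    sleep(Duration::from_millis(100)).await;
    tx.send(true).unwrap();
    batcher.await.unwrap();
}
```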

I added some of my own print statements to test. Here the batch timeout is 3 minutes and there is just one event sent. The upload to S3 is unsuccessful, but that is not the cause, since the S3 request is not retried and shutdown happens very quickly after the S3 request (https://github.com/vectordotdev/vector/issues/19601).

SIGINT happened at 2024-01-12T00:09:41.804809Z (INFO vector::signal: Signal received. signal="SIGINT"), but the batch was only handed over to the request builder at 2024-01-12T00:12:35.405263Z (INFO sink{component_kind="sink" component_id=s3 component_type=aws_s3}: vector_stream::partitioned_batcher: handing over closed batch), i.e. only once the 3-minute batch timer expired.

After that, shutdown happened very quickly regardless of the S3 failure.

2024-01-12T00:09:35.400610Z  INFO sink{component_kind="sink" component_id=s3 component_type=aws_s3}: vector_stream::partitioned_batcher: batch an item
2024-01-12T00:09:35.400819Z  INFO vector_buffers::topology::channel::limited_queue: LimitedSender droping
^C2024-01-12T00:09:41.804809Z  INFO vector::signal: Signal received. signal="SIGINT"
2024-01-12T00:09:41.805043Z  INFO vector: Vector has stopped.
2024-01-12T00:09:41.805424Z  INFO source{component_kind="source" component_id=dummy_logs component_type=http_server}: vector_buffers::topology::channel::limited_queue: LimitedSender droping
2024-01-12T00:09:41.806014Z  INFO vector::topology::running: Shutting down... Waiting on running components. remaining_components="s3, dummy_logs" time_remaining="no time limit"
2024-01-12T00:09:46.806849Z  INFO vector::topology::running: Shutting down... Waiting on running components. remaining_components="s3" time_remaining="no time limit"
2024-01-12T00:09:51.806873Z  INFO vector::topology::running: Shutting down... Waiting on running components. remaining_components="s3" time_remaining="no time limit"
2024-01-12T00:09:56.806922Z  INFO vector::topology::running: Shutting down... Waiting on running components. remaining_components="s3" time_remaining="no time limit"
2024-01-12T00:10:01.808060Z  INFO vector::topology::running: Shutting down... Waiting on running components. remaining_components="s3" time_remaining="no time limit"
2024-01-12T00:10:06.807093Z  INFO vector::topology::running: Shutting down... Waiting on running components. remaining_components="s3" time_remaining="no time limit"
2024-01-12T00:10:11.807035Z  INFO vector::topology::running: Shutting down... Waiting on running components. remaining_components="s3" time_remaining="no time limit"
2024-01-12T00:10:16.807043Z  INFO vector::topology::running: Shutting down... Waiting on running components. remaining_components="s3" time_remaining="no time limit"
2024-01-12T00:10:21.807173Z  INFO vector::topology::running: Shutting down... Waiting on running components. remaining_components="s3" time_remaining="no time limit"
2024-01-12T00:10:26.807283Z  INFO vector::topology::running: Shutting down... Waiting on running components. remaining_components="s3" time_remaining="no time limit"
2024-01-12T00:10:31.807373Z  INFO vector::topology::running: Shutting down... Waiting on running components. remaining_components="s3" time_remaining="no time limit"
2024-01-12T00:10:36.806724Z  INFO vector::topology::running: Shutting down... Waiting on running components. remaining_components="s3" time_remaining="no time limit"
2024-01-12T00:10:41.807554Z  INFO vector::topology::running: Shutting down... Waiting on running components. remaining_components="s3" time_remaining="no time limit"
2024-01-12T00:10:46.807464Z  INFO vector::topology::running: Shutting down... Waiting on running components. remaining_components="s3" time_remaining="no time limit"
2024-01-12T00:10:51.807714Z  INFO vector::topology::running: Shutting down... Waiting on running components. remaining_components="s3" time_remaining="no time limit"
2024-01-12T00:10:56.807798Z  INFO vector::topology::running: Shutting down... Waiting on running components. remaining_components="s3" time_remaining="no time limit"
2024-01-12T00:11:01.807787Z  INFO vector::topology::running: Shutting down... Waiting on running components. remaining_components="s3" time_remaining="no time limit"
2024-01-12T00:11:06.807061Z  INFO vector::topology::running: Shutting down... Waiting on running components. remaining_components="s3" time_remaining="no time limit"
2024-01-12T00:11:11.808624Z  INFO vector::topology::running: Shutting down... Waiting on running components. remaining_components="s3" time_remaining="no time limit"
2024-01-12T00:11:16.807667Z  INFO vector::topology::running: Shutting down... Waiting on running components. remaining_components="s3" time_remaining="no time limit"
2024-01-12T00:11:21.808804Z  INFO vector::topology::running: Shutting down... Waiting on running components. remaining_components="s3" time_remaining="no time limit"
2024-01-12T00:11:26.808178Z  INFO vector::topology::running: Shutting down... Waiting on running components. remaining_components="s3" time_remaining="no time limit"
2024-01-12T00:11:31.807523Z  INFO vector::topology::running: Shutting down... Waiting on running components. remaining_components="s3" time_remaining="no time limit"
2024-01-12T00:11:36.807014Z  INFO vector::topology::running: Shutting down... Waiting on running components. remaining_components="s3" time_remaining="no time limit"
2024-01-12T00:11:41.807271Z  INFO vector::topology::running: Shutting down... Waiting on running components. remaining_components="s3" time_remaining="no time limit"
2024-01-12T00:11:46.807199Z  INFO vector::topology::running: Shutting down... Waiting on running components. remaining_components="s3" time_remaining="no time limit"
2024-01-12T00:11:51.808635Z  INFO vector::topology::running: Shutting down... Waiting on running components. remaining_components="s3" time_remaining="no time limit"
2024-01-12T00:11:56.808766Z  INFO vector::topology::running: Shutting down... Waiting on running components. remaining_components="s3" time_remaining="no time limit"
2024-01-12T00:12:01.807392Z  INFO vector::topology::running: Shutting down... Waiting on running components. remaining_components="s3" time_remaining="no time limit"
2024-01-12T00:12:06.808010Z  INFO vector::topology::running: Shutting down... Waiting on running components. remaining_components="s3" time_remaining="no time limit"
2024-01-12T00:12:11.808101Z  INFO vector::topology::running: Shutting down... Waiting on running components. remaining_components="s3" time_remaining="no time limit"
2024-01-12T00:12:16.808366Z  INFO vector::topology::running: Shutting down... Waiting on running components. remaining_components="s3" time_remaining="no time limit"
2024-01-12T00:12:21.808242Z  INFO vector::topology::running: Shutting down... Waiting on running components. remaining_components="s3" time_remaining="no time limit"
2024-01-12T00:12:26.808299Z  INFO vector::topology::running: Shutting down... Waiting on running components. remaining_components="s3" time_remaining="no time limit"
2024-01-12T00:12:31.807982Z  INFO vector::topology::running: Shutting down... Waiting on running components. remaining_components="s3" time_remaining="no time limit"
2024-01-12T00:12:35.404961Z  INFO sink{component_kind="sink" component_id=s3 component_type=aws_s3}: vector_stream::partitioned_batcher: timer expired for batch
2024-01-12T00:12:35.405263Z  INFO sink{component_kind="sink" component_id=s3 component_type=aws_s3}: vector_stream::partitioned_batcher: handing over closed batch
2024-01-12T00:12:35.653302Z ERROR sink{component_kind="sink" component_id=s3 component_type=aws_s3}:request{request_id=1}: vector::sinks::util::retries: Non-retriable error; dropping the request. error=service error internal_log_rate_limit=true
2024-01-12T00:12:35.653924Z ERROR sink{component_kind="sink" component_id=s3 component_type=aws_s3}:request{request_id=1}: vector_common::internal_event::service: Service call failed. No retries or retries exhausted. error=Some(ServiceError(ServiceError { source: PutObjectError { kind: Unhandled(Unhandled { source: Error { code: Some("PermanentRedirect"), message: Some("The bucket you are attempting to access must be addressed using the specified endpoint. Please send all future requests to this endpoint."), request_id: Some("XKEWR97TNAT9TW59"), extras: {"s3_extended_request_id": "Ujv/gt/b7NTrIsMxcUk5JbklpDMSdCebsT55q59xlqC36uY2h1F7T04naarIrJLz0fe61E+h0Ck="} } }), meta: Error { code: Some("PermanentRedirect"), message: Some("The bucket you are attempting to access must be addressed using the specified endpoint. Please send all future requests to this endpoint."), request_id: Some("XKEWR97TNAT9TW59"), extras: {"s3_extended_request_id": "Ujv/gt/b7NTrIsMxcUk5JbklpDMSdCebsT55q59xlqC36uY2h1F7T04naarIrJLz0fe61E+h0Ck="} } }, raw: Response { inner: Response { status: 301, version: HTTP/1.1, headers: {"x-amz-bucket-region": "us-east-1", "x-amz-request-id": "XKEWR97TNAT9TW59", "x-amz-id-2": "Ujv/gt/b7NTrIsMxcUk5JbklpDMSdCebsT55q59xlqC36uY2h1F7T04naarIrJLz0fe61E+h0Ck=", "content-type": "application/xml", "transfer-encoding": "chunked", "date": "Fri, 12 Jan 2024 00:12:35 GMT", "server": "AmazonS3", "connection": "close"}, body: SdkBody { inner: Once(Some(b"<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n<Error><Code>PermanentRedirect</Code><Message>The bucket you are attempting to access must be addressed using the specified endpoint. Please send all future requests to this endpoint.</Message><Endpoint>s3.amazonaws.com</Endpoint><Bucket>my-bucket</Bucket><RequestId>XKEWR97TNAT9TW59</RequestId><HostId>Ujv/gt/b7NTrIsMxcUk5JbklpDMSdCebsT55q59xlqC36uY2h1F7T04naarIrJLz0fe61E+h0Ck=</HostId></Error>")), retryable: true } }, properties: SharedPropertyBag(Mutex { data: PropertyBag, poisoned: false, .. }) } })) request_id=1 error_type="request_failed" stage="sending" internal_log_rate_limit=true
2024-01-12T00:12:35.654299Z ERROR sink{component_kind="sink" component_id=s3 component_type=aws_s3}:request{request_id=1}: vector_common::internal_event::component_events_dropped: Events dropped intentional=false count=1 reason="Service call failed. No retries or retries exhausted." internal_log_rate_limit=true
2024-01-12T00:12:35.654414Z  INFO sink{component_kind="sink" component_id=s3 component_type=aws_s3}:request{request_id=1}: vector_common::finalization: droping event finalizer
2024-01-12T00:12:35.654451Z  INFO sink{component_kind="sink" component_id=s3 component_type=aws_s3}:request{request_id=1}: vector_common::finalization: updating status AtomicCell { value: Delivered } Rejected
2024-01-12T00:12:35.654500Z  INFO sink{component_kind="sink" component_id=s3 component_type=aws_s3}:request{request_id=1}: vector_common::finalization: droping notifier
2024-01-12T00:12:35.654528Z  INFO sink{component_kind="sink" component_id=s3 component_type=aws_s3}:request{request_id=1}: vector_common::finalization: sending drop status Rejected
2024-01-12T00:12:35.654847Z  INFO sink{component_kind="sink" component_id=s3 component_type=aws_s3}: vector_stream::partitioned_batcher: stream close see if there unfinished batch
2024-01-12T00:12:36.808033Z  INFO vector::topology::running: Shutting down... Waiting on running components. remaining_components="s3" time_remaining="no time limit"
jszwedko commented 10 months ago

Thanks for the additional detail. This seems like a bug then since it was working when we verified https://github.com/vectordotdev/vector/issues/11405

anil-db commented 10 months ago

Thanks for the additional detail. This seems like a bug then since it was working when we verified #11405

It seems to only happen when we use a disk buffer with the S3 sink. When using a memory buffer, the flush triggers immediately at shutdown.

jszwedko commented 10 months ago

Thanks for the additional detail. This seems like a bug then since it was working when we verified #11405

It seems to only happen when we use a disk buffer with the S3 sink. When using a memory buffer, the flush triggers immediately at shutdown.

I was going to ask about that, actually. When Vector is shutting down it will attempt to flush the buffers, including disk buffers. In the future, we'd like to have the disk buffers just close during shutdown rather than trying to flush. Do you think, in your case, that it could be waiting to try to flush all of the data out of the disk buffer? Or do you think the disk buffer is empty but it is still waiting to shut down until the batch timeouts trigger?
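For illustration, a minimal sketch of the two shutdown strategies, using made-up types rather than Vector's actual buffer code: "flush" drains everything before exiting (which is where the batch timeout can outlast the grace period), while "close" stops reading and relies on the unread events being replayed from disk on the next start.

```rust
enum ShutdownStrategy {
    Flush, // current behavior: keep reading until the buffer is empty
    Close, // proposed behavior: stop immediately; unread events stay on disk
}

struct DiskBuffer {
    pending: Vec<String>, // events persisted on disk but not yet sent
}

impl DiskBuffer {
    fn shutdown(&mut self, strategy: ShutdownStrategy) {
        match strategy {
            ShutdownStrategy::Flush => {
                // Must wait for every event to be batched and sent, which is
                // where the batch timeout can outlast the grace period.
                while let Some(event) = self.pending.pop() {
                    println!("sending {event}");
                }
            }
            ShutdownStrategy::Close => {
                // Nothing to send now; the events are still on disk and will
                // be picked up when Vector restarts.
                println!("leaving {} event(s) on disk", self.pending.len());
            }
        }
    }
}

fn main() {
    let mut buffer = DiskBuffer { pending: vec!["hi".into()] };
    buffer.shutdown(ShutdownStrategy::Close);
}
```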

anil-db commented 10 months ago

Thanks for the additional detail. This seems like a bug then since it was working when we verified #11405

It seems to only happen when we use a disk buffer with the S3 sink. When using a memory buffer, the flush triggers immediately at shutdown.

I was going to ask about that, actually. When Vector is shutting down it will attempt to flush the buffers, including disk buffers. In the future, we'd like to have the disk buffers just close during shutdown rather than trying to flush. Do you think, in your case, that it could be waiting to try to flush all of the data out of the disk buffer? Or do you think the disk buffer is empty but it is still waiting to shut down until the batch timeouts trigger?

We had just one event, so the reader side of the disk buffer was already up to date. It seems the reader side of the disk buffer does not get a shutdown signal, whereas with the memory buffer, when the LimitedSender object is dropped, the LimitedReceiver's poll finishes. As per my understanding, this signaling is not happening for the disk buffer.
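A minimal sketch of the signaling I mean, using a plain tokio mpsc channel rather than Vector's actual LimitedSender/LimitedReceiver types: dropping the sender ends the receiver's stream, so a sink reading from it can flush its partial batch immediately. My understanding is that the disk buffer reader never observes an equivalent "closed" signal, so it just keeps waiting for more data.

```rust
use tokio::sync::mpsc;

#[tokio::main]
async fn main() {
    let (tx, mut rx) = mpsc::channel::<String>(16);

    tx.send("hi".to_string()).await.unwrap();
    drop(tx); // analogous to the "LimitedSender droping" log line at shutdown

    // recv() yields the buffered event, then None because the sender is gone.
    // A sink reading from this channel knows immediately that no more input
    // will arrive and can flush its partial batch.
    while let Some(event) = rx.recv().await {
        println!("got {event}");
    }
    println!("channel closed; safe to flush the partial batch now");
}
```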

jszwedko commented 10 months ago

Interesting, thanks for the context. I updated the title since I think it affects more than just the AWS S3 sink.

danielloader commented 2 months ago

Did this get clarified? Kind of worrying that flushing logs is a potential pain point here.

calebschoepp commented 1 month ago

I'm also curious about the status of this; it potentially affects a use case I have for Vector.

benjamin-awd commented 1 month ago

I've also observed this behavior in the GCS sink when using the disk buffer:

2024-10-08T11:23:17.587387Z  INFO vector::topology::running: Shutting down... Waiting on running components. remaining_components="gcs" time_remaining="24 seconds left"                                                                                                                 
2024-10-08T11:23:22.587394Z  INFO vector::topology::running: Shutting down... Waiting on running components. remaining_components="gcs" time_remaining="19 seconds left"                                                                                                                 
2024-10-08T11:23:27.587294Z  INFO vector::topology::running: Shutting down... Waiting on running components. remaining_components="gcs" time_remaining="14 seconds left"                                                                                                                 
2024-10-08T11:23:32.586562Z  INFO vector::topology::running: Shutting down... Waiting on running components. remaining_components="gcs" time_remaining="9 seconds left"                                                                                                                  
2024-10-08T11:23:37.587402Z  INFO vector::topology::running: Shutting down... Waiting on running components. remaining_components="gcs" time_remaining="4 seconds left"                                                                                                                  
ERROR vector::topology::running: Failed to gracefully shut down in time. Killing components. components="gcs"