vectordotdev / vector

A high-performance observability data pipeline.
https://vector.dev
Mozilla Public License 2.0

S3 sink loses data (!) #16666

Open csdenboer opened 1 year ago

csdenboer commented 1 year ago


Problem

I am currently having multiple issues with the S3 sink. At least one of them is causing data to be lost (!).

Given that batch.timeout_secs is set to 4 hours, I expect a new file for each "class" (see key_prefix) once every 4 hours. However, I quite often see files that contain data for a longer period than 4 hours, e.g. first timestamp 2023-03-02T08:00:28.881Z and last timestamp 2023-03-02T13:55:16.055Z.

Another issue, which I cannot quite put my finger on, is that data sometimes does not seem to be added to a new batch of class A until the current batch of class B is flushed. Example:

February 27, 2023, 09:00: both classes flushed data to S3
February 27, 2023, 16:03: class A flushed to S3
February 27, 2023, 16:33: class B flushed to S3

The next file of class A written to S3 does NOT contain data between February 27, 2023, 16:03 and February 27, 2023, 16:33. I am very sure there was data in that window, because another output is configured in which that data is available. It could be a coincidence that class A starts recording again right after class B is flushed, but I think it's remarkable.

Configuration

[sinks.s3]
type                        = "aws_s3"
inputs                      = ["transform"]
acknowledgements.enabled    = true
region                      = "eu-central-1"
bucket                      = "bucket"
key_prefix                  = "{{ class }}/year=%Y/month=%m/day=%d/"
acl                         = "private"
compression                 = "gzip"
framing.method              = "newline_delimited"
encoding.codec              = "json"
batch.max_bytes             = 134217728
batch.timeout_secs          = 14400                    # 4 hours
buffer.type                 = "disk"
buffer.max_size             = 1073741824               # 1GB
buffer.when_full            = "block"
encoding.except_fields      = ["class", "message"]     # remove class and save bytes
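
For reference, with the key_prefix above and the sink's default filename settings (a sketch; the exact suffix depends on filename_time_format, filename_append_uuid, and filename_extension), an event with class = "A" received on 2023-03-02 would be expected to land under an object key roughly like:

A/year=2023/month=03/day=02/<unix-timestamp>-<uuid>.log.gz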

Version

vector 0.28.0 (aarch64-unknown-linux-gnu 971c594 2023-02-27)

Debug Output

Feb 27 15:33:15 ip-10-0-18-133 vector: 2023-02-27T15:33:15.196600Z DEBUG sink{component_kind="sink" component_id=s3 component_type=aws_s3 component_name=s3}:request{request_id=2}: aws_endpoint: resolved endpoint endpoint=Endpoint { url: "https://s3.eu-central-1.amazonaws.com/", headers: {}, properties: {"authSchemes": Array([Object({"name": String("sigv4"), "signingRegion": String("eu-central-1")})])} } base_region=Some(SigningRegion("eu-central-1"))
Feb 27 15:33:15 ip-10-0-18-133 vector: 2023-02-27T15:33:15.196683Z DEBUG sink{component_kind="sink" component_id=s3 component_type=aws_s3 component_name=s3}:request{request_id=2}:send_operation{operation="PutObject" service="s3"}:provide_credentials{provider=default_chain}: aws_config::meta::credentials::lazy_caching: loaded credentials from cache
Feb 27 15:33:15 ip-10-0-18-133 vector: 2023-02-27T15:33:15.197617Z DEBUG hyper::client::connect::dns: resolving host="s3.eu-central-1.amazonaws.com"
Feb 27 15:33:15 ip-10-0-18-133 vector: 2023-02-27T15:33:15.199206Z DEBUG sink{component_kind="sink" component_id=s3 component_type=aws_s3 component_name=s3}:request{request_id=2}:send_operation{operation="PutObject" service="s3"}: hyper::client::connect::http: connecting to 52.219.171.93:443
Feb 27 15:33:15 ip-10-0-18-133 vector: 2023-02-27T15:33:15.200868Z DEBUG sink{component_kind="sink" component_id=s3 component_type=aws_s3 component_name=s3}:request{request_id=2}:send_operation{operation="PutObject" service="s3"}: hyper::client::connect::http: connected to 52.219.171.93:443
Feb 27 15:33:15 ip-10-0-18-133 vector: 2023-02-27T15:33:15.205651Z DEBUG hyper::proto::h1::io: flushed 16384 bytes
Feb 27 15:33:15 ip-10-0-18-133 vector: 2023-02-27T15:33:15.205704Z DEBUG hyper::proto::h1::io: flushed 16384 bytes
Feb 27 15:33:15 ip-10-0-18-133 vector: 2023-02-27T15:33:15.205725Z DEBUG hyper::proto::h1::io: flushed 16384 bytes
Feb 27 15:33:15 ip-10-0-18-133 vector: 2023-02-27T15:33:15.205748Z DEBUG hyper::proto::h1::io: flushed 16384 bytes
Feb 27 15:33:15 ip-10-0-18-133 vector: 2023-02-27T15:33:15.205766Z DEBUG hyper::proto::h1::io: flushed 16384 bytes
Feb 27 15:33:15 ip-10-0-18-133 vector: 2023-02-27T15:33:15.205784Z DEBUG hyper::proto::h1::io: flushed 16384 bytes
Feb 27 15:33:15 ip-10-0-18-133 vector: 2023-02-27T15:33:15.206830Z DEBUG hyper::proto::h1::io: flushed 32768 bytes
Feb 27 15:33:15 ip-10-0-18-133 vector: 2023-02-27T15:33:15.206922Z DEBUG hyper::proto::h1::io: flushed 16384 bytes
Feb 27 15:33:15 ip-10-0-18-133 vector: 2023-02-27T15:33:15.206951Z DEBUG hyper::proto::h1::io: flushed 16384 bytes
Feb 27 15:33:15 ip-10-0-18-133 vector: 2023-02-27T15:33:15.206970Z DEBUG hyper::proto::h1::io: flushed 16384 bytes
Feb 27 15:33:15 ip-10-0-18-133 vector: 2023-02-27T15:33:15.206989Z DEBUG hyper::proto::h1::io: flushed 16384 bytes
Feb 27 15:33:15 ip-10-0-18-133 vector: 2023-02-27T15:33:15.207018Z DEBUG hyper::proto::h1::io: flushed 16384 bytes
Feb 27 15:33:15 ip-10-0-18-133 vector: 2023-02-27T15:33:15.207036Z DEBUG hyper::proto::h1::io: flushed 16384 bytes
Feb 27 15:33:15 ip-10-0-18-133 vector: 2023-02-27T15:33:15.207339Z DEBUG hyper::proto::h1::io: flushed 32768 bytes
Feb 27 15:33:15 ip-10-0-18-133 vector: 2023-02-27T15:33:15.207366Z DEBUG hyper::proto::h1::io: flushed 16384 bytes
Feb 27 15:33:15 ip-10-0-18-133 vector: 2023-02-27T15:33:15.207388Z DEBUG hyper::proto::h1::io: flushed 16384 bytes
Feb 27 15:33:15 ip-10-0-18-133 vector: 2023-02-27T15:33:15.207420Z DEBUG hyper::proto::h1::io: flushed 16384 bytes
Feb 27 15:33:15 ip-10-0-18-133 vector: 2023-02-27T15:33:15.207443Z DEBUG hyper::proto::h1::io: flushed 16384 bytes
Feb 27 15:33:15 ip-10-0-18-133 vector: 2023-02-27T15:33:15.207491Z DEBUG hyper::proto::h1::io: flushed 16384 bytes
Feb 27 15:33:15 ip-10-0-18-133 vector: 2023-02-27T15:33:15.207509Z DEBUG hyper::proto::h1::io: flushed 16384 bytes
Feb 27 15:33:15 ip-10-0-18-133 vector: 2023-02-27T15:33:15.207565Z DEBUG hyper::proto::h1::io: flushed 16384 bytes
Feb 27 15:33:15 ip-10-0-18-133 vector: 2023-02-27T15:33:15.207583Z DEBUG hyper::proto::h1::io: flushed 16384 bytes
Feb 27 15:33:15 ip-10-0-18-133 vector: 2023-02-27T15:33:15.207622Z DEBUG hyper::proto::h1::io: flushed 16384 bytes
Feb 27 15:33:15 ip-10-0-18-133 vector: 2023-02-27T15:33:15.207640Z DEBUG hyper::proto::h1::io: flushed 16384 bytes
Feb 27 15:33:15 ip-10-0-18-133 vector: 2023-02-27T15:33:15.207669Z DEBUG hyper::proto::h1::io: flushed 16384 bytes
Feb 27 15:33:15 ip-10-0-18-133 vector: 2023-02-27T15:33:15.209554Z DEBUG hyper::proto::h1::io: flushed 32768 bytes
Feb 27 15:33:15 ip-10-0-18-133 vector: 2023-02-27T15:33:15.209588Z DEBUG hyper::proto::h1::io: flushed 16384 bytes
Feb 27 15:33:15 ip-10-0-18-133 vector: 2023-02-27T15:33:15.209650Z DEBUG hyper::proto::h1::io: flushed 16384 bytes
Feb 27 15:33:15 ip-10-0-18-133 vector: 2023-02-27T15:33:15.209679Z DEBUG hyper::proto::h1::io: flushed 16384 bytes
Feb 27 15:33:15 ip-10-0-18-133 vector: 2023-02-27T15:33:15.209708Z DEBUG hyper::proto::h1::io: flushed 16384 bytes
Feb 27 15:33:15 ip-10-0-18-133 vector: 2023-02-27T15:33:15.209735Z DEBUG hyper::proto::h1::io: flushed 16384 bytes
Feb 27 15:33:15 ip-10-0-18-133 vector: 2023-02-27T15:33:15.209768Z DEBUG hyper::proto::h1::io: flushed 16384 bytes
Feb 27 15:33:15 ip-10-0-18-133 vector: 2023-02-27T15:33:15.209788Z DEBUG hyper::proto::h1::io: flushed 16384 bytes
Feb 27 15:33:15 ip-10-0-18-133 vector: 2023-02-27T15:33:15.209809Z DEBUG hyper::proto::h1::io: flushed 16384 bytes
Feb 27 15:33:15 ip-10-0-18-133 vector: 2023-02-27T15:33:15.209827Z DEBUG hyper::proto::h1::io: flushed 16384 bytes
Feb 27 15:33:15 ip-10-0-18-133 vector: 2023-02-27T15:33:15.209846Z DEBUG hyper::proto::h1::io: flushed 16384 bytes
Feb 27 15:33:15 ip-10-0-18-133 vector: 2023-02-27T15:33:15.209866Z DEBUG hyper::proto::h1::io: flushed 16384 bytes
Feb 27 15:33:15 ip-10-0-18-133 vector: 2023-02-27T15:33:15.211151Z DEBUG hyper::proto::h1::io: flushed 32768 bytes
Feb 27 15:33:15 ip-10-0-18-133 vector: 2023-02-27T15:33:15.211178Z DEBUG hyper::proto::h1::io: flushed 16384 bytes
Feb 27 15:33:15 ip-10-0-18-133 vector: 2023-02-27T15:33:15.211194Z DEBUG hyper::proto::h1::io: flushed 16384 bytes
Feb 27 15:33:15 ip-10-0-18-133 vector: 2023-02-27T15:33:15.211253Z DEBUG hyper::proto::h1::io: flushed 16384 bytes
Feb 27 15:33:15 ip-10-0-18-133 vector: 2023-02-27T15:33:15.211270Z DEBUG hyper::proto::h1::io: flushed 16384 bytes
Feb 27 15:33:15 ip-10-0-18-133 vector: 2023-02-27T15:33:15.211294Z DEBUG hyper::proto::h1::io: flushed 16384 bytes
Feb 27 15:33:15 ip-10-0-18-133 vector: 2023-02-27T15:33:15.211312Z DEBUG hyper::proto::h1::io: flushed 16384 bytes
Feb 27 15:33:15 ip-10-0-18-133 vector: 2023-02-27T15:33:15.211330Z DEBUG hyper::proto::h1::io: flushed 16384 bytes
Feb 27 15:33:15 ip-10-0-18-133 vector: 2023-02-27T15:33:15.211363Z DEBUG hyper::proto::h1::io: flushed 16384 bytes
Feb 27 15:33:15 ip-10-0-18-133 vector: 2023-02-27T15:33:15.211381Z DEBUG hyper::proto::h1::io: flushed 16384 bytes
Feb 27 15:33:15 ip-10-0-18-133 vector: 2023-02-27T15:33:15.212711Z DEBUG hyper::proto::h1::io: flushed 32768 bytes
Feb 27 15:33:15 ip-10-0-18-133 vector: 2023-02-27T15:33:15.212751Z DEBUG hyper::proto::h1::io: flushed 16384 bytes
Feb 27 15:33:15 ip-10-0-18-133 vector: 2023-02-27T15:33:15.212814Z DEBUG hyper::proto::h1::io: flushed 16384 bytes
Feb 27 15:33:15 ip-10-0-18-133 vector: 2023-02-27T15:33:15.212840Z DEBUG hyper::proto::h1::io: flushed 16384 bytes
Feb 27 15:33:15 ip-10-0-18-133 vector: 2023-02-27T15:33:15.212861Z DEBUG hyper::proto::h1::io: flushed 16384 bytes
Feb 27 15:33:15 ip-10-0-18-133 vector: 2023-02-27T15:33:15.212881Z DEBUG hyper::proto::h1::io: flushed 16384 bytes
Feb 27 15:33:15 ip-10-0-18-133 vector: 2023-02-27T15:33:15.212902Z DEBUG hyper::proto::h1::io: flushed 16384 bytes
Feb 27 15:33:15 ip-10-0-18-133 vector: 2023-02-27T15:33:15.212926Z DEBUG hyper::proto::h1::io: flushed 16384 bytes
Feb 27 15:33:15 ip-10-0-18-133 vector: 2023-02-27T15:33:15.212948Z DEBUG hyper::proto::h1::io: flushed 16384 bytes
Feb 27 15:33:15 ip-10-0-18-133 vector: 2023-02-27T15:33:15.212970Z DEBUG hyper::proto::h1::io: flushed 16384 bytes
Feb 27 15:33:15 ip-10-0-18-133 vector: 2023-02-27T15:33:15.212992Z DEBUG hyper::proto::h1::io: flushed 16384 bytes
Feb 27 15:33:15 ip-10-0-18-133 vector: 2023-02-27T15:33:15.214245Z DEBUG hyper::proto::h1::io: flushed 32768 bytes
Feb 27 15:33:15 ip-10-0-18-133 vector: 2023-02-27T15:33:15.214275Z DEBUG hyper::proto::h1::io: flushed 16384 bytes
Feb 27 15:33:15 ip-10-0-18-133 vector: 2023-02-27T15:33:15.214306Z DEBUG hyper::proto::h1::io: flushed 16384 bytes
Feb 27 15:33:15 ip-10-0-18-133 vector: 2023-02-27T15:33:15.214325Z DEBUG hyper::proto::h1::io: flushed 16384 bytes
Feb 27 15:33:15 ip-10-0-18-133 vector: 2023-02-27T15:33:15.214342Z DEBUG hyper::proto::h1::io: flushed 16384 bytes
Feb 27 15:33:15 ip-10-0-18-133 vector: 2023-02-27T15:33:15.214361Z DEBUG hyper::proto::h1::io: flushed 16384 bytes
Feb 27 15:33:15 ip-10-0-18-133 vector: 2023-02-27T15:33:15.214381Z DEBUG hyper::proto::h1::io: flushed 16384 bytes
Feb 27 15:33:15 ip-10-0-18-133 vector: 2023-02-27T15:33:15.214421Z DEBUG hyper::proto::h1::io: flushed 16384 bytes
Feb 27 15:33:15 ip-10-0-18-133 vector: 2023-02-27T15:33:15.214441Z DEBUG hyper::proto::h1::io: flushed 16384 bytes
Feb 27 15:33:15 ip-10-0-18-133 vector: 2023-02-27T15:33:15.214462Z DEBUG hyper::proto::h1::io: flushed 16384 bytes
Feb 27 15:33:15 ip-10-0-18-133 vector: 2023-02-27T15:33:15.214482Z DEBUG hyper::proto::h1::io: flushed 16384 bytes
Feb 27 15:33:15 ip-10-0-18-133 vector: 2023-02-27T15:33:15.215590Z DEBUG hyper::proto::h1::io: flushed 20168 bytes
Feb 27 15:33:15 ip-10-0-18-133 vector: 2023-02-27T15:33:15.280860Z DEBUG hyper::proto::h1::io: parsed 7 headers
Feb 27 15:33:15 ip-10-0-18-133 vector: 2023-02-27T15:33:15.280882Z DEBUG hyper::proto::h1::conn: incoming body is empty
Feb 27 15:33:15 ip-10-0-18-133 vector: 2023-02-27T15:33:15.280926Z DEBUG sink{component_kind="sink" component_id=s3 component_type=aws_s3 component_name=s3}:request{request_id=2}:send_operation{operation="PutObject" service="s3"}: hyper::client::pool: pooling idle connection for ("https", s3.eu-central-1.amazonaws.com)


PaulFurtado commented 1 year ago

This is a very timely issue! My team is debugging the exact same issue on v0.27.0. We recently upgraded from an older version, and this newer version made v2 buffers the default. We just determined that v1 buffers on 0.27.0 do work as expected, but v2 buffers have this bug.

So if you need a workaround before someone fixes v2 buffers, you can try v1 buffers instead.

csdenboer commented 1 year ago

> This is a very timely issue! My team is debugging the exact same issue on v0.27.0. We recently upgraded from an older version, and this newer version made v2 buffers the default. We just determined that v1 buffers on 0.27.0 do work as expected, but v2 buffers have this bug.
>
> So if you need a workaround before someone fixes v2 buffers, you can try v1 buffers instead.

Thanks for the tip! I switched buffer.type to disk_v1. Hopefully this will resolve it for the time being.
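
For anyone else who needs the workaround, it amounts to changing only the buffer type while keeping the other buffer settings as-is. A sketch against the config above (assuming disk_v1 accepts the same max_size/when_full options):

buffer.type                 = "disk_v1"
buffer.max_size             = 1073741824               # 1GB
buffer.when_full            = "block"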

tobz commented 1 year ago

Hey there! Sorry to hear that both of you have been experiencing issues with data loss, potentially related to usage of the disk v2 buffers.

Looking over the report, and seeing the suggestion about using disk v1 buffers, I have to admit that this doesn't logically follow... and I'll explain why, and explain what other information I'd want to see to debug this further.

Sinks in Vector have quite a few layers, but generally it looks like this:

buffering -> batching -> request building -> sending requests

Buffering looks and functions like a first-in-first-out/ordered queue, which simply takes in events and holds them until the sink pulls them out. It doesn't care about the event contents at all: it just takes events, holds them, and allows them to be taken out.

Batching is where the event contents can begin to come into play, and indeed, in this case: the partitioning key -- key_prefix -- is based on a given event field class. Thus, and perhaps obviously, batches will be faceted over the unique values for the class event field.

One thing to note is that in the case of the aws_s3 sink, if the key_prefix cannot be rendered, perhaps because the class field does not exist on an event: that event will be dropped.
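
If there is any chance of class being absent on some events, a small remap transform in front of the sink can make that failure mode visible instead of silent. A minimal sketch (the transform name ensure_class and the fallback value are arbitrary):

[transforms.ensure_class]
type   = "remap"
inputs = ["transform"]
source = '''
# Give events without the partition key a fallback value so the S3 sink
# can still render key_prefix instead of dropping them.
if !exists(.class) {
  .class = "missing-class"
}
'''

The S3 sink's inputs would then point at ensure_class instead of transform.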

So, all of this said, while using the disk v1 buffers may somehow ameliorate the issue, it is highly unlikely that they are directly responsible based simply on how they function... and that's why we need some more information here.

The biggest things to look into would be the component_errors_total and component_discarded_events_total metrics, specifically for the aws_s3 sink. That will tell us if any errors are happening that could reasonably explain why events either aren't making it into the appropriate batch, or why they're being dropped during batching/after batching but before being sent.

Next, it'd be good to look at buffer_received_events_total, buffer_sent_events_total, and buffer_discarded_events_total, again, for the aws_s3 sink. These represent the number of events pushed into the buffer, pulled out of the buffer, and any events dropped by the buffer due to an internal error, respectively. These metrics can show us, primarily, what the flow of events into and out of the buffer looks like, which we would then use in conjunction with...

Finally, the component_received_events_total and component_sent_events_total metrics, again, you guessed it, for the aws_s3 sink. :D With all of these metrics, what we're looking to do is understand where the event counts diverge: how many events reach the sink, how many go into and come out of the buffer, and how many are ultimately sent to S3.

Beyond the raw number of events, we can also use these metrics to understand the rate of change, which is potentially just as important for spotting any pathological behavior in how the buffer interacts with the sink and with batching, and how that might be leading to unexpected stalls in batch creation.
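
If those metrics aren't being collected yet, one way to expose them is the internal_metrics source paired with a prometheus_exporter sink. A minimal sketch (component names and the listen address are placeholders):

[sources.vector_metrics]
type = "internal_metrics"

[sinks.vector_metrics_exporter]
type    = "prometheus_exporter"
inputs  = ["vector_metrics"]
address = "127.0.0.1:9598"

Scraping (or just curling) that endpoint and filtering for the metrics above with component_id="s3" should be enough to start comparing counts across the buffer, the sink, and what actually lands in S3.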

PaulFurtado commented 1 year ago

@tobz the v2 buffers are not actually dropping the event, they are just occasionally postponing it indefinitely until additional data comes in. So the issue is basically that with low-volume data, timeout_secs is not always being honored. If no additional data comes in before the system is shut down, this is where the data loss occurs.

In our case, we use vector to ship linux audit logs to s3. Our acceptance test launches a server, generates several audit events, confirms they reached s3 successfully, then terminates the server. This fails 25% of the time with the v2 buffers, but never fails with the v1 buffers. If you ssh to the server to debug, the events are often then immediately flushed to s3 due to the audit events generated by the ssh connection.

I'll look into getting you those metrics this week, but I can say that there is definitely a very low volume of events in our case and our timeout is set to 5 minutes.

csdenboer commented 1 year ago

@tobz I don't think it's caused by a missing class field, because the other sink in which I found the missing data is ElasticSearch configured with bulk.index = "{{ class }}".

I think our situation may be a bit different from what @PaulFurtado describes. I see files with first timestamp 2023-02-27T08:00:20.064Z and thousands of events after the batch.timeout_secs of 4 hours (> 2023-02-27T12:00:20.064Z).

tobz commented 1 year ago

> @tobz the v2 buffers are not actually dropping the event, they are just occasionally postponing it indefinitely until additional data comes in.

@PaulFurtado Yeah, based on what @csdenboer is describing, it sounds like you're encountering a different issue. I'd recommend opening a new issue, which can be a copy/paste of what you wrote here plus any relevant metrics you can gather.

neuronull commented 1 year ago

Reading up on this issue: setting aside the v1/v2 red herring, Toby eloquently asked the original reporter for some additional data collection.

@csdenboer, are you still seeing this issue with the latest version of Vector, and/or have you been able to take a look at the internal metrics following Toby's guidance?