Open csdenboer opened 1 year ago
This is a very timely issue! My team is debugging the exact same issue on v0.27.0. We recently upgraded from an older version, and this newer version made v2 buffers the default. We just determined that v1 buffers on 0.27.0 do work as expected, but v2 buffers have this bug.
So if you need a workaround before someone fixes v2 buffers, you can try v1 buffers instead.
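For anyone who wants to try that workaround, here is a minimal sketch of the buffer change on an `aws_s3` sink (the sink name, inputs, bucket, region, and buffer size are placeholders, not taken from this report):

```toml
[sinks.s3]
type = "aws_s3"
inputs = ["my_source"]         # placeholder upstream component
bucket = "my-bucket"           # placeholder
region = "us-east-1"           # placeholder
# Workaround: select the older v1 disk buffer implementation instead of the
# disk (v2) buffers that became the default on newer versions.
buffer.type = "disk_v1"
buffer.max_size = 1073741824   # placeholder size in bytes
# (other required options such as encoding are omitted for brevity)
```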
Thanks for the tip! I switched `buffer.type` to `disk_v1`. Hopefully this will resolve it for the time being.
Hey there! Sorry to hear that both of you have been experiencing issues with data loss, potentially related to usage of the disk v2 buffers.
Looking over the report, and seeing the suggestion about using disk v1 buffers, I have to admit that this doesn't logically follow... and I'll explain why, and explain what other information I'd want to see to debug this further.
Sinks in Vector have quite a few layers, but generally it looks like this:
buffering -> batching -> request building -> sending requests
Buffering looks and functions like a first-in-first-out/ordered queue, which simply takes in events and holds them until the sink pulls them out. It doesn't care about the event contents at all: it just takes events, holds them, and allows them to be taken out.
Batching is where the event contents can begin to come into play, and indeed, in this case: the partitioning key -- `key_prefix` -- is based on a given event field, `class`. Thus, and perhaps obviously, batches will be faceted over the unique values of the `class` event field.
One thing to note is that in the case of the `aws_s3` sink, if the `key_prefix` cannot be rendered, perhaps because the `class` field does not exist on an event: that event will be dropped.
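If the `class` field can ever be absent, one way to guard against that silent drop is to guarantee the field exists before the sink so the `key_prefix` template always renders. This is only a minimal sketch, assuming a `remap` transform with placeholder component names and an arbitrary "unknown" fallback value:

```toml
[transforms.ensure_class]
type = "remap"
inputs = ["my_source"]          # placeholder upstream component
source = '''
# Give events without a `class` field a fallback value so the
# `key_prefix` template below can always be rendered.
if !exists(.class) {
  .class = "unknown"
}
'''

[sinks.s3]
type = "aws_s3"
inputs = ["ensure_class"]
bucket = "my-bucket"            # placeholder
key_prefix = "{{ class }}/"     # batches are partitioned per unique `class` value
# (other required options omitted for brevity)
```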
So, all of this said, while using the disk v1 buffers may somehow ameliorate the issue, it is highly unlikely that they are directly responsible based simply on how they function... and that's why we need some more information here.
The biggest things to look into would be the `component_errors_total` and `component_discarded_events_total` metrics, specifically for the `aws_s3` sink. That will tell us if any errors are happening that could reasonably explain why events either aren't making it into the appropriate batch, or why they're being dropped during batching/after batching but before being sent.
Next, it'd be good to look at `buffer_received_events_total`, `buffer_sent_events_total`, and `buffer_discarded_events_total`, again for the `aws_s3` sink. These represent the number of events pushed into the buffer, pulled out of the buffer, and dropped by the buffer due to an internal error, respectively. These metrics can show us, primarily, what the flow of events into and out of the buffer looks like, which we would then use in conjunction with...
The `component_received_events_total` and `component_sent_events_total` metrics, again, you guessed it, for the `aws_s3` sink. :D With all of these metrics, what we're looking to do is understand the full flow of events into and out of the `aws_s3` sink.
Beyond just the number of events, we can also use these metrics to understand the rate of change, which is potentially just as important for understanding any pathological behaviors around how the buffer interacts with the sink, and batching, and how that might be leading to unexpected stalls in batching/batch creation.
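If it helps with gathering those numbers, one common setup (a sketch only; the listen address is a placeholder) is to expose Vector's internal metrics via an `internal_metrics` source feeding a `prometheus_exporter` sink, or to watch component-level throughput interactively with `vector top`:

```toml
[sources.vector_metrics]
type = "internal_metrics"

[sinks.metrics_exporter]
type = "prometheus_exporter"
inputs = ["vector_metrics"]
address = "0.0.0.0:9598"   # placeholder listen address to scrape or curl
```

From there, `component_errors_total`, `component_discarded_events_total`, and the `buffer_*` metrics can be filtered down to the `aws_s3` sink's component ID and watched over time.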
@tobz the v2 buffers are not actually dropping the event, they are just occasionally postponing it indefinitely until additional data comes in. So the issue is basically that with low-volume data, `timeout_secs` is not always being honored. If no additional data comes in before the system is shut down, this is where the data loss occurs.
In our case, we use Vector to ship Linux audit logs to S3. Our acceptance test launches a server, generates several audit events, confirms they reached S3 successfully, then terminates the server. This fails 25% of the time with the v2 buffers, but never fails with the v1 buffers. If you SSH to the server to debug, the events are often then immediately flushed to S3 due to the audit events generated by the SSH connection.
I'll look into getting you those metrics this week, but I can say that there is definitely a very low volume of events in our case, and our timeout is set to 5 minutes.
@tobz I don't think it's caused by a missing `class` field, because the other sink in which I found the missing data is Elasticsearch, configured with `bulk.index = "{{ class }}"`.
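For context, the relevant part of that Elasticsearch sink looks roughly like this (a sketch with placeholder names and endpoint, not the actual configuration):

```toml
[sinks.es]
type = "elasticsearch"
inputs = ["my_source"]                  # placeholder
endpoints = ["http://127.0.0.1:9200"]   # placeholder
mode = "bulk"
bulk.index = "{{ class }}"              # templated on the same `class` field as the S3 key_prefix
```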
I think our situation may be a bit different from what @PaulFurtado describes. I see files with a first timestamp of `2023-02-27T08:00:20.064Z` and thousands of events after the `batch.timeout_secs` of 4 hours (> `2023-02-27T12:00:20.064Z`).
> @tobz the v2 buffers are not actually dropping the event, they are just occasionally postponing it indefinitely until additional data comes in.
@PaulFurtado Yeah, based on what @csdenboer is describing, it sounds like you're encountering a different issue. I'd recommend opening a new issue, which can be a copy/paste of what you wrote here plus any relevant metrics you can gather.
Reading up on this issue, it looks like, setting aside the v1/v2 red herring, Toby eloquently asked for some additional data collection on the original report.
@csdenboer, are you still seeing this issue with the latest version of Vector, and/or have you been able to take a look at the internal metrics following Toby's guidance?
Problem
I am currently having multiple issues with the S3 sink. At least one of them is causing data to be lost (!).
Given that `batch.timeout_secs` is set to 4 hours, I expect a new file for each "class" (see `key_prefix`) once every 4 hours. However, I quite often see files that contain data for a longer period than 4 hours, e.g. first timestamp 2023-03-02T08:00:28.881Z and last timestamp 2023-03-02T13:55:16.055Z.
Another issue on which I cannot really put my finger is that it seems that sometimes data is not added to a new batch of class A until the current batch of class B is flushed. Example:
- February 27, 2023, 09:00: both classes flushed data to S3
- February 27, 2023, 16:03: class A flushed to S3
- February 27, 2023, 16:33: class B flushed to S3
The next file of class A written to S3 does NOT contain data between February 27, 2023, 16:03 and February 27, 2023, 16:33. I am very sure there was data, because there is another output configured in which the data is available. It could be a coincidence that class A starts recording again after class B is flushed, but I think it's remarkable.
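For illustration only (this is a sketch with placeholder names, not the actual configuration from this report), the expectation described above corresponds to sink settings along these lines:

```toml
[sinks.s3]
type = "aws_s3"
inputs = ["my_source"]         # placeholder
bucket = "my-bucket"           # placeholder
key_prefix = "{{ class }}/"    # one batch/object series per class value
batch.timeout_secs = 14400     # 4 hours: each class's batch should flush at least this often
# (other required options omitted for brevity)
```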
Configuration
Version
vector 0.28.0 (aarch64-unknown-linux-gnu 971c594 2023-02-27)
Debug Output
Example Data
No response
Additional Context
No response
References
No response