m1cha3lf opened this issue 9 months ago
Hi,
Any suggestions on what the expected behavior should be? I might be able to provide a PR.
A simple solution might be to change the u16 to a u32, but this would only delay this kind of issue and not really solve it.
The current situation has 3 problems as far as I can see:
1) no new data is coming in, without an alert / error anywhere
2) a restart doesn't solve it and you still don't know what is happening
3) the last parts of the previous buffer files aren't processed - which is data loss
Michael
@jszwedko I'm adding you as a kind reminder for this issue. Any preference on how to solve this?
This limitation is not described in the user documentation. However, it's described here: https://github.com/vectordotdev/vector/blob/master/lib/vector-buffers/src/variants/disk_v2/mod.rs#L15 . According to the description, such a limitation is a design choice.
I guess describing these limitations in the user-facing documentation would be a good idea.
Anyway, the most interesting part here is that it seems like in your case you don't exceed the maximum buffer size, but for some reason the buffer file number has reached the maximum. Maybe some internal quirk/bug is hidden somewhere here.
I think there are multiple circumstances at play.
drwxr-xr-x 2 vector vector 4.0K Feb 7 09:07 .
drwxr-xr-x 8 vector vector 4.0K Jan 29 22:26 ..
-rw-r----- 1 vector vector 124M Feb 7 09:08 buffer-data-33925.dat
-rw-r----- 1 vector vector 24 Feb 7 09:08 buffer.db
-rw-r--r-- 1 vector vector 0 Feb 7 09:08 buffer.lock
1) This is copied (note the old date) from a high-throughput Vector instance. On new sinks the filenames start with buffer-data-1.dat, buffer-data-2.dat, etc.
The files which are successfully processed are deleted and disappear. However, the filename counter keeps increasing - even after restarts it continues to grow. That means that after around 8 TB of written buffer files, Vector stops working even though there might be only 2-3 active files in the folder.
From my observations and local tests I can confirm that this is always the case: as soon as it reaches file number 65534 it stops working (see the scanner sketch at the end of this comment).
2) Once Vector has reached its maximum file number, it stops processing new events without any warning or error. On restart there is also no message indicating an issue - it just spins at 100% CPU usage.
The only way to recover Vector is to delete the whole buffer folder and restart the process, which comes with data loss.
Having this behavior documented would bring transparency but wouldn't solve this in my opinion.
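For anyone who wants to check how close an instance is to this wrap-around, here is a small standalone sketch (not part of Vector; the path below is just an example) that scans a buffer directory for the highest buffer-data-N.dat ID:

```rust
use std::{fs, path::Path};

// Standalone helper, not part of Vector: report the highest buffer-data-N.dat
// ID found in a buffer directory, to see how close an instance is to 65534.
fn highest_file_id(dir: &Path) -> Option<u64> {
    fs::read_dir(dir)
        .ok()?
        .filter_map(|entry| entry.ok())
        .filter_map(|entry| {
            let name = entry.file_name().into_string().ok()?;
            let id = name.strip_prefix("buffer-data-")?.strip_suffix(".dat")?;
            id.parse::<u64>().ok()
        })
        .max()
}

fn main() {
    // Example path -- point this at the sink's buffer directory.
    let dir = Path::new("/var/lib/vector/buffer/my_sink");
    match highest_file_id(dir) {
        Some(id) => println!("highest buffer data file ID: {}", id),
        None => println!("no buffer data files found"),
    }
}
```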
@m1cha3lf @jszwedko @zamazan4ik we are experiencing this, and other erratic behavior with disk buffers at scale. We also notice that sometimes when a buffer gets corrupted, new pods that spin up to try to pick up where it left off show this error:
2024-02-12T05:58:37.912771Z ERROR vector::topology::builder: Configuration error. error=Sink "datadog_egress": error occurred when building buffer: failed to build individual stage 0: failed to seek to position where reader left off: failed to decoded record: InvalidProtobufPayload
The only way to fix it is to do what @m1cha3lf said: delete the buffer and restart vector. What are your recommendations for mitigating this? We are considering having to switch to memory buffers to avoid this, but that would come at the expense of log durability and we're not sure that will fit our business use case.
@m1cha3lf
> Having this behavior documented would bring transparency but wouldn't solve this in my opinion.
Completely agree. My comment was about copying the referenced design choices from the source code to the user-visible place - the official Vector documentation. And still, the current behavior that you see is not ok and needs to be resolved anyway. That's two separate issues: fixing the bug and improving the documentation.
> What are your recommendations for mitigating this?
Actually, from my experience with Vector and from the buffer-corruption issues in the issue tracker: just remove the buffers on demand when you hit a corrupted one. Yes, it's not convenient. Yes, it would be much better to investigate why buffer corruption happens - but we have what we have for now. IMHO, dealing with corrupted buffers from time to time is a better choice than switching completely to memory buffers (for the same reasons you already mentioned - the log durability concern).
How often do you get corrupted buffers in practice? Well, in our production environment (not such a large deployment yet, tbh, but anyway), we never get them (our Vector is somewhere around 0.32-0.33, I don't remember precisely). I don't recommend switching to older versions - just try to work around the issue with a corrupted-buffer deletion routine.
@zamazan4ik we have a pretty large volume compared to you, I suppose... We process petabytes of data per month and we're seeing these errors far more frequently. I think this is a larger issue with Vector, and it calls into question whether it is ready for managing log data at scale.
There are a number of issues regarding disk buffers and these types of errors, such as this one here, that don't seem to have been resolved. Having to manually intervene whenever there is a corrupt buffer (especially when you use autoscaling and more buffers get created on demand) is not a sustainable solution.
I think we have to differentiate the root causes a little bit. If we agree on how to handle them, I'm happy to provide a PR.
1) Vector stops working after reaching buffer file number 65535. This one is pretty isolated and easily fixable in my opinion.
The file ID counter is defined as a u16 and therefore ends at 65535, regardless of how many files are in the folder. After 8 TB (128 MB file size * 65535) of written buffer data, Vector stops processing.
A possible solution is to change the u16 to a u32 and emit warning messages when the counter gets close to the limit (see the sketch after this list). It's important not to confuse this limit with having 65k active files in the folder: the issue appears even if only 1 buffer file exists, because the counter increases by one for every file processed.
An alternative solution is to use random strings or UUIDs instead of numbers.
2) Data loss on buffer files after corruption.
As described in 1), Vector stops working with the buffer entirely even though not everything was processed. I assume this might happen on InvalidProtobufPayload as well.
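Here is a minimal sketch of what I mean for 1) -- purely illustrative, with made-up names rather than Vector's actual internals -- showing a wider counter with a warning near the limit, plus the arithmetic behind the ~8 TB figure:

```rust
// Illustrative sketch only -- names and thresholds are hypothetical,
// not Vector's internals.
const MAX_FILE_ID: u32 = u32::MAX;
const WARN_THRESHOLD: u32 = MAX_FILE_ID - 1_000_000; // warn well before wrapping

fn next_file_id(current: u32) -> u32 {
    if current >= WARN_THRESHOLD {
        eprintln!(
            "warning: buffer data file ID {} is approaching the limit of {}",
            current, MAX_FILE_ID
        );
    }
    // Wrap explicitly so reader and writer agree on the roll-over point.
    current.wrapping_add(1)
}

fn main() {
    // With 128 MB data files, a u16 ID space covers roughly 8 TB of written data.
    let bytes_per_file: u64 = 128 * 1024 * 1024;
    let total = bytes_per_file * u64::from(u16::MAX);
    println!(
        "u16 ID space ~= {:.1} TiB of written buffer files",
        total as f64 / 1024f64.powi(4)
    );
    println!("next ID after 70000: {}", next_file_id(70_000));
}
```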
@hillmandj can you share some data to reproduce the issue you're facing? Maybe we can write a small reproducer which helps to find the root cause.
@m1cha3lf what data do you feel would be helpful? I can share our buffer configuration, etc., but there's not much I can share that isn't private. I'm a bit confused with respect to 1). Even if we increase it to u32, as mentioned in your previous comment, that wouldn't necessarily "solve" the problem - it would just give Vector more time before it crashes. It would also presumably take up more space on the host (not that this is a major concern for us).
I think Vector should not be able to get into a state where any buffer causes a perpetual FATAL scenario, where the previous workings of a pod/buffer cause a new one to crash. Obviously, it makes sense that you'll lose data if Vector were to just "delete" a corrupt buffer that can't flush, but if there were a configuration setting to allow that, it would be up to the user to decide how buffer corruption is handled.
Another potentially related issue: https://github.com/vectordotdev/vector/issues/19379
By design, disk buffers do not just arbitrarily stop at 8TB or 65,535 data files, just to be clear. That limitation is speaking only about the maximum active size of a disk buffer: you can only have up to 65,535 data files outstanding at any given time, and thus based on their maximum size, up to 8TB of data buffered at any given time.
The behavior does seem very suspicious, though, given that it's stopping right at 65,534... which stands out because we do use modulo division to wrap file IDs back around after hitting the maximum, and this will lead to the highest file ID only ever being 65,534 rather than the maximum of the u16 itself: 65,535.
So we're potentially mixing up overflow behavior somewhere -- one area using modulo, and one area using the underlying mechanical limits of the data type -- and that could potentially be leading to the reader waiting for data file nr. 65,535 to be created, while the writer moves from data file nr. 65,534 to data file nr. 0.
It's likely that our model testing/unit tests don't cover this because we use an arbitrarily low maximum file ID (32) in tests, in order to speed them up... where we would never get close to the data type's mechanical limits and would always hit that modulo divisor first.
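A rough illustration of that mismatch (made-up function names, not the actual buffer code): a modulo-based increment rolls over after 65,534, while a plain u16 wrap only rolls over after 65,535, so the two sides can disagree about whether data file 65,535 will ever exist.

```rust
// Made-up names for illustration -- not Vector's implementation.
const FILE_ID_LIMIT: u16 = u16::MAX; // 65,535

// Wrap via modulo: the highest ID ever produced is 65,534.
fn next_id_modulo(current: u16) -> u16 {
    current.wrapping_add(1) % FILE_ID_LIMIT
}

// Wrap via the data type itself: 65,535 is a valid ID.
fn next_id_u16(current: u16) -> u16 {
    current.wrapping_add(1)
}

fn main() {
    let current: u16 = 65_534;
    // A writer using modulo rolls over to 0 after 65,534...
    assert_eq!(next_id_modulo(current), 0);
    // ...while a reader relying on u16 wrapping waits for 65,535,
    // a data file that will never be created.
    assert_eq!(next_id_u16(current), 65_535);
    println!("writer -> {}, reader -> {}", next_id_modulo(current), next_id_u16(current));
}
```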
> That limitation is speaking only about the maximum active size of a disk buffer
@tobz just to clarify, none of our disk buffers are configured to 8TB; it's way lower than that. So the 8TB limit is essentially a cap on the internal Vector buffer database / state machine, right? I.e. it's a limit to help Vector function, not to store our data necessarily. So if the files keep increasing, does that mean that our buffers (despite being configured way lower) are never flushing?
It appears that what we need here is a way for us not to create a FATAL state in which a disk buffer has to be wiped away entirely for the pod to function. Buffer corruption errors should be handled, or at least give users a choice as to how to handle them (i.e. crash, delete the whole buffer, or even better: delete only the file(s) that are corrupted).
> So the 8TB limit is essentially a cap on the internal vector buffer database / state machine right?
It's a limit on how many buffer data files can exist on disk at any moment. If you had a sink, for example, that was permanently blocked trying to send, and so the buffer just continued to grow and grow... it would be limited to 8TB (again, based on the maximum number of data files times their maximum allowed size.)
> So if the files keep increasing, does that mean that our buffers (despite being configured way lower) are never flushing?
No. File IDs are incremented monotonically until they wrap at the limit. This means that even if the first file -- buffer-data-0.dat -- is fully processed and can be deleted... the next data file won't reuse that file ID, but will instead roll over to buffer-data-1.dat, and so on.
> It appears that what we need here is a way for us not to create a FATAL state in which a disk buffer requires to be wiped away entirely for pod to function. Buffer corruption errors should be handled, or at least give the users a choice as to how to handle it (i.e. crash, delete the whole buffer, or even better: delete only the file(s) that are corrupted)
Well, yes. This is a bug, and we obviously don't intend for users to encounter bugs and have to do a bunch of manual remediation, especially when that remediation involves potentially losing data.
@tobz thanks for the clarification. What is the priority of addressing this and other disk buffer related bugs in Vector?
I'll have to defer to @jszwedko here.
Our internal team structure around Vector has changed recently, so it's harder to figure out where time can be budgeted.
I just encountered this bug on a production Vector instance. Given that it leads to data loss, is there any update on fixing this?
> I just encountered this bug on a production Vector instance. Given that it leads to data loss, is there any update on fixing this?
Just to confirm, are you hitting the same error with the buffer ID wrap-around? I.e. do you have a buffer file with 65534 in the name?
Unfortunately I already cleared the buffer directory and restarted Vector, so I can't double check, but the other symptoms were the same:
- "Source send cancelled." log messages.
- Throughput came to a halt.
Gotcha, thanks! Just asking because there are a couple of other issues that have similar symptoms, so I just wanted to confirm that this is the one you were running into and not one of the others.
Based on the contents of the logs, I'm reasonably certain that it is this issue in particular.
Any update, by chance? We just got bit by this bug again.
+1 - This is starting to be a regular occurrence in our environment.
Hi all, apologies for the delayed response on this issue :(
I will create a custom build to quickly hit the data file limit and observe the behavior around that time. I will share my findings with you as soon as possible.
> I have configured a large disk buffer to ensure no data is lost.
Allow me to share some general thoughts here to manage expectations around the usage of buffers:
Relying on a large buffer size is not guaranteed to prevent data loss. Even if we supported arbitrarily large buffers, we would be bounded by disk space. A large buffer can help, but it’s not a foolproof solution in itself.
However, with some fine-tuning of the rate of produced events alongside the rate of consumed events, we can reach a good-enough solution. Proactive monitoring and alerting when these become imbalanced (in addition to monitoring Vector buffer metrics) can help.
Update: I used the following config:
data_dir: "/Users/pavlos.rontidis/CLionProjects/vector/pront_tests/data_dir"
schema:
  log_namespace: false
api:
  enabled: true
sources:
  source0:
    format: json
    interval: 0.001
    type: demo_logs
    decoding:
      codec: json
sinks:
  sink0:
    inputs:
      - source0
    type: console
    encoding:
      codec: json
    buffer:
      - type: "disk"
        max_size: 268435488
        when_full: "drop_newest"
And I hardcoded some internals specifically:
max_data_file_size = 1500
max_record_size = 500
I observed the following: the buffer-data-N.dat file kept updating (N increasing) until it reached u16::MAX.
I might go back and try different configurations but this doesn't seem trivial to reproduce.
I've also started witnessing this in 0.41.1 on a transform that outputs to an HTTP sink. The sink was configured with a 1Gi limit but on disk it was only using about 20 megs.
I worked around it by setting the buffer type to memory, which is preferable to Vector crashlooping.
> I've also started witnessing this in 0.41.1 on a transform that outputs to an HTTP sink. The sink was configured with a 1Gi limit but on disk it was only using about 20 megs.
Can you share your config? Maybe it will help us reproduce this in a dev environment.
I'll see if I can clean up the config to sanitize/anonymize it. For what it's worth, we have 56 Vector instances deployed with this configuration and only 1 of them is actually exhibiting the behavior in question.
This seems to correlate with the output encountering errors. The last stage in our pipeline is an HTTP output, which periodically returns errors.
Problem
Hi,
We have multiple Vector instances which are sending their data to another central Vector instance where I do some transformation. Basically, N small Vector instances are sending their requests to one big instance.
I have configured a large disk buffer to ensure no data is lost. Everything was working fine until 19:42, when everything stopped all of a sudden. The process itself was still up but nothing was processed anymore.
Vector logs were showing errors like this:
The time Vector stopped working is exactly the time it would have reached file number 65535, which sounds like it hit a configured limit.
A restart of Vector didn't solve the issue. I had to delete the buffer files and start again to bring it back.
I hope that helps to find the root cause.
Regards,
Michael
Configuration
Version
$ vector --version
vector 0.34.2 (x86_64-unknown-linux-gnu d685a16 2024-01-02 14:59:54.890517436)
Debug Output
No response
Example Data
No response
Additional Context
I assume the reason is this
pub fn get_data_file_path(&self, file_id: u16) -> PathBuf {
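For context, here is a minimal sketch of what that mapping presumably looks like, based on the file names seen in this thread (buffer-data-1.dat, buffer-data-33925.dat). This is an illustration, not Vector's actual implementation; it just shows why a u16 file_id caps the namespace at 65,536 possible data file names.

```rust
use std::path::{Path, PathBuf};

// Illustrative sketch only -- not Vector's actual code.
fn data_file_path(data_dir: &Path, file_id: u16) -> PathBuf {
    data_dir.join(format!("buffer-data-{}.dat", file_id))
}

fn main() {
    // Example directory; the real path comes from the buffer's data_dir.
    let dir = Path::new("/var/lib/vector/buffer");
    assert!(data_file_path(dir, 33925).ends_with("buffer-data-33925.dat"));
    // The largest ID a u16 can represent:
    println!("{}", data_file_path(dir, u16::MAX).display());
}
```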
References
No response