vectordotdev / vector

A high-performance observability data pipeline.
https://vector.dev
Mozilla Public License 2.0

Unknown buffer disk limitation #19759

Open m1cha3lf opened 9 months ago

m1cha3lf commented 9 months ago


Problem

Hi,

We have multiple Vector instances which send their data to another, central Vector instance where I do some transformation. Basically, N small Vector instances send their requests to one big instance.

I have configured a large disk buffer to ensure no data is lost. Everything was working fine until 19:42, when everything suddenly stopped. The process itself was still up, but nothing was processed anymore.

Vector logs were showing errors like this:

Jan 31 19:42:13 vector-s3writer-bm vector[88628]: 2024-01-31T19:42:13.252148Z ERROR source{component_kind="source" component_id=requests10 component_type=vector}:grpc-request{grpc_service="vector.Vector" grpc_method="PushEvents"}: vector_common::internal_event::component_events_dropped: Events dropped intentional=false count=2798 reason="Source send cancelled." internal_log_rate_limit=true

$ ls -lah /var/lib/vector/buffer/v2/requests-vector-internal/
total 1.3G
drwxr-xr-x 2 vector vector 4.0K Jan 31 19:41 .
drwxr-xr-x 9 vector vector 4.0K Jan 30 15:19 ..
-rw-r----- 1 vector vector 127M Jan 30 15:19 buffer-data-0.dat
-rw-r----- 1 vector vector 126M Jan 30 15:19 buffer-data-1.dat
-rw-r----- 1 vector vector 127M Jan 30 15:19 buffer-data-10.dat
-rw-r----- 1 vector vector 126M Jan 30 15:19 buffer-data-12.dat
-rw-r----- 1 vector vector 128M Jan 30 15:19 buffer-data-2.dat
-rw-r----- 1 vector vector 127M Jan 30 15:19 buffer-data-20.dat
-rw-r----- 1 vector vector 125M Jan 30 15:19 buffer-data-23.dat
-rw-r----- 1 vector vector 126M Jan 30 15:19 buffer-data-3.dat
-rw-r----- 1 vector vector 128M Jan 30 15:19 buffer-data-4.dat
-rw-r----- 1 vector vector 127M Jan 31 19:41 buffer-data-65534.dat
-rw-r----- 1 vector vector   24 Jan 31 19:52 buffer.db
-rw-r--r-- 1 vector vector    0 Jan 31 19:52 buffer.lock

The time Vector stopped working is exactly the time it would have reached file number 65535, which sounds like it hit a configured limit.

Restarting Vector didn't solve the issue. I had to delete the buffer files and start over to bring it back.

I hope that helps to find the root cause.

Regards,

Michael

Configuration

requests-vector-internal:
    type: vector
    compression: true
    healthcheck: false
    request:
      retry_attempts: 100
      timeout_secs: 60
      retry_max_duration_secs: 10
    batch:
      max_bytes: 10000000
      max_events: 30000
      timeout_secs: 10
    buffer:
      type: "disk"
      max_size: 159684354880
      when_full: "drop_newest"
    inputs:
      - requests10
    address: "127.0.0.1:9011"

Version

$ vector --version
vector 0.34.2 (x86_64-unknown-linux-gnu d685a16 2024-01-02 14:59:54.890517436)

Debug Output

No response

Example Data

No response

Additional Context

I assume the reason is this:

pub fn get_data_file_path(&self, file_id: u16) -> PathBuf {
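For context, here is a minimal sketch of what that path construction presumably does, based on the buffer-data-<id>.dat names observed above (the helper name and directory handling are illustrative, not the actual internals):

use std::path::{Path, PathBuf};

// Illustrative only: maps a u16 file ID to its on-disk name, matching the
// buffer-data-<id>.dat files shown above. A u16 can never name anything
// beyond buffer-data-65535.dat.
fn data_file_path(buffer_dir: &Path, file_id: u16) -> PathBuf {
    buffer_dir.join(format!("buffer-data-{}.dat", file_id))
}

fn main() {
    let dir = Path::new("/var/lib/vector/buffer/v2/requests-vector-internal");
    println!("{}", data_file_path(dir, 65_534).display());
    // prints .../buffer-data-65534.dat
}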

References

No response

m1cha3lf commented 9 months ago

Hi,

any suggestions on what the expected behavior should be? I might be able to provide a PR.

A simple solution might be to change the u16 to a u32, but this would only delay this kind of issue and not really solve it.

The current situation has three problems as far as I can see:

1) No new data comes in, and there is no alert or error anywhere.
2) A restart doesn't solve it, and you still don't know what is happening.
3) The last parts of the previous buffer files aren't processed, which means data loss.

Michael

m1cha3lf commented 9 months ago

@jszwedko I'm adding you as a kind reminder for this issue. Any preference on how to solve it?

zamazan4ik commented 9 months ago

This limitation is not described in the user documentation. However, it is described here: https://github.com/vectordotdev/vector/blob/master/lib/vector-buffers/src/variants/disk_v2/mod.rs#L15. According to that description, the limitation is a design choice.

I guess describing these limitations in the user-facing documentation would be a good idea.

zamazan4ik commented 9 months ago

Anyway, the most interesting part here is that it seems like in your case you don't exceed the maximum buffer size, but for some reason the buffer file number has reached its maximum. Maybe some internal quirk/bug is hidden somewhere here.

m1cha3lf commented 9 months ago

I think there are multiple circumstances at play.

drwxr-xr-x 2 vector vector 4.0K Feb  7 09:07 .
drwxr-xr-x 8 vector vector 4.0K Jan 29 22:26 ..
-rw-r----- 1 vector vector 124M Feb  7 09:08 buffer-data-33925.dat
-rw-r----- 1 vector vector   24 Feb  7 09:08 buffer.db
-rw-r--r-- 1 vector vector    0 Feb  7 09:08 buffer.lock

This is copied (note the old date) from a high-throughput Vector instance. On new sinks the filenames start with buffer-data-1.dat, buffer-data-2.dat, etc.

Files which are successfully processed are deleted and disappear. However, the filename counter keeps increasing, and even after restarts it continues to grow. That means that after around 8 TB of written buffer files, Vector stops working even though there might be only 2-3 active files in the folder.

From my observations and local tests I can confirm that this is always the case. As soon as it reaches file number 65534, it stops working.

2) Once Vector has reached its maximum file number, it stops processing new events without any warning or error. On restart there's also no message indicating an issue; it just spins at 100% CPU usage.

The only way to recover Vector is to delete the buffer folder entirely and restart the process, which comes with data loss.

Having this behavior documented would bring transparency but wouldn't solve this in my opinion.

hillmandj commented 9 months ago

@m1cha3lf @jszwedko @zamazan4ik we are experiencing this, and other erratic behavior with disk buffers at scale. We also notice that sometimes when a buffer gets corrupted, new pods that spin up to try to pick up where it left off show this error:

2024-02-12T05:58:37.912771Z ERROR vector::topology::builder: Configuration error. error=Sink "datadog_egress": error occurred when building buffer: failed to build individual stage 0: failed to seek to position where reader left off: failed to decoded record: InvalidProtobufPayload

The only way to fix it is to do what @m1cha3lf said: delete the buffer and restart vector. What are your recommendations for mitigating this? We are considering having to switch to memory buffers to avoid this, but that would come at the expense of log durability and we're not sure that will fit our business use case.

zamazan4ik commented 9 months ago

@m1cha3lf

Having this behavior documented would bring transparency but wouldn't solve this in my opinion.

Completely agree. My comment was about copying the referenced design choices from the source code to a user-visible place: the official Vector documentation. Still, the current behavior you are seeing is not okay and needs to be resolved anyway. Those are two separate issues: fixing the bug and improving the documentation.

What are your recommendations for mitigating this?

Actually, from my experience with Vector and from the buffer-corruption issues in the issue tracker: just remove the buffers on demand when you hit a corrupted one. Yes, it's not convenient. Yes, it would be much better to investigate why buffer corruption happens, but we have what we have for now. IMHO, dealing with corrupted buffers from time to time is a better choice than switching completely to memory buffers (for the same reason you already mentioned: the log durability concern).

How often do you get corrupted buffers in practice? Well, in our production environment (not such a large deployment yet, to be honest), we never get them (our Vector is somewhere around 0.32-0.33, I don't remember precisely). I don't recommend switching to older versions; just try to work around the issue with a routine for deleting corrupted buffers.

hillmandj commented 9 months ago

@zamazan4ik we have a pretty large volume compared to you, I suppose... We process petabytes of data per month, and we're seeing these errors far more frequently. I think this is a larger issue with Vector, and it raises the question of whether it is ready for managing log data at scale.

There are a number of issues regarding disk buffers and these types of errors, such as this one, that don't seem to have been resolved. Having to manually intervene whenever there is a corrupt buffer (especially when you use autoscaling and more buffers get created on demand) is not a sustainable solution.

m1cha3lf commented 9 months ago

I think we have to differentiate the root causes a little bit. If we agree on how to handle them, I'm happy to provide a PR.

1) Vector stops working after reaching buffer file number 65535. This one is pretty isolated and, in my opinion, easily fixable:

https://github.com/vectordotdev/vector/blob/8d897af2f621a0402678141a3a94e1196ea56037/lib/vector-buffers/src/variants/disk_v2/ledger.rs#L122

The file ID counter is defined as a u16 and therefore ends at 65535, regardless of how many files are in the folder. After 8 TB (128 MB file size * 65535) of written buffer data, Vector stops processing.

A possible solution is to change the u16 to a u32 and emit warning messages when the counter gets close to the limit (a rough sketch of this is below). It's important not to confuse this limit with having 65k active files in the folder: the issue appears even if only one buffer file exists, because the counter increments on every file processed.

An alternative solution is to use random strings or UUIDs instead of numbers.
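A minimal sketch of the u16-to-u32 option with a warning, using hypothetical names (the real counter lives in the ledger state, so this only shows the shape of the change):

// Hypothetical sketch, not the actual ledger code: a wider file ID counter
// that warns before it wraps instead of silently stalling.
const MAX_FILE_ID: u32 = u32::MAX;
const WARN_MARGIN: u32 = 10_000;

fn next_file_id(current: u32) -> u32 {
    if current >= MAX_FILE_ID - WARN_MARGIN {
        eprintln!(
            "disk buffer file ID {} is within {} of the {} limit",
            current, WARN_MARGIN, MAX_FILE_ID
        );
    }
    // Wrap back to 0 at the limit instead of stopping processing.
    current.wrapping_add(1)
}

fn main() {
    assert_eq!(next_file_id(0), 1);
    assert_eq!(next_file_id(u32::MAX), 0); // wraps instead of stalling
}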

2) Data loss on buffer files after corruption.

As described in 1), Vector stops working with the buffer completely, even though not everything was processed. I assume this might also happen on InvalidProtobufPayload.

@hillmandj can you share some data to reproduce the issue you're facing? Maybe we can write a small reproducer which helps to find the root cause.

hillmandj commented 9 months ago

@m1cha3lf what data do you feel would be helpful? I can share our buffer configuration, etc., but there's not much I can share that isn't private. I'm a bit confused with respect to point 1: even if we increase it to u32, as mentioned in your previous comment, that wouldn't necessarily "solve" the problem, just give Vector more time before it crashes. It would also presumably take up more space on the host (not that this is a major concern for us).

I think Vector should not get into a state where any buffer causes a perpetual FATAL scenario in which the previous workings of a pod/buffer cause a new one to crash. Obviously, it makes sense that if a buffer can't flush, you'll lose data if Vector were to just "delete" a corrupt buffer, but perhaps with a configuration setting to allow that, it would be up to the user to decide how buffer corruption is handled.

hillmandj commented 9 months ago

Another potentially related issue: https://github.com/vectordotdev/vector/issues/19379

tobz commented 9 months ago

To be clear: by design, disk buffers do not arbitrarily stop at 8 TB or 65,535 data files. That limitation is speaking only about the maximum active size of a disk buffer: you can only have up to 65,535 data files outstanding at any given time, and thus, based on their maximum size, up to 8 TB of data buffered at any given time.

The behavior does seem very suspicious, though, given that it's stopping right at 65,534. We do use modulo division to wrap file IDs back around after hitting the maximum, but that means the highest file ID is only ever 65,534, rather than the maximum of the u16 itself: 65,535.

So we're potentially mixing up overflow behavior somewhere -- one area using modulo, and one area using the underlying mechanical limits of the data type -- and that could potentially be leading to the reader waiting for data file nr. 65,535 to be created, while the writer moves from data file nr. 65,534 to data file nr. 0.
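A tiny self-contained illustration of that mismatch, assuming one code path wraps with modulo 65,535 while another waits on the type's own maximum:

fn main() {
    // Writer side (assumed): wrap the file ID with modulo division.
    let highest_written = (0u32..200_000).map(|i| (i % 65_535) as u16).max().unwrap();
    assert_eq!(highest_written, 65_534); // modulo wrapping never produces 65,535

    // Reader side (assumed): wait for the data type's mechanical maximum.
    assert_eq!(u16::MAX, 65_535);

    // A reader waiting for buffer-data-65535.dat would therefore wait forever,
    // while the writer has already rolled over from 65,534 to 0.
}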

It's likely that our model testing/unit tests don't cover this because we use an arbitrarily low maximum file ID (32) in tests, in order to speed them up... where we would never get close to the data type's mechanical limits and would always hit that modulo divisor first.

hillmandj commented 9 months ago

That limitation is speaking only about the maximum active size of a disk buffer

@tobz just to clarify, none of our disk buffers are configured to 8TB; they're way lower than that. So the 8TB limit is essentially a cap on the internal Vector buffer database / state machine, right? I.e., it's a limit to help Vector function, not necessarily to store our data. So if the files keep increasing, does that mean that our buffers (despite being configured way lower) are never flushing?

It appears that what we need here is a way not to create a FATAL state in which a disk buffer has to be wiped away entirely for the pod to function. Buffer corruption errors should be handled, or users should at least be given a choice of how to handle them (i.e. crash, delete the whole buffer, or even better, delete only the file(s) that are corrupted).
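Sketched below as a hypothetical policy (not an existing Vector option), this is roughly the user-facing choice being asked for:

// Hypothetical, for illustration only: a per-buffer policy for what to do
// when a corrupted data file is detected on startup.
#[allow(dead_code)]
enum OnBufferCorruption {
    Crash,              // current behavior: fail and require manual cleanup
    DropBuffer,         // delete the whole buffer directory and start empty
    DropCorruptedFiles, // delete only the affected data file(s), keep the rest
}

fn main() {
    // Today Vector effectively behaves like `Crash`; the ask is to make this configurable.
    let _policy = OnBufferCorruption::Crash;
}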

tobz commented 9 months ago

So the 8TB limit is essentially a cap on the internal Vector buffer database / state machine, right?

It's a limit on how many buffer data files can exist on disk at any moment. If you had a sink, for example, that was permanently blocked trying to send, and so the buffer just continued to grow and grow... it would be limited to 8TB (again, based on the maximum number of data files times their maximum allowed size.)

So if the files keep increasing, does that mean that our buffers (despite being configured way lower) are never flushing?

No. File IDs are incremented monotonically until they wrap at the limit. This means that even if the first file -- buffer-data-0.dat -- is fully processed and can be deleted... the next data file won't reuse that file ID, but will instead roll over to buffer-data-1.dat, and so on.
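A back-of-envelope illustration of what that monotonic behavior implies, using the 128 MB per-file figure mentioned earlier in this thread (an assumption here, not taken from the source):

fn main() {
    // File IDs are never reused, so the cumulative bytes written through the
    // buffer are what exhaust the ID space, not the number of files alive at once.
    let max_data_file_size: u64 = 128 * 1024 * 1024; // assumed ~128 MB per data file
    let id_space: u64 = 65_535;
    let total_written = max_data_file_size * id_space;
    println!("ID space exhausted after ~{} TB written", total_written / 1_000_000_000_000);
    // => ~8 TB of cumulative writes, even if only 2-3 files are active at a time.
}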

It appears that what we need here is a way not to create a FATAL state in which a disk buffer has to be wiped away entirely for the pod to function. Buffer corruption errors should be handled, or users should at least be given a choice of how to handle them (i.e. crash, delete the whole buffer, or even better, delete only the file(s) that are corrupted).

Well, yes. This is a bug, and we obviously don't intend for users to encounter bugs and have to do a bunch of manual remediation, especially when that remediation involves potentially losing data.

hillmandj commented 9 months ago

@tobz thanks for the clarification. What is the priority of addressing this and other disk buffer related bugs in Vector?

tobz commented 9 months ago

I'll have to defer to @jszwedko here.

Our internal team structure around Vector has changed recently, so it's harder to figure out where time can be budgeted.

ebusto commented 3 months ago

I just encountered this bug on a production Vector instance. Given that it leads to data loss, is there any update on fixing this?

jszwedko commented 3 months ago

I just encountered this bug on a production Vector instance. Given that it leads to data loss, is there any update on fixing this?

Just to confirm, are you hitting the same error with the buffer ID wraparound? I.e., do you have a buffer file with 65534 in the name?

ebusto commented 3 months ago

I just encountered this bug on a production Vector instance. Given that it leads to data loss, is there any update on fixing this?

Just to confirm, are you hitting the same error with the buffer ID wraparound? I.e., do you have a buffer file with 65534 in the name?

Unfortunately I already cleared the buffer directory and restarted Vector, so I can't double check, but the other symptoms were the same:

  • "Source send cancelled." log messages.
  • Throughput came to a halt.

jszwedko commented 3 months ago

I just encountered this bug on a production Vector instance. Given that it leads to data loss, is there any update on fixing this?

Just to confirm, are you hitting the same error with the buffer ID wraparound? I.e., do you have a buffer file with 65534 in the name?

Unfortunately I already cleared the buffer directory and restarted Vector, so I can't double check, but the other symptoms were the same:

  • "Source send cancelled." log messages.
  • Throughput came to a halt.

Gotcha, thanks! Just asking because there are a couple of other issues that have similar symptoms:

So I just wanted to confirm that this is the one you were running into and not one of the others.

ebusto commented 3 months ago

Based on the contents of the logs, I'm reasonably certain that it is this issue in particular.

ebusto commented 1 month ago

Any update, by chance? We just got bit by this bug again.

SniffedPacket commented 1 month ago

+1 - This is starting to be a regular occurrence in our environment.

pront commented 4 weeks ago

Hi all, apologies for the delayed response on this issue :(

I will create a custom build to quickly hit the data file limit and observe the behavior around that time. I will share my findings with you as soon as possible.


I have configured a large disk buffer to ensure no data is lost.

Allow me to share some general thoughts here to manage expectations around the usage of buffers:

Relying on a large buffer size is not guaranteed to prevent data loss. Even if we supported arbitrarily large buffers, we would be bounded by disk space. A large buffer can help, but it’s not a foolproof solution in itself.

However, with some fine-tuning of the rate of produced events against the rate of consumed events, we can reach a good-enough solution. Proactive monitoring and alerting when these become imbalanced (in addition to monitoring Vector's buffer metrics) can help.

pront commented 3 weeks ago

Update: I used the following config:

data_dir: "/Users/pavlos.rontidis/CLionProjects/vector/pront_tests/data_dir"

schema:
  log_namespace: false
api:
  enabled: true
sources:
  source0:
    format: json
    interval: 0.001
    type: demo_logs
    decoding:
      codec: json

sinks:
  sink0:
    inputs:
      - source0
    type: console
    encoding:
      codec: json
    buffer:
      - type: "disk"
        max_size: 268435488
        when_full: "drop_newest"

And I hardcoded some internals specifically:

max_data_file_size = 1500
max_record_size = 500

I observed the following:

I might go back and try different configurations but this doesn't seem trivial to reproduce.
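For a rough sense of scale under those hardcoded values, here is a back-of-envelope estimate (assuming records sit near the 500-byte cap and data files roll over purely on size, which may not match the actual internals):

fn main() {
    let events_per_sec = 1_000.0; // demo_logs interval of 0.001s
    let records_per_file = 1_500.0_f64 / 500.0; // assumed ~3 records per 1,500-byte file
    let files_per_sec = events_per_sec / records_per_file;
    let secs_to_wrap = 65_535.0 / files_per_sec;
    println!("file ID space cycled in ~{:.0} minutes", secs_to_wrap / 60.0);
    // => a few minutes, if every event becomes a near-max-size record.
}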

erhudy commented 2 weeks ago

I've also started witnessing this in 0.41.1 on a transform that outputs to an HTTP sink. The sink was configured with a 1Gi limit but on disk it was only using about 20 megs.

I worked around it by setting the buffer type to memory, which is preferable to Vector crashlooping.

pront commented 2 weeks ago

I've also started witnessing this in 0.41.1 on a transform that outputs to an HTTP sink. The sink was configured with a 1Gi limit but on disk it was only using about 20 megs.

Can you share your config? Maybe it will help us reproduce this in a dev environment.

erhudy commented 1 week ago

I've also started witnessing this in 0.41.1 on a transform that outputs to an HTTP sink. The sink was configured with a 1Gi limit but on disk it was only using about 20 megs.

Can you share your config? Maybe it will help us reproduce this in a dev environment.

I'll see if I can clean up the config to sanitize/anonymize it. For what it's worth, we have 56 Vector instances deployed with this configuration and only 1 of them is actually exhibiting the behavior in question.

ebusto commented 1 week ago

I've also started witnessing this in 0.41.1 on a transform that outputs to an HTTP sink. The sink was configured with a 1Gi limit but on disk it was only using about 20 megs.

Can you share your config? Maybe it will help us reproduce this in a dev environment.

I'll see if I can clean up the config to sanitize/anonymize it. For what it's worth, we have 56 Vector instances deployed with this configuration and only 1 of them is actually exhibiting the behavior in question.

This seems to correlate with the output encountering errors. The last stage in our pipeline is an HTTP output, which periodically returns errors.