vectordotdev / vector

A high-performance observability data pipeline.
https://vector.dev
Mozilla Public License 2.0

Tiered buffers on overflow: two bugs #20322

Open szibis opened 4 months ago

szibis commented 4 months ago


Problem

Two problems:

  1. The `buffer_type` tag in internal metrics always reports `memory`, even after the buffer overflows and starts using disk.
  2. Some higher-traffic metrics spill to disk. Because of problem 1, we can't see when the memory limit is reached and disk takes over, since the internal metrics always report a memory buffer. Once events reach disk, the buffer keeps growing, which makes it look like events are either not being sent, or not being cleaned up after they are sent from disk.
[screenshot: buffer growth graphs]

When we look into one of the kpods, we can see that the disk buffer is in use and growing, just as shown in the graphs.
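As a rough mental model of the reported behavior (this is NOT Vector's actual code, just a sketch of the symptom): a tiered buffer spills to a disk stage on overflow, but the metric tag appears to be taken only from the first, in-memory stage, so `buffer_type` never changes even while the disk stage is the one growing.

```python
# Hypothetical sketch of the symptom described above -- not Vector internals.
class TieredBuffer:
    def __init__(self, memory_max_events, disk_max_bytes):
        self.memory = []                 # primary (memory) stage
        self.disk = []                   # overflow (disk) stage
        self.memory_max = memory_max_events
        self.disk_max = disk_max_bytes

    def push(self, event):
        if len(self.memory) < self.memory_max:
            self.memory.append(event)
        else:
            self.disk.append(event)      # overflow path: spill to disk stage

    def metrics(self):
        # The bug as observed: the tag reflects only the top-level stage,
        # even while the disk stage is the one actually filling up.
        return {
            "buffer_type": "memory",     # always "memory", per the report
            "memory_events": len(self.memory),
            "disk_events": len(self.disk),
        }

buf = TieredBuffer(memory_max_events=3, disk_max_bytes=100)
for i in range(5):
    buf.push(i)

print(buf.metrics())
# memory holds 3 events, disk holds the 2 overflow events,
# yet buffer_type still reads "memory"
```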

I have no name!@vector-metrics-egress-us-east-1d-2:/data_dir/buffer/v2/datadog_agent_misc_metrics$ ls -la
total 272280
drwxrwsr-x 2 1000 1000      4096 Apr 17 07:28 .
drwxrwsr-x 4 1000 1000      4096 Nov  7 13:27 ..
-rw-rw---- 1 1000 1000 133820872 Apr 16 22:17 buffer-data-26599.dat
-rw-rw---- 1 1000 1000 133983344 Apr 17 07:28 buffer-data-26603.dat
-rw-r----- 1 1000 1000  10990808 Apr 17 07:58 buffer-data-26604.dat
-rw-rw-r-- 1 1000 1000        24 Apr 17 08:41 buffer.db
-rw-r--r-- 1 1000 1000         0 Apr 17 03:23 buffer.lock
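For scale, summing the `buffer-data-*.dat` segment sizes from the listing above (a quick sanity check on the numbers, not part of the report):

```python
# Sizes (bytes) of the buffer-data-*.dat segments from the ls output above
segments = [133_820_872, 133_983_344, 10_990_808]
total = sum(segments)
print(total)              # 278795024 bytes
print(total / 2**20)      # ~265.9 MiB, consistent with ls's "total 272280" (1K blocks)
```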

No additional info in logs.

Configuration

## non-custom metrics coming from dd-agent - not DogStatsD
type: datadog_metrics
inputs:
  - metrics_route._unmatched
#  - datadog_agent
default_api_key: "SECRET[secrets.METRICS_EGRESS_DD_API_KEY]"
endpoint: "${DD_SITE}" # override for pvlink instead of public site option
buffer:
  - type: memory
    max_events: 50000
    when_full: overflow
  - type: disk
    max_size: 15000000000 # close to 15GB; buffers already total 9GB+ including internal metrics, and the volume is currently 10GB.
    when_full: block
batch:
  max_events: 5000
  timeout_secs: 1
acknowledgements:
  enabled: false
request:
  concurrency: "adaptive"

Version

0.37

Debug Output

No response

Example Data

No response

Additional Context

No response

References

No response

szibis commented 4 months ago

Even after a kpod redeployment, the old buffers are not flushed from disk.

I have no name!@vector-metrics-egress-us-west-2c-0:/data_dir/buffer/v2/datadog_agent_misc_metrics$ ls -lah
total 552M
drwxrwsr-x 2 1000 1000 4.0K Apr 17 08:57 .
drwxrwsr-x 4 1000 1000 4.0K Apr 15 11:17 ..
-rw-rw---- 1 1000 1000 125M Apr 16 14:48 buffer-data-15.dat
-rw-rw---- 1 1000 1000 128M Apr 16 16:33 buffer-data-18.dat
-rw-rw---- 1 1000 1000 126M Apr 15 19:37 buffer-data-1.dat
-rw-rw---- 1 1000 1000 126M Apr 17 04:06 buffer-data-29.dat
-rw-rw---- 1 1000 1000  48M Apr 17 08:54 buffer-data-41.dat
-rw-rw---- 1 1000 1000   24 Apr 17 08:59 buffer.db
-rw-r--r-- 1 1000 1000    0 Apr 17 08:57 buffer.lock

Two days on, the buffers are still sitting on disk.

szibis commented 4 months ago

OK, after a redeploy the number of events drops properly, but the buffers remain on disk. Does this mean that after events are sent, the files are not cleaned up?

[screenshot: buffered event count dropping after redeploy]
jszwedko commented 4 months ago

Thanks for this report @szibis . I'm not actually sure when the buffer files are deleted when events are processed. @tobz is this something you know off the top of your head?

szibis commented 4 months ago

Maybe events are not processed once they reach the disk buffer on overflow, and that's why the files are not removed?