vectordotdev / vector

A high-performance observability data pipeline.
https://vector.dev
Mozilla Public License 2.0

Slow GCS sink on Vector-aggregator caused open file handles for deleted logs on Vector-clients #9986

Open vpedosyuk opened 3 years ago

vpedosyuk commented 3 years ago

Vector Version

Vector clients:

0.17.3

Vector aggregator:

0.17.3

though the same behavior was observed with version 0.15.1

Vector Configuration File

Vector client:

[sources.logs]
  type = "file"
  include = ["/var/log/*.log", "/var/log/*/*.log"]
  ignore_older_secs = 3600
  ignore_not_found = true
  max_read_bytes = 2048
  fingerprint.strategy = "device_and_inode"

[transforms.logs-setup]
  type = "remap"
  inputs = ["logs"]

  source = '''
  .<additional_tag_name> = "<additional_tag_value>"
  '''

[sinks.vector-logs]
  type = "vector"
  version = "2"
  inputs = ["logs-setup"]
  address = "<vector_aggregator_address>:9000"
  healthcheck = true

  buffer.max_events = 2000000
  buffer.type = "memory"
  buffer.when_full = "block"

  batch.max_events = 20000
  batch.timeout_secs = 1

Vector aggregator:

[sources.logs]
  type = "vector"
  version = "2"
  address = "0.0.0.0:9000"

[sinks.gcs]
  type = "gcp_cloud_storage"
  inputs = ["logs"]
  healthcheck = true

  bucket = "<gcs_bucket_name>"
  credentials_path = "/etc/gcs/credentials.json"

  batch.max_size = 50000000
  batch.timeout_secs = 60

  buffer.type = "memory"
  buffer.max_events = 100000
  buffer.when_full = "block"

  encoding.codec = "text"
  compression = "none"

  key_prefix = "{{ <additional_tag_name> }}/{{ host }}{{ file }}/%F/%H_xx/"

  filename_append_uuid = false
  filename_time_format = "%H:%M:%S"

Expected Behavior

  1. Throughput should increase when more CPU and RAM are added; Vector should handle the increased data rate.
  2. Any issue in the Vector->Vector-aggregator pipeline should not be able to damage production servers.

Actual Behavior

  1. Throughput doesn't increase when more CPU and RAM are added; Vector stayed stuck in the same state despite a pod restart and increased CPU/memory resources.
  2. Issues in the Vector->Vector-aggregator pipeline cause "out of space" problems on production servers.

Context

We use a Vector->Vector-aggregator scheme: once the Vector-aggregator receives data from the Vector-clients, it sends the logs to a GCS bucket. There are 10 Vector clients (bare-metal servers) and 1 Vector aggregator running in Kubernetes.

We faced 2 issues:

  1. The Vector-aggregator can't scale vertically with the GCS sink. The vector_utilization metric reports almost 100% utilization for the gcp_cloud_storage sink on the Vector-aggregator and for the vector sink on the Vector-clients, yet very little data (or none at all) is actually transmitted. (A sketch of how these internal metrics can be exposed follows this list.)
  2. Vector-clients keep more and more file descriptors open, even for already-deleted logs, which leads to "out of space" problems on our servers, though this seems to be a known issue.
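For reference, vector_utilization and the other vector_* metrics come from Vector's internal telemetry. A minimal sketch of exposing them on either side of the pipeline (this internal_metrics source and prometheus_exporter sink are an assumed addition, not part of the configs above):

[sources.internal]
  # Emits Vector's own telemetry, including utilization and buffer metrics.
  type = "internal_metrics"

[sinks.prometheus]
  type = "prometheus_exporter"
  inputs = ["internal"]
  # Scrape http://<node>:9598/metrics to read vector_utilization etc.
  address = "0.0.0.0:9598"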

Details

I tried to reproduce the issue and it turned out to be quite similar to what we've seen before.

I marked moments when I took an action:

Actions taken:

  1. restarted all 10 Vector-clients; for some reason this triggered the issue, maybe due to a somewhat increased data rate right after a Vector restart

  2. increased the Vector-aggregator pod's limits (2 CPU, 4Gi mem -> 3 CPU, 5Gi mem) and restarted the pod. Even before this change, it wasn't using all of its resources.

  3. the Vector-aggregator still didn't handle all the data it was receiving and still wasn't using all the allocated resources:

    $ kubectl top po vector-7c66c785bd-wm97r
    NAME                      CPU(cores)   MEMORY(bytes)
    vector-7c66c785bd-wm97r   1399m        2382Mi

    and logs: vector-7c66c785bd-wm97r.log

  4. scaled the Vector-aggregator out to 2 replicas; horizontal scaling worked better than vertical

  5. since Kubernetes alone cannot load-balance long-lived gRPC connections, I restarted all 10 Vector-clients so that they initiate new connections

jszwedko commented 2 years ago

Thanks for reporting this @vpedosyuk. I think the back-pressure behavior is roughly as expected, but it does seem suspicious that your gcp_cloud_storage sink is so slow.

Vector-clients keep more and more file descriptors open even for already deleted logs, which leads to "out of space" problems on our servers. Though it seems to be a known issue.

Vector will hold onto the file handles until it has finished processing the files. Would you want different behavior here? Maybe Vector just releases deleted files even if it hasn't ingested the events from them?

vpedosyuk commented 2 years ago

Maybe Vector just releases deleted files even if it hasn't ingested the events from them?

@jszwedko that would be great, because we would rather lose logs than silently break our services. Ideally, it would be a setting in the file source to switch the behavior when needed, plus a related metric showing how many unfinished files/events get deleted.

jszwedko commented 2 years ago

Thanks for clarifying @vpedosyuk !

I think you can achieve that behavior by changing your sink buffer from when_full = "block" to when_full = "drop_newest". This will cause the sink to simply discard new events once its buffer is full.
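
A minimal sketch of that change applied to the client's vector sink from the config above (only the buffer.when_full value differs; the batch settings are left out here):

[sinks.vector-logs]
  type = "vector"
  version = "2"
  inputs = ["logs-setup"]
  address = "<vector_aggregator_address>:9000"

  buffer.type = "memory"
  buffer.max_events = 2000000
  # Drop new events instead of blocking the file source when the buffer fills up.
  buffer.when_full = "drop_newest"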

vpedosyuk commented 2 years ago

@jszwedko thank you for the amazing suggestion!

This time I couldn't emulate a "slow" upstream to check it; the upstream Vector-aggregator either worked or it didn't, so I'll need to play with it a bit more. I did notice that the Vector-client completely stops reading log files (i.e. "blocks") when the buffer is full and the Vector-aggregator is more or less unresponsive, so I've rarely seen an increase in vector_buffer_discarded_events_total; I'm not sure whether that's expected.

I've also found a good example of parameters that would be very useful in our case: https://www.elastic.co/guide/en/beats/filebeat/current/filebeat-input-filestream.html#filebeat-input-filestream-close-removed

vpedosyuk commented 2 years ago

I also tried to collect a profile of the unresponsive Vector-aggregator, because it shows a constant 90% utilization (per the vector_utilization metric) for the gcs sink while no actual processing happens at all: perf.tar.gz

I can't read it, but I hope it helps somehow.