vpedosyuk opened this issue 3 years ago
Thanks for reporting this @vpedosyuk. I think the back-pressure behavior is roughly as expected, but it does seem suspicious that your gcp_cloud_storage sink is so slow.
> Vector-clients keep more and more file descriptors open even for already deleted logs, which leads to "out of space" problems on our servers. Though it seems to be a known issue.
Vector will hold onto handles to the files until it has finished processing them. Would you want different behavior here? Maybe Vector just releases deleted files even if it hasn't ingested the events from them?
> Maybe Vector just releases deleted files even if it hasn't ingested the events from them?
@jszwedko that would be great, because we'd rather lose logs than silently break our services. Ideally, it'd be a setting in the file source to switch the behavior when needed, plus a related metric showing how many unfinished files/events get deleted.
Thanks for clarifying @vpedosyuk!
I think you can achieve that behavior by changing your sink buffer from on_full = "block" to on_full = "drop_newest". This will cause the sink to simply discard new events once the buffer is full.
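For reference, a minimal sketch of what that could look like on a Vector-client's sink that ships to the aggregator. The sink name and capacity below are placeholders, and as far as I can tell the Vector docs spell this buffer option when_full, so double-check the exact key name against the docs for your version:

```toml
# Minimal sketch only; "to_aggregator" is a placeholder sink name.
[sinks.to_aggregator.buffer]
type       = "memory"
max_events = 10000            # illustrative capacity
when_full  = "drop_newest"    # discard new events instead of back-pressuring the file source
```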
@jszwedko thank you for the amazing suggestion!
This time I couldn't emulate a "slow" upstream to check it: the upstream Vector-aggregator either worked or it didn't, so I'll need to play with it a bit more. I did notice, though, that the Vector-client completely stops reading log files (i.e. "blocks") when the buffer is full and the Vector-aggregator is kind of unresponsive, so I've rarely seen an increase in vector_buffer_discarded_events_total; not sure whether that's expected.
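As a side note, one way to keep an eye on that counter is to export Vector's internal metrics; a minimal sketch below (the component names and listen address are placeholders, not taken from the configs in this issue):

```toml
# Expose Vector's internal metrics so counters such as
# vector_buffer_discarded_events_total can be scraped by Prometheus.
[sources.vector_metrics]
type = "internal_metrics"

[sinks.prom_exporter]
type    = "prometheus_exporter"
inputs  = ["vector_metrics"]
address = "0.0.0.0:9598"
```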
I've also found a good example of parameters that would be very useful in our case: https://www.elastic.co/guide/en/beats/filebeat/current/filebeat-input-filestream.html#filebeat-input-filestream-close-removed
I also tried to collect a profile of the unresponsive Vector-aggregator, because it shows constant 90% utilization (i.e. the vector_utilization metric) for the gcs sink while no actual processing happens at all: perf.tar.gz
I can't read it myself, but I hope it helps somehow.
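For anyone debugging a similar stall: besides perf, Vector's own API can be enabled so that vector top shows live per-component throughput and utilization. A minimal sketch (the address is the documented default; adjust for your deployment):

```toml
# Enable Vector's GraphQL API so `vector top` can inspect the running instance.
[api]
enabled = true
address = "127.0.0.1:8686"  # default address; expose differently if the aggregator runs in a pod
```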
Vector Version
Vector clients:
Vector aggregator:
though the same was happening with the 0.15.1 version
Vector Configuration File
Vector client:
Vector aggregator:
Expected Behavior
Actual Behavior
Context
We use a Vector -> Vector-aggregator scheme: once the Vector-aggregator receives data from the Vector-clients, it sends the logs on to a GCS bucket. There are 10 Vector clients (bare-metal servers) and 1 Vector aggregator in Kubernetes.
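Roughly, the aggregator side of this topology looks like the sketch below (illustrative only, not the configuration attached to this issue; names, address, and bucket are placeholders, and options such as batching and encoding are omitted):

```toml
# Aggregator sketch: receive events from the Vector clients and write them to GCS.
# Placeholders throughout; see the attached configuration files for the real setup.
[sources.vector_in]
type    = "vector"
address = "0.0.0.0:6000"

[sinks.gcs]
type   = "gcp_cloud_storage"
inputs = ["vector_in"]
bucket = "example-log-bucket"
# batching, buffering, encoding, and auth options omitted
```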
There are 2 issues we faced:
Vector-clients keep more and more file descriptors open even for already deleted logs, which leads to "out of space" problems on our servers; though it seems to be a known issue.
The vector_utilization metric reports almost 100% utilization for the gcp_cloud_storage sink on the Vector-aggregator and for the vector sink on the Vector-clients; however, very little data (or no data) actually gets transmitted.
Details
I tried to reproduce the issue and it turned out to be quite similar to what we've seen before.
I marked moments when I took an action:
Actions taken:
restarted all 10 Vector-clients; for some reason this triggered the issue, maybe due to a somewhat increased data rate after a Vector restart
increased the Vector-aggregator pod's limits (2 cpu, 4Gi mem -> 3 cpu, 5Gi mem) and did a pod restart. Actually, even before this change it wasn't using all the resources, and afterwards the Vector-aggregator still didn't want to handle all the data it was receiving while still not using all the allocated resources; logs: vector-7c66c785bd-wm97r.log
scaled the Vector-aggregator out to 2 replicas; horizontal scaling worked better than vertical
since Kubernetes alone cannot load-balance long-lived gRPC connections, I restarted all 10 Vector-clients so that they initiate new connections to the new replicas