vectordotdev / vector

A high-performance observability data pipeline.
https://vector.dev
Mozilla Public License 2.0

Is the file tail process bound to a CPU core? #3379

Closed bhattchaitanya closed 2 years ago

bhattchaitanya commented 4 years ago

Is the file tail process bound to a single CPU core, or can it use all available CPU cores on the node? The benchmark results are not clear on this. The main issue with most file-tailing agents, such as Fluentd and Fluent Bit, is that they cannot use multiple CPU cores, so throwing more hardware at them doesn't help; they simply cannot scale. If Vector's file tailing can scale across cores, the results should highlight that; otherwise we should at least add a disclaimer about the limitation.

bhattchaitanya commented 4 years ago

Some additional context on this request: the key differentiator of Vector is that it claims to be more reliable (I really hope it is), probably because it is written in Rust, a language that does not depend on a bulky runtime or interpreter the way Fluentd (Ruby) does. However, the benchmark comparisons don't give enough detail on the performance characteristics. In particular, if Vector can process logs in an unordered manner by automatically spawning multiple workers as it tails, to achieve higher throughput, that makes a huge difference on platforms such as Kubernetes, where 15-20 application pods can be packed onto a single large node, pushing logging throughput above 80 MB/s. The benchmark is not clear about this.

lukesteensen commented 4 years ago

Hi @bhattchaitanya! In its current state, Vector will use up to one CPU core per configured file source. We are currently in the planning stages for expanding this to multiple cores per source.

The important thing to note here is "per configured file source". While we recognize that it's not the best UX, it is possible to use more cores right now by splitting the files you want to tail across multiple file sources in your config. This will behave as you describe, spawning multiple independent threads that each process their own files.

As part of our planned upgrades to the file source, we want to make that partitioning step automatic. Then you'd be able to configure a simple single file source and get the performance you're looking for. The challenge is to do that in a way that doesn't take up extra resources for low-volume use cases and balances work evenly for high-volume use cases.

bhattchaitanya commented 4 years ago

Thanks for the suggested workaround. On a Kubernetes node, I have deployed Vector as a DaemonSet and configured it to watch the /var/log/containers/ folder; as you might know, container stdout logs are dynamic: as pods are scheduled and terminated, the contents of /var/log/containers/* change. Given this dynamic scenario, how do I configure Vector to arbitrarily pick a file as a unique source? Both the number of files and the files themselves change over time on the node.

jszwedko commented 4 years ago

It looks like Vector supports character globbing in the `include` option of the file source, so, assuming https://stackoverflow.com/a/47916812 is right and the files are at /var/lib/docker/containers/<container-id>/<container-id>-json.log, one thing you could try would be something like:

include = ["/var/lib/docker/containers/[01234567]*/*.log"]

and

include = ["/var/lib/docker/containers/[89abcdef]*/*.log"]

to partition the files into two sets, assuming the first character of the container ID is roughly random.
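Put together, the two-source split might look like the following config sketch. The source and sink names are illustrative, the console sink is only a stand-in for a real destination, and the exact option names should be checked against the Vector docs for your version:

```toml
# Split the container logs across two file sources so Vector can tail
# them on two cores; the first hex digit of the container ID picks the
# partition.
[sources.containers_low]
type = "file"
include = ["/var/lib/docker/containers/[01234567]*/*.log"]

[sources.containers_high]
type = "file"
include = ["/var/lib/docker/containers/[89abcdef]*/*.log"]

# Downstream components can consume both partitions together.
[sinks.out]
type = "console"
inputs = ["containers_low", "containers_high"]
encoding.codec = "json"
```

Since the two character classes together cover all sixteen hex digits with no overlap, every container log file lands in exactly one source.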

bhattchaitanya commented 4 years ago

Thanks guys, will try this out and let you know how it goes.

jszwedko commented 2 years ago

Closing since the question here was answered: yes.