vectordotdev / vector

A high-performance observability data pipeline.
https://vector.dev
Mozilla Public License 2.0

Large number of files leads to slow throughput to Kafka #15276

Status: Open · WilliamEricCheung opened this issue 1 year ago

WilliamEricCheung commented 1 year ago

Problem

Background: Hello, we are migrating service logs through a vector (1 host) -- kafka (3 hosts) -- logstash (3 hosts) -- opensearch (1 kibana host) pipeline. Our services write about 25 gzip files per hour into a single directory, each about 900 MB, and a script removes logs more than one day old from this directory to reduce storage pressure. When we started vector at Nov 16, 2022 @ 9:00:00, throughput was steady until 12:00:00; during those 3 hours we saw about 450M index hits per hour in Kibana. After 12 PM, however, the hit rate dropped rapidly: 100M at 3 PM, 80M at 6 PM, 70M at 9 PM, 60M at midnight on Nov 17, 50M at 3 AM, and 40M at 6 AM.

[screenshot: kibana hit-rate result]

Our thinking: Every hour our services write 25 files, so over a whole day we accumulate 24 × 25 = 600 files. Even though we use a script to keep only one day of logs, the total file size is still very heavy: 600 × 900 MB ≈ 540 GB.

Question

  1. Are there any source constraints for Vector, such as a limit on file count or total file size? (See the sketch after these questions for the tuning options we found.)
  2. Should we change our script to keep only 12 or even 6 hours of logs to reduce the pressure on Vector? And what if that does not help?
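
For reference, here is a sketch of the file source options we found that look related to scanning a large set of files. The option names are from the documentation we consulted and may not all exist in our Vector version:

[sources.gz_logs]
type = "file"
include = [ "/local/vectorLog/requests*.gz" ]
# Drain the oldest files first instead of interleaving all discovered files.
oldest_first = true
# Re-scan the include glob less often; "discovery" dominates our CPU ratios below.
glob_minimum_cooldown_ms = 10000
# Skip files not modified within the last 10 minutes (as in our config).
ignore_older_secs = 600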

Basic Information
Source format: gzip files
Files directory: /local/vectorLog
File name format: /local/vectorLog/requests.log.yyyy-MM-dd-HH.gz
File size: ~900 MB

Configuration

# Read gzipped request log files
[sources.gz_logs]
type = "file"
max_line_bytes = 2024000
ignore_older_secs = 600
include = [ "/local/vectorLog/requests*.gz" ]
multiline.mode = "halt_with"
multiline.condition_pattern = "---------------------------------------------*"
multiline.timeout_ms = 5000
multiline.start_pattern = '^\{'
# Parse gz_logs
[transforms.parse_logs]
type = "remap"
inputs = ["gz_logs"]
source = """
# Split the multiline event, join the first two lines, and parse the result as JSON.
lines, err = split(.message, "\n")
.message = lines[0] + lines[1]
.message = to_string(.message)
.message = parse_json!(.message)
"""
# Send parsed logs to Kafka
[sinks.print]
type = "kafka"
inputs = [ "parse_logs" ]
bootstrap_servers = "AAA:9092,BBB:9092,CCC:9092"
topic = "TOPIC"
compression = "none"
[sinks.print.encoding]
codec = "json"
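
A note on the transform above: `lines[0] + lines[1]` assumes every multiline event contains at least two lines. A more defensive sketch of the same remap (the component id `parse_logs_safe` is just a placeholder):

[transforms.parse_logs_safe]
type = "remap"
inputs = ["gz_logs"]
source = """
# split! raises a runtime error if .message is not a string
lines = split!(.message, "\n")
# Join the first two lines when there are at least two; otherwise keep the single line
if length(lines) > 1 {
  .message = to_string!(lines[0]) + to_string!(lines[1])
} else {
  .message = to_string!(lines[0])
}
.message = parse_json!(.message)
"""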

Version

vector 0.23.3 (x86_64-unknown-linux-gnu af8c9e1 2022-08-10)

Debug Output

### 96 log files: continue source transform sink
2022-11-16T09:19:38.546454Z DEBUG source{component_kind="source" component_id=gz_logs component_type=file component_name=gz_logs}:file_server: file_source::file_server: event_throughput=230.000/sec bytes_throughput=1.283m/sec ratios={"discovery": 0.9952969, "other": 0.0002865836, "reading": 0.0043958127, "sending": 1.8007844e-5, "sleeping": 2.8511367e-6}
2022-11-16T09:19:38.549660Z DEBUG transform{component_kind="transform" component_id=parse_logs component_type=remap component_name=parse_logs}: vector::utilization: utilization=0.028800348759819846
2022-11-16T09:19:39.471776Z DEBUG vector::internal_events::file::source: Files checkpointed. count=96 duration_ms=2
2022-11-16T09:19:39.797520Z DEBUG source{component_kind="source" component_id=gz_logs component_type=file component_name=gz_logs}:file_server: file_source::file_server: event_throughput=288.000/sec bytes_throughput=1.504m/sec ratios={"discovery": 0.9959729, "other": 0.000243444, "reading": 0.0037625295, "sending": 1.9564282e-5, "sleeping": 2.7171284e-6}
2022-11-16T09:19:40.487837Z DEBUG vector::internal_events::file::source: Files checkpointed. count=96 duration_ms=4
2022-11-16T09:19:40.645823Z DEBUG sink{component_kind="sink" component_id=print component_type=kafka component_name=print}: vector::utilization: utilization=0.013442464730048877

### 287 log files: continue source transform sink
2022-11-16T17:45:30.182521Z DEBUG source{component_kind="source" component_id=gz_logs component_type=file component_name=gz_logs}:file_server: file_source::file_server: event_throughput=450.000/sec bytes_throughput=2.360m/sec ratios={"discovery": 0.9924744, "other": 0.00024555708, "reading": 0.007268966, "sending": 9.853384e-6, "sleeping": 1.3856921e-6}
2022-11-16T17:45:30.191391Z DEBUG transform{component_kind="transform" component_id=parse_logs component_type=remap component_name=parse_logs}: vector::utilization: utilization=0.0444073165530334
2022-11-16T17:45:30.472922Z DEBUG vector::internal_events::file::source: Files checkpointed. count=287 duration_ms=1
2022-11-16T17:45:30.647038Z DEBUG sink{component_kind="sink" component_id=print component_type=kafka component_name=print}: vector::utilization: utilization=0.023487958650561292
2022-11-16T17:45:31.479022Z DEBUG vector::internal_events::file::source: Files checkpointed. count=287 duration_ms=0

### 477 log files: looks like it is not working
2022-11-17T02:30:46.375905Z DEBUG source{component_kind="source" component_id=gz_logs component_type=file component_name=gz_logs}:file_server: file_source::file_server: event_throughput=9.267k/sec bytes_throughput=48.290m/sec ratios={"discovery": 0.831037, "other": 0.0038224799, "reading": 0.16500683, "sending": 0.00011769667, "sleeping": 1.6122965e-5}
2022-11-17T02:30:47.015750Z DEBUG vector::internal_events::file::source: Files checkpointed. count=477 duration_ms=12
2022-11-17T02:30:47.405331Z DEBUG source{component_kind="source" component_id=gz_logs component_type=file component_name=gz_logs}:file_server: file_source::file_server: event_throughput=8.586k/sec bytes_throughput=44.756m/sec ratios={"discovery": 0.8268399, "other": 0.0035622944, "reading": 0.16947487, "sending": 0.00010818543, "sleeping": 1.4872786e-5}
2022-11-17T02:30:48.027768Z DEBUG vector::internal_events::file::source: Files checkpointed. count=477 duration_ms=2
2022-11-17T02:30:48.418591Z DEBUG source{component_kind="source" component_id=gz_logs component_type=file component_name=gz_logs}:file_server: file_source::file_server: event_throughput=8.692k/sec bytes_throughput=45.377m/sec ratios={"discovery": 0.8270361, "other": 0.0035936986, "reading": 0.16924557, "sending": 0.0001098454, "sleeping": 1.4949319e-5}


WilliamEricCheung commented 1 year ago

Since no one has commented on my post, we modified our scripts to keep only about 100 log files (roughly 4 hours of logs). Vector now runs smoothly, but this is not the solution we want for our business.
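
As an aside, one thing we may try instead of the external cleanup script is the file source's `remove_after_secs` option, which, if we read the docs correctly, lets Vector delete a file after it has been fully read and idle for the given time. A sketch:

[sources.gz_logs]
type = "file"
include = [ "/local/vectorLog/requests*.gz" ]
# If we read the docs correctly, delete each file once it has been fully
# read and idle for 1 hour, keeping the watched set small without a script.
remove_after_secs = 3600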

mhenniges commented 1 year ago

I see somewhat similar behavior with a simple file -> s3 pipeline as well. Does anybody from Vector have advice here?

pront commented 11 months ago

> Since no one has commented on my post, we modified our scripts to keep only about 100 log files (roughly 4 hours of logs). Vector now runs smoothly, but this is not the solution we want for our business.

Hi @WilliamEricCheung, my apologies for not replying sooner. Since this seems like a complex issue, we will need to dive deeper to determine whether there is an actual bottleneck (a bug that can be fixed) or whether this is a case where scaling up is the solution.

As an easy first step, I suggest adding an internal metrics source and sharing the metrics here.
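
A minimal sketch of that setup (the component ids are placeholders; a prometheus_exporter sink would work equally well in place of the console sink):

[sources.vector_metrics]
type = "internal_metrics"

# Write Vector's own metrics to stdout for inspection.
[sinks.metrics_out]
type = "console"
inputs = ["vector_metrics"]
encoding.codec = "json"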