vectordotdev / vector

A high-performance observability data pipeline.
https://vector.dev
Mozilla Public License 2.0

Vector Lookup address to DNS even if TTL is higher #21450

Open manavadariakevin opened 1 month ago

manavadariakevin commented 1 month ago


Problem

We are running Vector 0.40.0 (package vector-0.40.0-1.x86_64) on Linux with the configuration below. It sends logs to our Vector aggregators, whose endpoint is served through Envoy Proxy.

sinks:
  vector:
    type: vector
    healthcheck: False
    address: "https://vector-nonprod.abc.com"
    compression: True
    inputs:
      - parsing
      - nginx
    batch:
      max_bytes: 10000
      max_events: 10000
    buffer:
      type: "disk"
      max_size: 268435488
    request:
      rate_limit_num: 30
      retry_attempts: 100
      timeout_secs: 5
      retry_max_duration_secs: 5
      retry_initial_backoff_secs: 1
      retry_jitter_mode: Full

Vector keeps querying DNS for vector-nonprod.abc.com and generates far too many queries. We would expect it to cache DNS results itself, or to use the server's resolver configuration, instead of going directly to the DNS server every time.

Below are some of the connections to our DNS server, and this is only nonprod. In prod we see roughly 500 connections to DNS and about 300 queries per minute. This is putting heavy load on our DNS servers. If there is a way to fix or work around this, please advise.

netstat -n | grep 254
udp 0 0 10.10.10.17:28174 10.10.10.254:53 ESTABLISHED
udp 0 0 10.10.10.17:36843 10.10.10.254:53 ESTABLISHED
udp 0 0 10.10.10.17:47618 10.10.10.254:53 ESTABLISHED
udp 0 0 10.10.10.17:59961 10.10.10.254:53 ESTABLISHED

Configuration

sinks:
  vector:
     type: vector
     #healthcheck: False
     address: "https://vector-nonprod.abc.com:443"
     compression: True
     inputs:
       - parsing
       - nginx
     batch:
       max_bytes: 10000
       max_events: 10000
     buffer:
       type: "disk"
       max_size: 268435488
     request:
       rate_limit_num: 30
       retry_attempts: 100
       timeout_secs: 5
       retry_max_duration_secs: 5
       retry_initial_backoff_secs: 1
       retry_jitter_mode: Full

Version

vector 0.40.0 (x86_64-unknown-linux-gnu 1167aa9 2024-07-29 15:08:44.028365803)

Debug Output

No response

Example Data

No response

Additional Context

No response

References

No response

manavadariakevin commented 1 month ago

I also tried specifying the port explicitly in the address, "https://vector-nonprod.abc.com:443", but the behavior is the same.

jszwedko commented 1 month ago

I think we discussed this in Discord a bit. I mentioned there that Vector does a DNS lookup every time it initiates a connection. However, even given that, it seems like you are seeing many more lookups than might be expected (it seems unlikely, though maybe possible, that Vector is initiating 500 connections per second).

Regardless, it does seem prudent for Vector to do DNS caching so I think adding that would be a reasonable way to address this issue.

killkill commented 1 month ago

Yes, I am hitting the same problem. My config:

[sinks.out]
type = "loki"
inputs = [ "remove_kafka_fields" ]
endpoint = "http://distributor-loki.my.com/"
out_of_order_action = "accept"
remove_timestamp = true
tenant_id = "myapp"

Watching the traffic with tcpdump:

tcpdump -vvn port 53

shows a very large number of DNS resolutions.

manavadariakevin commented 4 days ago

Just to add more here: we are also using Splunk HEC as a sink, and it has a similar issue. We see too many DNS connections and queries, which is heavy on our DNS setup. It would be good to get a fix for this.

pront commented 4 days ago

Hello, we don't have the capacity to get to this right now. We always welcome PRs and we do our best to review them ASAP.

> Regardless, it does seem prudent for Vector to do DNS caching so I think adding that would be a reasonable way to address this issue.

In this instance, the solution Jesse mentioned seems like the best way to fix this issue. I would start by looking at https://github.com/vectordotdev/vector/blob/master/src/dns.rs, potentially introducing a caching layer there. We might also want to expose some new config options for this, such as turning caching on/off and a TTL for cache entries.
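For anyone picking this up, here is a rough sketch of what such a caching layer could look like. It is not based on the current contents of src/dns.rs: the names `DnsCacheOptions`, `CachingResolver`, `enabled`, `ttl`, and `system_lookup` are all hypothetical, and the sketch uses the blocking standard-library resolver, whereas a real change would have to fit Vector's async code. Because `std::net` does not expose DNS record TTLs, it assumes a fixed, configured TTL per cache entry.

```rust
use std::collections::HashMap;
use std::io;
use std::net::{IpAddr, ToSocketAddrs};
use std::time::{Duration, Instant};

/// Hypothetical options; these are NOT existing Vector settings.
#[derive(Clone, Copy, Debug)]
pub struct DnsCacheOptions {
    /// Turns caching on or off.
    pub enabled: bool,
    /// How long a cached answer stays valid.
    pub ttl: Duration,
}

/// Wraps any lookup function `F` and caches its successful answers per host.
pub struct CachingResolver<F> {
    options: DnsCacheOptions,
    lookup: F,
    entries: HashMap<String, (Instant, Vec<IpAddr>)>,
}

impl<F> CachingResolver<F>
where
    F: FnMut(&str) -> io::Result<Vec<IpAddr>>,
{
    pub fn new(options: DnsCacheOptions, lookup: F) -> Self {
        Self { options, lookup, entries: HashMap::new() }
    }

    pub fn resolve(&mut self, host: &str) -> io::Result<Vec<IpAddr>> {
        // With caching disabled, every call goes to the underlying resolver.
        if !self.options.enabled {
            return (self.lookup)(host);
        }
        // Serve from the cache while the entry is younger than the TTL.
        if let Some((stored_at, addrs)) = self.entries.get(host) {
            if stored_at.elapsed() < self.options.ttl {
                return Ok(addrs.clone());
            }
        }
        // Miss or expired entry: look up again and store the fresh answer.
        // Failed lookups are returned to the caller and never cached.
        let addrs = (self.lookup)(host)?;
        self.entries
            .insert(host.to_string(), (Instant::now(), addrs.clone()));
        Ok(addrs)
    }
}

/// Blocking lookup through the operating system's resolver.
pub fn system_lookup(host: &str) -> io::Result<Vec<IpAddr>> {
    Ok((host, 0u16).to_socket_addrs()?.map(|addr| addr.ip()).collect())
}
```

With something like `CachingResolver::new(DnsCacheOptions { enabled: true, ttl: Duration::from_secs(30) }, system_lookup)`, repeated connection attempts to the same host within 30 seconds would reuse the cached addresses instead of issuing a new query each time. Taking the lookup function as a parameter is only a convenience so the cache can be tested without network access.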