tungstenostrich opened 9 years ago
I'm also experiencing problems with this plugin, but I'm not getting any error messages. Did you enable a debug flag or something? I was able to confirm my problem comes from this plugin since when I remove it everything works fine. It seems to not handle high load very well.
Looking at the code, it's using resolv, which seems to be a pure-Ruby DNS client. That probably explains why it's unaffected by the JVM DNS settings (it doesn't call the Java APIs for DNS resolution) or by nscd (it doesn't use the libc resolver).
I'm not sure what a good way forward would be. Perhaps use a DNS proxy-cache (if such a thing exists) or implement a cache inside the logstash plugin itself? It would work well for my use case where I use this plugin only to resolve my backend server names and I have only a hundred of them or so. It wouldn't work that well for resolving public IP addresses (from web logs for example) though...
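A cache inside the plugin could be fairly small. Below is only a hedged sketch of the idea, not the plugin's actual code: the class, method names, and TTL are all hypothetical, and it memoizes reverse lookups (including failures, i.e. simple negative caching) in a hash with per-entry expiry.

```ruby
require "resolv"

# Hypothetical in-filter cache (names are illustrative, not from the plugin):
# memoizes reverse lookups in a hash, expiring entries after a TTL.
class CachedResolver
  Entry = Struct.new(:name, :expires_at)

  def initialize(ttl: 300, resolver: Resolv::DNS.new)
    @ttl = ttl
    @resolver = resolver
    @cache = {}
    @mutex = Mutex.new # the filter may run in several worker threads
  end

  def reverse_lookup(ip)
    @mutex.synchronize do
      entry = @cache[ip]
      return entry.name if entry && entry.expires_at > Time.now
    end
    name = begin
      @resolver.getname(ip).to_s
    rescue Resolv::ResolvError, SocketError
      nil # cache the failure too, so a flaky name isn't retried every event
    end
    @mutex.synchronize { @cache[ip] = Entry.new(name, Time.now + @ttl) }
    name
  end
end
```

For a bounded set of backend names (a hundred or so), a plain hash like this never grows large; for public web-log IPs it would need an eviction policy on top.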
The error messages are only visible when running in debug mode. I modified the source code to log those statements at a higher log level so they appear outside debug mode. I'm going to try using dnsmasq as a DNS cache on the local machine and will post my experiences here soon.
After a couple of hours of running the Logstash dns filter with dnsmasq caching, the socket error rate has come down a little, but it remains at a high level, which is far from satisfactory.
Thanks for the update.
I'm not very familiar with Ruby, but it seems the code is catching an exception when that happens. Is there any way to get more details from the exception, the `e` variable in the code:

```ruby
rescue SocketError => e
  @logger.debug("DNS: Encountered SocketError.",
                :field => field, :value => raw)
  return
```

A stacktrace perhaps, or something else? Would `e.backtrace` work?
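For what it's worth, `e.message` and `e.backtrace` are available on any Ruby exception, so the rescue block could log both. A minimal standalone demonstration (the raised message just mimics the one seen in the filter):

```ruby
require "socket" # defines SocketError

begin
  # Simulate the failure the filter sees; any Ruby exception carries
  # both a message and a backtrace.
  raise SocketError, "bind: name or service not known"
rescue SocketError => e
  puts e.message            # => "bind: name or service not known"
  puts e.backtrace.first(3) # file:line entries leading to the raise
end
```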
I think we need to try to collect more info to pinpoint the error cause, and/or narrow down where the bottleneck is.
We managed to almost eradicate the socket errors, from several hundred thousand down to a few hundred per 24 hours, by using dnsmasq as the DNS cache on the Logstash machine. Our /etc/resolv.conf contains only 127.0.0.1 (the locally running dnsmasq) as a nameserver entry, and nscd does no DNS caching at all. In the dns filter configuration in Logstash we use the nameserver option, which points to 127.0.0.1 as well.
Our tests with dnsmasq showed much better response times compared to nscd and, not surprisingly, to the plain resolver without any cache. Furthermore, failure handling when a nameserver becomes unavailable is much smoother with dnsmasq than with the two other approaches, and consistent between the scenarios a) the nameserver host as a whole is unreachable and b) only the DNS service is unavailable.
All this leads us to believe that the Ruby DNS implementation is rather picky about response times from the underlying DNS resolution system.
Would you be able to share your dnsmasq configuration? I am facing a performance issue with reverse DNS lookups, as mentioned in https://discuss.elastic.co/t/dns-filter-performance-issue/2544
@yuphing-ong, I use the default configuration of dnsmasq (from the CentOS 7 rpm package). It uses the nameservers from resolv.conf and listens on 127.0.0.1:53. Of course you have to set 127.0.0.1 as the nameserver in the dns filter properties.
Thanks @dsvetlov. I did more testing over the weekend, and it's the reverse DNS lookups for external hosts that are slowing everything down, so I will implement this only for internal/local hosts and perhaps work on a once-a-day batch update for external hosts.
Our dnsmasq/logstash configuration is as follows:
/etc/dnsmasq.conf:

```
no-hosts
no-resolv
no-negcache
strict-order
server=IP_NS1
server=IP_NS2
listen-address=127.0.0.1
cache-size=300
```

logstash.conf:

```
...
dns {
  action => 'replace'
  timeout => 60
  nameserver => '127.0.0.1'
  reverse => ['host_fqdn']
  add_tag => ['dns_reverse_lookup']
}
...
```
@jsvd any recommendation for this? Would adding a timeout work?
It looks like the way we do lookups is not very performant?
I've done some investigation on this, and it seems JRuby's implementation of Resolv is somewhat broken; you can read more about this in https://github.com/jruby/jruby/issues/3659
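One way to check whether this is a JRuby/Resolv problem rather than a Logstash one is to hammer Resolv directly with concurrent reverse lookups and count failures. This is a hedged sketch: the function name, thread/lookup counts, and target IP are arbitrary choices, not from the plugin.

```ruby
require "resolv"

# Stress Resolv with concurrent reverse lookups, independent of Logstash,
# and count how many raise. Run under both MRI and JRuby to compare.
def stress_resolv(ip, threads: 10, lookups: 20)
  errors = Queue.new # thread-safe counter of failed lookups
  workers = Array.new(threads) do
    Thread.new do
      resolver = Resolv::DNS.new # one client per thread, as the filter would use
      resolver.timeouts = 2      # keep a misbehaving nameserver from stalling the test
      lookups.times do
        begin
          resolver.getname(ip)
        rescue SocketError, Resolv::ResolvError => e
          errors << e
        end
      end
    end
  end
  workers.each(&:join)
  errors.size
end
```

If the error count climbs with the thread count on JRuby but not on MRI, that would point at the JRuby Resolv issue linked above rather than at the filter itself.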
We are using the dns filter in logstash quite heavily to do a reverse lookup for the IP of every incoming log message. We get a lot of "DNS: Encountered SocketError" errors. The SocketError object contains the message "bind: name or service not known".
The IPs the reverse lookups are done for are all properly registered in our DNS and most of the time the dns filter can successfully make the reverse lookup. Running tcpdump I could not find any problems on the network layer while the SocketError in the dns filter appeared.
The volume of messages, and hence of reverse lookups, is roughly 7.3 million per 24 hours, of which about 700 thousand (roughly 10%) result in socket errors.
Some more information about our environment:
Enterprise Linux 6.6 with nscd running. Turning nscd on/off doesn't change anything. We also changed the JVM DNS caching from completely off to infinite, with no effect either.