logstash-plugins / logstash-filter-dns

Apache License 2.0

The dns filter (dns.rb) produces a lot of SocketError messages for reverse lookups #6

Open tungstenostrich opened 9 years ago

tungstenostrich commented 9 years ago

We are using the dns filter in logstash quite heavily to do a reverse lookup for the IP of every incoming log message. We get a lot of "DNS: Encountered SocketError" errors. The SocketError object contains the message "bind: name or service not known".

The IPs the reverse lookups are done for are all properly registered in our DNS and most of the time the dns filter can successfully make the reverse lookup. Running tcpdump I could not find any problems on the network layer while the SocketError in the dns filter appeared.

The volume of messages, and hence of reverse lookups, is roughly 7.3 million per 24h, of which about 700 thousand result in socket errors.

Some more information about our environment:

Enterprise Linux 6.6 with nscd running. Turning nscd on/off doesn't change anything. We also changed the JVM dns caching from completely off to infinite. No effects either.
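For reference, the JVM-side DNS caching mentioned above is normally controlled through these `java.security` properties (the values shown here are illustrative, not our exact settings):

```
# $JAVA_HOME/conf/security/java.security (or the legacy -Dsun.net.inetaddr.ttl flag)
networkaddress.cache.ttl=-1            # -1 = cache forever, 0 = no caching
networkaddress.cache.negative.ttl=10   # seconds to cache failed lookups
```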

nguillaumin commented 9 years ago

I'm also experiencing problems with this plugin, but I'm not getting any error messages. Did you enable a debug flag or something? I was able to confirm my problem comes from this plugin: when I remove it, everything works fine. It seems not to handle high load very well.

Looking at the code, it's using resolv, which seems to be a pure-Ruby DNS client. That probably explains why it's unaffected by the JVM DNS settings (it doesn't call the Java APIs for DNS resolution) and by nscd (it doesn't use the libc resolver).
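To illustrate: a minimal sketch of how the stdlib `resolv` client is used (the nameserver value here is illustrative, not from the plugin's config):

```ruby
require 'resolv'

# resolv.rb is a pure-Ruby DNS client: it builds and parses DNS packets
# itself over its own sockets, so neither the JVM's DNS cache nor the
# libc resolver (and hence nscd) is ever involved.
resolver = Resolv::DNS.new(nameserver: ['127.0.0.1'])

# A reverse lookup, as the dns filter does per event; this raises
# Resolv::ResolvError on NXDOMAIN and can time out under load:
# name = resolver.getname('192.0.2.10').to_s
```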

I'm not sure what a good way forward would be. Perhaps use a DNS proxy-cache (if such a thing exists) or implement a cache inside the logstash plugin itself? It would work well for my use case where I use this plugin only to resolve my backend server names and I have only a hundred of them or so. It wouldn't work that well for resolving public IP addresses (from web logs for example) though...
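A hedged sketch of that in-plugin cache idea: memoize reverse lookups in a bounded Hash in front of `Resolv::DNS`. The names here (`ReverseCache`, `MAX_ENTRIES`) are hypothetical and not part of the plugin:

```ruby
require 'resolv'

# Illustrative only: cache reverse-lookup results so a small, repeating
# set of IPs (e.g. backend servers) hits DNS once instead of per event.
class ReverseCache
  MAX_ENTRIES = 10_000

  def initialize(resolver = Resolv::DNS.new)
    @resolver = resolver
    @cache = {}
  end

  # Return the name for ip, resolving (and caching) on a miss.
  def getname(ip)
    return @cache[ip] if @cache.key?(ip)
    @cache.clear if @cache.size >= MAX_ENTRIES # crude wholesale eviction
    @cache[ip] = @resolver.getname(ip).to_s
  rescue Resolv::ResolvError
    @cache[ip] = nil # cache negative answers too, so NXDOMAINs don't repeat
  end
end
```

For public web-log IPs the hit rate would be poor, as noted above, so a bounded size and negative caching matter more there.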

tungstenostrich commented 9 years ago

The error messages are only visible when running in debug mode. I modified the source code to log those statements at higher log levels so they appear during non debug mode. I'm going to try using dnsmasq as a dns cache on the local machine. I'll post my experiences here soon.

tungstenostrich commented 9 years ago

After a couple of hours running the logstash dns filter with dnsmasq caching, the socket error rate has come down a little, but it remains at a level that is far from satisfactory.

nguillaumin commented 9 years ago

Thanks for the update.

I'm not very familiar with Ruby but it seems the code is catching an exception when that happens. Is there any way to get more details from the exception, the e variable in the code:

      rescue SocketError => e
        @logger.debug("DNS: Encountered SocketError.",
                      :field => field, :value => raw)
        return

A stacktrace perhaps, or something else? Would e.backtrace work?
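As a standalone check (not plugin code), a `SocketError` does carry both a message and a backtrace, so the rescue block above could log `e.message` and `e.backtrace`:

```ruby
require 'socket'

begin
  # A name under the reserved .invalid TLD cannot resolve, so this
  # raises SocketError just like a failing lookup in the filter would.
  Socket.getaddrinfo('no-such-host.invalid.', nil)
rescue SocketError => e
  # e.message holds the underlying reason, e.g.
  # "getaddrinfo: Name or service not known"
  puts e.message
  # e.backtrace is an array of "file:line" strings, so yes,
  # e.backtrace would work in the @logger call.
  puts e.backtrace.first
end
```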

I think we need to try to collect more info to pinpoint the error cause, and/or narrow down where the bottleneck is.

tungstenostrich commented 9 years ago

We managed to almost eradicate the socket errors, from several hundred thousand down to a few hundred per 24 hours, by using dnsmasq as the DNS cache on the logstash machine. Our /etc/resolv.conf contains only 127.0.0.1 (the locally running dnsmasq) as a nameserver entry, and nscd does no DNS caching at all. In the dns filter configuration in logstash we use the nameserver option, which points to 127.0.0.1 as well.

Our tests with dnsmasq showed much better response times compared to nscd and, not surprisingly, to the plain resolver without any cache. Furthermore, failure handling when a nameserver becomes unavailable is much smoother with dnsmasq than with the two other approaches, and consistent between the scenarios a) the nameserver host as a whole is unavailable and b) only the DNS service is unavailable.

All this leads us to believe that the Ruby DNS implementation is rather picky about response times from the underlying DNS resolver.

yuphing-ong commented 9 years ago

would you be able to share your dnsmasq configuration? I am facing a performance issue with reverse DNS lookups, as mentioned in https://discuss.elastic.co/t/dns-filter-performance-issue/2544

dsvetlov commented 9 years ago

@yuphing-ong, I use the default configuration of dnsmasq (from the CentOS 7 rpm package). It takes its nameservers from resolv.conf and listens on 127.0.0.1:53. Of course you have to set 127.0.0.1 as the nameserver in the DNS filter properties.

yuphing-ong commented 9 years ago

thanks @dsvetlov . I did more testing over the weekend and it's the reverse DNS lookup for external hosts that is slowing everything down, so I will just implement for internal/local hosts, and perhaps work on a batch update for external hosts once/day.

tungstenostrich commented 9 years ago

Our dnsmasq/logstash configuration is as follows:

/etc/dnsmasq.conf:

    no-hosts
    no-resolv
    no-negcache
    strict-order
    server=IP_NS1
    server=IP_NS2
    listen-address=127.0.0.1
    cache-size=300

logstash.conf:

    ...
    dns {
      action     => 'replace'
      timeout    => 60
      nameserver => '127.0.0.1'
      reverse    => ['host_fqdn']
      add_tag    => ['dns_reverse_lookup']
    }
    ...

gmoskovicz commented 8 years ago

@jsvd any recommendation for this? Would adding a timeout work?

It looks like the way we do lookups is not very performant.

jsvd commented 8 years ago

I've done some investigation on this and it seems JRuby's implementation of Resolv is somewhat broken; you can read more about this in https://github.com/jruby/jruby/issues/3659
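Given that, one possible workaround sketch under JRuby (which Logstash runs on) is to bypass the pure-Ruby Resolv and let the JVM's own resolver do the reverse lookup, which would also make the JVM DNS cache settings effective again. The `reverse_lookup` helper name is hypothetical, and the MRI branch exists only so the snippet runs anywhere:

```ruby
require 'resolv'

def reverse_lookup(ip)
  if RUBY_PLATFORM == 'java'
    # On JRuby, delegate to java.net.InetAddress, sidestepping the
    # Resolv issue referenced in jruby/jruby#3659.
    java.net.InetAddress.get_by_name(ip).canonical_host_name
  else
    # Plain-MRI fallback: stdlib Resolv (consults /etc/hosts first).
    Resolv.getname(ip).to_s
  end
end
```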