logstash-plugins / logstash-filter-dns

Apache License 2.0

DNS: timeout on resolving address. #40

Closed tomsommer closed 6 years ago

tomsommer commented 6 years ago

After updating to 3.0.7, the plugin starts spamming DNS: timeout on resolving address. {:field=>"host", :value=>"172.20.54.3"} errors in the logs after some time.

If I do a manual lookup on the indexer, the IP resolves fine. So it seems like some kind of cache or memory buffer is breaking lookups permanently. It seems after 1 timeout, no new DNS queries are even made (I checked with tcpdump).
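For reference, the equivalent check from plain Ruby (outside Logstash) is something like this minimal sketch:

require "resolv"

# one-off reverse lookup, bypassing the filter and any of its caches
puts Resolv.getname("172.20.54.3")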

If I restart the indexer, the plugin works fine for a while until it breaks again.

Config:

    dns {
        reverse => ["host"]
        action => "replace"
    }
jsvd commented 6 years ago

which version of logstash is being used with filter version 3.0.7?

also, are you pointing the filter to an internal or local dns server? do you have something like dnsmasq installed?

are there any other network related exceptions being logged?

tomsommer commented 6 years ago

logstash-6.2.2-1.noarch

The filter just uses the system's resolvers, which point at PowerDNS running on "remote" internal servers. There are no other exceptions at all.

This problem also clogs the pipeline, since all requests are suddenly marked as timed out and retried until they fail (adding +1 second per event).
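For reference, that per-event delay comes from the filter's retry-on-timeout behaviour; if I have the option names right (timeout and max_retries - treat these as assumptions), they could be turned down while debugging, something like:

    dns {
        reverse => ["host"]
        action => "replace"
        timeout => 0.5      # assumed option: per-lookup timeout in seconds
        max_retries => 0    # assumed option: don't retry a timed-out lookup
    }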

jake2184 commented 6 years ago

I was seeing this behaviour on 3.0.7 (Logstash 6.2.3).

I have upgraded to v3.0.9 and am still seeing the behaviour - I get many DNS: timeout on resolving address errors for a valid, resolvable IP. This is causing backlogs, as Logstash cannot process the events as quickly as they are coming in (<100 a second).

I suspect that if the solution above was to introduce a cache, it would not solve the problem - I am using this to process proxy logs with well over 100 IPs being resolved, so I can believe the cache is overflowing.

I am also running PDNS servers, so I will investigate there as well.

nithril commented 6 years ago

Same issue with 3.0.9. It keeps timing out until I restart Logstash.

nithril commented 6 years ago

Logstash is using JRuby 9.1.13.

Could it be related to this issue: https://github.com/jruby/jruby/commit/ada05fa601c29ff5491fb53434053aa597ccc97b#diff-6d2422fab9324bf20363957b293b3658 ?

It may explain why DNS resolution works properly for some time and then blocks.

jsvd commented 6 years ago

very good find @nithril ! Is there any chance you could apply that patch (https://github.com/jruby/jruby/commit/ada05fa601c29ff5491fb53434053aa597ccc97b.patch) and confirm the issue is gone?

[edit] I tested that the patch applies cleanly to logstash 6.2.4:

/tmp/logstash-6.2.4/vendor/jruby % wget https://github.com/jruby/jruby/commit/ada05fa601c29ff5491fb53434053aa597ccc97b.patch
/tmp/logstash-6.2.4/vendor/jruby % patch -p1 < ada05fa601c29ff5491fb53434053aa597ccc97b.patch
patching file lib/ruby/stdlib/resolv.rb
/tmp/logstash-6.2.4/vendor/jruby %
nithril commented 6 years ago

Patch applied. I will be able to check next week.

nithril commented 6 years ago

Unfortunately it does not solve the issue. The FD remains stable, no leak.

jsvd commented 6 years ago

@nithril can you confirm the exact versions that trigger this? we know that >= 3.0.7 on logstash 6.2.2 triggers, but it'd be great if we could narrow this down.

For example, version 3.0.7 didn't add anything new, so the issue isn't coming from it. It'd be great if we, using the same Logstash version, could find a dns filter version < 3.0.7 where the timeouts don't happen.

Also, you may want to know that version 3.0.10 fixes a serious concurrency issue that greatly(!) speeds up the performance when the cache is used.

jake2184 commented 6 years ago

I'm happy to try some older versions, but it's hard to trigger - it seems to start occurring after an indeterminate amount of time.

nithril commented 6 years ago

The dns filter version is 3.0.9.

All the timeout errors come from Timeout::Error and not ResolvTimeout.

The interesting part is that the CPU is stuck at 100% (one core), and jstack seems to confirm that the process is stuck in the DNS.allocate_request_id loop.
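For context, the allocator in resolv.rb looks roughly like this (paraphrased from the Ruby stdlib of that era; the exact JRuby copy may differ slightly):

def self.allocate_request_id(host, port)
  id = nil
  RequestIDMutex.synchronize {
    # request ids are tracked per [host, port] key and handed out at random
    h = (RequestID[[host, port]] ||= {})
    begin
      id = rand(0x0000..0xffff)   # 16-bit DNS request id
    end while h[id]               # retry until an unused id is found
    h[id] = true
  }
  id
end

If ids are never freed for a given key, that begin/end while loop eventually has no free id left to find and spins forever, which would match both the pegged core and the Timeout::Error aborting it.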

jsvd commented 6 years ago

> is stuck in the DNS.allocate_request_id loop

if this is the case then I suggest attempting another monkeypatch to resolv.rb to see if the problem goes away. While the commit you mentioned addressed the leak, there were about 4 other commits that touched the structure of resolv.rb and also mention a leak.

@nithril if you could just replace the local logstash resolv.rb file with https://raw.githubusercontent.com/jruby/jruby/ruby-2.5/lib/ruby/stdlib/resolv.rb we'd be able to check if the latest jruby resolv.rb fixes this problem.
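For example, assuming the same install layout as the patch example above:

/tmp/logstash-6.2.4/vendor/jruby % cp lib/ruby/stdlib/resolv.rb lib/ruby/stdlib/resolv.rb.orig
/tmp/logstash-6.2.4/vendor/jruby % wget https://raw.githubusercontent.com/jruby/jruby/ruby-2.5/lib/ruby/stdlib/resolv.rb -O lib/ruby/stdlib/resolv.rb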

Thanks for your patience so far.

nithril commented 6 years ago

The resolution still does not work, not sure why. Is there a way to display JRuby compilation/interpreter errors?

colinsurprenant commented 6 years ago

I have yet to be able to reproduce that problem locally. I tried on:

My test setup is this:

What I see is that there are a lot of errors like

 DNS: timeout on resolving address. {:field=>"clientip", :value=>"xxx"}

But I am assuming these are somewhat normal; more importantly, at no time do I see the pipeline stop.

@nithril @jake2184 it would be great if you could run the same test with the same data and report what you see here. On my side I will try with LS 5 to see if the same timeouts actually occur.

colinsurprenant commented 6 years ago

I added "5.6.3 with 3.0.6" above, which shows a similar behaviour.

jake2184 commented 6 years ago

The pipeline doesn't stop completely, it's still running and logging failures. The issue is that due to the endless timeouts, it cannot keep up with the incoming data stream. In my case this is only around 20-30 events a second. The queue (I have persistent queue configured for logstash) then grows until I restart logstash.

It's worth noting that I don't believe this is caused by an actual DNS timeout, as hinted at by @nithril . We run a high availability internal DNS system, and are only resolving internal IP addresses. If we had anywhere near as many 'timeouts' as Logstash is recording we would be having many other problems all over our system.

colinsurprenant commented 6 years ago

@jake2184 thanks for the clarifications - I was in fact looking for a pipeline stall. So, concentrating on the timeouts only, note that I basically saw the same behaviour between LS 5.6.3+DNS 3.0.6 and LS 6.2.2+DNS 3.0.7/3.0.9.

To help correctly diagnose and/or test for potential solutions I would like to find a combination that does not show the timeout behaviour using my minimal setup + data as shared above.

Do you know of a LS+DNS combination that was known to be working correctly, and could you run the above config+data to verify on your end that it behaves the same? (I will test it too locally, but knowing it produces the same result on your side will help.)

jake2184 commented 6 years ago

I am also getting 'DNS: timeout on resolving address' when DNS returns NXDOMAIN. Does this mean it isn't accepting NXDOMAIN as a valid response, and so retries?

Apologies for not being too helpful with the debugging, I've had a look at the Ruby source for the filter & resolv library, but I don't know Ruby so can't understand if there are any problems.

colinsurprenant commented 6 years ago

@jake2184 I am trying locally and NXDOMAIN is handled correctly; it does not result in a timeout.

You really don't need to know Ruby to help here; the important thing for me at this point is to be able to reproduce the problem, and to reproduce when the problem does not occur. For that I'll repeat my question:

Do you know of a LS+DNS combination that was known to be working correctly, and could you run the above config+data to verify on your end that it behaves the same? (I will test it too locally, but knowing it produces the same result on your side will help.)

jake2184 commented 6 years ago

I've taken a bit of time to set up a test environment - sorry, I should have done this earlier when you're so willing to help.

I am using my own config file and data, editing the input/output to be stdin/stdout, and removing the DNS filter cache parameters. I am now able to reliably cause the problem. As indicated in the bug report above, I am almost certain something is not being freed in resolv.rb. I am able to process just under 65536 events before the pipeline stops processing any more and only prints error messages. The exact number (counted from the dots of the output codec, which I assume is accurate) is around 64100-64400, so roughly 65536 - 1024?
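For reference, the stripped-down test pipeline is roughly this shape (the field name here is a placeholder; the real config uses its own fields):

input { stdin {} }
filter {
  dns {
    reverse => ["message"]
    action => "replace"
  }
}
output { stdout { codec => dots } }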

I have tested the following versions:

LS 5.6.3 + DNS 3.0.6 - No problem
LS 6.2.1 + DNS 3.0.7 - Problem
LS 6.2.3 + DNS 3.0.7/3.0.9 - Problem

Manual fixes I have attempted on LS 6.2.3 and DNS 3.0.9 that have not worked:

jake2184 commented 6 years ago

I've had a quick look at the Apache log you are testing with. It has <40,000 unique IP addresses, so it's possible you're not doing enough DNS lookups to trigger the error, if you have any form of caching enabled?

colinsurprenant commented 6 years ago

@jake2184 thanks a lot for your followup! This is super useful.

> It has <40,000 unique IP addresses, so it's possible you're not doing enough DNS lookups to trigger the error

Oh, this is interesting. ok, I will build a > 64k unique IP addresses file and use that and report back my finding.

> if you have any form of caching enabled?

Not at the filter level - but I locally installed dnsmasq with the default configuration, which I assume does caching. I guess that should not be relevant though, since the dns filter still has to call the resolver, which will hit dnsmasq.

colinsurprenant commented 6 years ago

This is really mysterious. I built a file with ~3M unique IPs, one per line and ran this test:

$ cat .uniq_ips.txt | bin/logstash -e 'filter{dns{reverse=>["message"]}} output{stdout{codec=>dots}}'

and at the same time measured the pipeline throughput using the measure_performance.rb script in https://github.com/elastic/logstash-benchmark-tools/tree/master/pq_blog and observed the following:

All my tests processed a few hundred thousand events to make sure I was well over the 64k mark. Note that the config above does not use any caching.

This suggests that something else is really going on. Is it my usage of dnsmasq locally? Could it be OS related? (I am running on a MBP + OSX 10.12.6.) Could it be the JVM version? (I am running 1.8.0_144.)

Could it also be caching related? I will try another testing round with extreme caching enabled to make sure those caches grow huge. Something like:

hit_cache_size => 100000
hit_cache_ttl => 10000
failed_cache_size => 100000
failed_cache_ttl => 10000
colinsurprenant commented 6 years ago

Damn - I just realized that my IP deduping was wrong and my IPs are not unique. Fixing that and re-running tests.

colinsurprenant commented 6 years ago

My test file now contains 68648 unique IPs.

My test config is:

cat ips.txt | bin/logstash -e 'filter{dns{reverse=>["message"] hit_cache_size => 0  failed_cache_size => 0}} output{stdout{codec=>dots}}'

So basically this is without any caching.

Overall the EPS differences here are not really important, in the sense that the same behaviour was witnessed in all these scenarios: the pipeline was steadily processing, with many periods of 0 EPS because of the DNS timeouts, but it never completely stalled.

So my conclusion is similar to the one above: I can't seem to reproduce the stalling as something happening after some time. @jake2184 you mentioned earlier that what you witness is not a complete stall but a heavy slowdown at some point. It seems that the numbers you gave are similar to the numbers I see in my tests. What confuses me is that all my tests are actually behaving the way you describe your problem, but for me it doesn't happen after some time; it is steadily very slow.

I can't reproduce a scenario that works fast for a while. Is that normal because many of my IPs are in fact "normally" timing out? Should I try with only IPs known to be valid? I will check that.

jake2184 commented 6 years ago

I ran my same tests again (LS 6.2.3 + DNS 3.0.9), and measured using the script above. The output was: output.txt

As you can see, I can rapidly do many DNS lookups for the first 60K-ish events, and then there is a drastic drop-off. It looks like it does still process events in intermittent bursts, perhaps as whatever resource is tied up is freed? However, it can process 0 events for 30 seconds.

Version info: CentOS 7.2, OpenJDK Java 1.8.0_111

I don't think it's an individual system config problem, as I see the issue on both our production system and the test one I am running. I get the feeling it is going to be tricky for you to help much if you cannot reproduce the problem! Very grateful for any and all time spent assisting.

colinsurprenant commented 6 years ago

I figured out last night how to test that - stay tuned 🤔

colinsurprenant commented 6 years ago

BOOM 💥 I can reproduce.

Here's what I did:

This setup allows me to generate an exact number of always-resolvable IP addresses.

The result: after EXACTLY 65534 unique address resolutions, the resolver systematically times out.

Now that I have a reproduction setup, let's work on the solution. I will now officially re-open this issue.

colinsurprenant commented 6 years ago

Good news.

I can reproduce this problem outside of logstash using my dnsmasq setup above and this simple script:

require "resolv"
require "ipaddr"

resolv = Resolv.new
ip = IPAddr.new("10.0.0.1")

i = 0
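# walk successive addresses so every reverse lookup hits a new, uncached IP;
# on an affected JRuby the getname calls start timing out after ~65k iterations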
loop  do
  puts("#{i}, #{ip.to_s}, #{resolv.getname(ip.to_s)}")
  ip = ip.succ
  i += 1
end

It fails on JRuby versions 9.1.13.0 to 9.1.15.0 and is fixed in 9.1.16.0.

As suggested by @jsvd above, the fix IS indeed this patch, as seen in the JRuby diff on lib/ruby/stdlib/resolv.rb:

lib/ruby/stdlib/resolv.rb
@@ -776,7 +776,7 @@ def sender(msg, data, host, port=Port)
           sock = @socks_hash[host.index(':') ? "::" : "0.0.0.0"]
           return nil if !sock
           service = [IPAddr.new(host), port]
-          id = DNS.allocate_request_id(host, port)
+          id = DNS.allocate_request_id(service[0], service[1])
           request = msg.encode
           request[0,2] = [id].pack('n')
           return @senders[[service, id]] =

⚠️ Now, this patch cannot be applied to vendor/jruby/lib/ruby/stdlib/resolv.rb, because that is not the copy of the JRuby classes used in the Logstash runtime; those come from logstash-core/lib/jars/jruby-complete-9.1.13.0.jar.

The correct fix will be to upgrade to a JRuby version >= 9.1.16.0, but until we can do that I will probably be looking at creating a monkey patch for resolv.rb.
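For illustration only, one possible shape for such a monkey patch is sketched below - it re-opens the UDP requester so the request id is allocated under the same [IPAddr, port] key it is later freed under; this is just the idea from the JRuby commit applied at runtime, not necessarily what the eventual patch will look like:

require "resolv"
require "ipaddr"

class Resolv::DNS::Requester::UnconnectedUDP
  # redefine #sender with the one-line fix applied; the rest of the body
  # mirrors the stdlib version being patched (argument lists assumed from it)
  def sender(msg, data, host, port = Resolv::DNS::Port)
    sock = @socks_hash[host.index(':') ? "::" : "0.0.0.0"]
    return nil if !sock
    service = [IPAddr.new(host), port]
    id = Resolv::DNS.allocate_request_id(service[0], service[1]) # patched line
    request = msg.encode
    request[0, 2] = [id].pack('n')
    @senders[[service, id]] = Sender.new(request, data, sock, host, port)
  end
end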

nithril commented 6 years ago

> Now, this patch cannot be applied to vendor/jruby/lib/ruby/stdlib/resolv.rb, because that is not the copy of the JRuby classes used in the Logstash runtime; those come from logstash-core/lib/jars/jruby-complete-9.1.13.0.jar.

All this effort just because we patched the wrong file. Thank you @colinsurprenant :+1:

colinsurprenant commented 6 years ago

submitted #45 and also elastic/logstash/issues/9750

colinsurprenant commented 6 years ago

@nithril yeah really - I wish I had clued in earlier. At least now we have a solid setup to reproduce and stress test any scenarios with this filter and that also gave me an idea on how to make the input generator plugin more flexible.

jake2184 commented 6 years ago

Brilliant! Thanks for all your help + time.

When can we expect the pull request to be merged and released (for the plugin)?

jsvd commented 6 years ago

@jake2184 the PR is being reviewed, progress is visible there

colinsurprenant commented 6 years ago

logstash-filter-dns v3.0.11 is published. @jake2184 @nithril would love to know if that works for you.

nithril commented 6 years ago

Thank you very much!

I will be able to test in 2-4 weeks :/

jake2184 commented 6 years ago

Can confirm this fixes my test, I can happily resolve 5K+ events a second :)

Thanks very much

colinsurprenant commented 6 years ago

Great stuff! Thanks for the update @jake2184

berglh commented 5 years ago

@colinsurprenant I'm seeing this behaviour on logstash-filter-dns v3.0.12 in Logstash 6.7.1. Here are my DNS filter configurations; there are a few more instances of the plugin running in other config files.

2-general/400000-ad-smb.conf:      dns {
2-general/400000-ad-smb.conf-        resolve => ["[@metadata][destination][address]"]
2-general/400000-ad-smb.conf-        action => "replace"
2-general/400000-ad-smb.conf-        failed_cache_size => 1000
2-general/400000-ad-smb.conf-        failed_cache_ttl => 60
2-general/400000-ad-smb.conf-        hit_cache_size => 10000
2-general/400000-ad-smb.conf-        hit_cache_ttl => 300
2-general/400000-ad-smb.conf-      }
--
2-general/400000-ad-smb.conf:            dns {
2-general/400000-ad-smb.conf-              reverse => ["[@metadata][ip]"]
2-general/400000-ad-smb.conf-              action => "replace"
2-general/400000-ad-smb.conf-              failed_cache_size => 1000
2-general/400000-ad-smb.conf-              failed_cache_ttl => 60
2-general/400000-ad-smb.conf-              hit_cache_size => 10000
2-general/400000-ad-smb.conf-              hit_cache_ttl => 300
2-general/400000-ad-smb.conf-            }

I'm wondering if the multiple instances are somehow conflicting. After a certain time, almost all the lookups time out, slowing down the pipelines significantly. I'm also wondering if the hit cache is getting so large that the default timeout of 500 ms is hit before we actually reach our resolving name servers. Anyway, I'm going to try downgrading to logstash-filter-dns v3.0.11 to see if I can replicate it there. Another option would be to avoid using the caches altogether.
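For reference, a specific plugin version can be installed with something like the following (it may be necessary to remove the currently installed version first):

bin/logstash-plugin remove logstash-filter-dns
bin/logstash-plugin install --version 3.0.11 logstash-filter-dns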

colinsurprenant commented 5 years ago

@berglh thanks for the heads-up. In fact we've seen other reports of this, which looks like a regression in 6.7+. I am trying to reproduce, but without success so far. It does not look like it is necessarily related to the same 64k problem previously solved here.

Will report back when I find more.

berglh commented 5 years ago

@colinsurprenant I've tested some more this morning and I tried the following:

The behaviour continued in the first two on LS 6.7.1. Things seem stable again back on LS 6.5.1, but I wonder if the random crash that was fixed in a newer version of the plugin might be what caused the occasional issue with my pipeline completely stopping (without a noticeable error that I could see).

Some other notes about my environment:

Do you want me to open a new issue or comment on an open one?

colinsurprenant commented 5 years ago

@berglh Thanks for the followup.

I believe I found the source of the problem. I will need to validate tomorrow, but there seems to be a leak (a request id cache fills up, which leads to an infinite loop that is aborted by the timeout) in the new version of the resolver code that shipped starting with 6.7. For now, as you noticed, the only workaround is to downgrade Logstash to 6.6 (6.6.2 as of now); just downgrading the dns filter will not help.

colinsurprenant commented 5 years ago

@berglh Yeah, we should open a new issue and follow up there...