Open purbon opened 8 years ago
I don't yet agree fully with the problem description.
Your sample code is clear, thank you for that! However, my main concern is that the dns filter is interrupted at all -- what causes this?
Some notes from reading the code: seems like Mutex#synchronize
is being interrupted. Interrupting a thread in a critical section seems like a bad practice. Anyway, the exception "/opt/logstash/vendor/jruby/lib/ruby/1.9/resolv.rb:190:in 'lazy_initialize'",
appears to be while the resolv library is waiting for ownership of the mutex. The exception "org/jruby/ext/timeout/Timeout.java:115:in 'timeout'",
tells me we are using the timeout library to interrupt the dns lookup after an expiration time. Indeed, the Timeout
library is used in the dns filter.
The problem I see is that we are using the Timeout
library to abort lookups that take too long. This timeout interrupts the lookup thread and, I believe, results in this ConcurrencyError: interruptedwaitingformutex: null
error.
Some possible solutions I think of:
Timeout
library and signal timeouts with another mechanismConcurrencyError
and have the code understand that the exception is a timeout if the exception message includes "interruptedwaitingformutex"that is why I love discussion 👍, I agree timeout could be the source of getting this thread killed, but problem is more tricky to debug I guess. Even with very small timeout value we did not manage to reproduce this issue with a logstash pipeline.
I have a an open question:
I agree your points are fair also as solutions, but have also a few open questions:
re: avoid using timeout:
re : rescue ConcurrencyError
This is basically why I when for the mutex to "protect" the resolve operation to be threadsafe.
I think in case we decide to remove the mutex and go for one of your proposals I would think more about 2.
, however I would like to rollout possible masquerading.
What do you think?
we still need somehow a timeout mechanism
The Resolv library appears to support timeouts for DNS lookups: http://ruby-doc.org/stdlib-2.1.0/libdoc/resolv/rdoc/Resolv/DNS.html#method-i-timeouts-3D
@jordansissel this is a very good idea, will update my proposal with this.
I reviewed the code and I wonder how are we with the move to use the 2.0 codebase, I remember some discussion with @andrewvc about this, but not sure if there was a direction for this. I ask because this feature is only available for 2.0, however we could have a patch to change this value in 1.9 and see how it works.
My preference would be not to change to 2.0 as we need time to test, but open to proposals for sure. Will update the PR with the patch solution for 1.9 and wait for your input.
What do you think?
@jordansissel updated the PR with a timeout module remove proposal, let me know what do you think.
I wonder how are we with the move to use the 2.0 codebase
We won't be moving to 2.0 (and also jruby 9k) until after 5.0.0
@jordansissel then my current proposal at #27 is good as I patched resolv in 1.9 to introduce this timeouts. Let me know how it works for you? makes sense?
FYI I looked at this a bit in jruby/jruby#4048. The example there does indeed produce the same error, but I'd expect it to produce that error since the native thread is being interrupted while it waits on a mutex. The real problem here is finding why that happens at runtime in Logstash. We need to trap the interrupt somehow.
Do you have any updates about this issue?
Same issue happening here, or at least something similar
Logstash 2.4.0 CentOS 6.8 x84 Puppet writing a whole lot of hosts to /etc/hosts, no dns server because used in environment not connected to the big bad internet.
Error mesage
ConcurrencyError: interrupted waiting for mutex: null
lock at org/jruby/ext/thread/Mutex.java:94
synchronize at org/jruby/ext/thread/Mutex.java:147
lazy_initialize at /opt/logstash/vendor/jruby/lib/ruby/1.9/resolv.rb:190
each_name at /opt/logstash/vendor/jruby/lib/ruby/1.9/resolv.rb:268
each_name at /opt/logstash/vendor/jruby/lib/ruby/1.9/resolv.rb:151
each at org/jruby/RubyArray.java:1613
each_name at /opt/logstash/vendor/jruby/lib/ruby/1.9/resolv.rb:150
getname at /opt/logstash/vendor/jruby/lib/ruby/1.9/resolv.rb:132
getname at /opt/logstash/vendor/bundle/jruby/1.9/gems/logstash-filter-dns-2.1.3/lib/logstash/filters/dns.rb:244
retriable_getname at /opt/logstash/vendor/bundle/jruby/1.9/gems/logstash-filter-dns-2.1.3/lib/logstash/filters/dns.rb:231
call at org/jruby/RubyProc.java:281
retriable_request at /opt/logstash/vendor/bundle/jruby/1.9/gems/logstash-filter-dns-2.1.3/lib/logstash/filters/dns.rb:216
timeout at org/jruby/ext/timeout/Timeout.java:115
retriable_request at /opt/logstash/vendor/bundle/jruby/1.9/gems/logstash-filter-dns-2.1.3/lib/logstash/filters/dns.rb:215
retriable_getname at /opt/logstash/vendor/bundle/jruby/1.9/gems/logstash-filter-dns-2.1.3/lib/logstash/filters/dns.rb:230
reverse at /opt/logstash/vendor/bundle/jruby/1.9/gems/logstash-filter-dns-2.1.3/lib/logstash/filters/dns.rb:176
getset at /opt/logstash/vendor/bundle/jruby/1.9/gems/lru_redux-1.1.0/lib/lru_redux/cache.rb:34
getset at /opt/logstash/vendor/bundle/jruby/1.9/gems/lru_redux-1.1.0/lib/lru_redux/util/safe_sync_jruby.rb:6
mon_synchronize at /opt/logstash/vendor/jruby/lib/ruby/1.9/monitor.rb:211
mon_synchronize at /opt/logstash/vendor/jruby/lib/ruby/1.9/monitor.rb:210
getset at /opt/logstash/vendor/bundle/jruby/1.9/gems/lru_redux-1.1.0/lib/lru_redux/util/safe_sync_jruby.rb:5
reverse at /opt/logstash/vendor/bundle/jruby/1.9/gems/logstash-filter-dns-2.1.3/lib/logstash/filters/dns.rb:176
each at org/jruby/RubyArray.java:1613
reverse at /opt/logstash/vendor/bundle/jruby/1.9/gems/logstash-filter-dns-2.1.3/lib/logstash/filters/dns.rb:156
filter at /opt/logstash/vendor/bundle/jruby/1.9/gems/logstash-filter-dns-2.1.3/lib/logstash/filters/dns.rb:95
multi_filter at /opt/logstash/vendor/bundle/jruby/1.9/gems/logstash-core-2.4.0-java/lib/logstash/filters/base.rb:151
each at org/jruby/RubyArray.java:1613
multi_filter at /opt/logstash/vendor/bundle/jruby/1.9/gems/logstash-core-2.4.0-java/lib/logstash/filters/base.rb:148
initialize at (eval):41679
each at org/jruby/RubyArray.java:1613
initialize at (eval):41675
call at org/jruby/RubyProc.java:281
filter_func at (eval):5309
filter_batch at /opt/logstash/vendor/bundle/jruby/1.9/gems/logstash-core-2.4.0-java/lib/logstash/pipeline.rb:267
each at org/jruby/RubyArray.java:1613
inject at org/jruby/RubyEnumerable.java:852
filter_batch at /opt/logstash/vendor/bundle/jruby/1.9/gems/logstash-core-2.4.0-java/lib/logstash/pipeline.rb:265
worker_loop at /opt/logstash/vendor/bundle/jruby/1.9/gems/logstash-core-2.4.0-java/lib/logstash/pipeline.rb:223
start_workers at /opt/logstash/vendor/bundle/jruby/1.9/gems/logstash-core-2.4.0-java/lib/logstash/pipeline.rb:201
Does this issue get resolved in any version ???
Under a concrete set of circumstances there might be a concurrency problem while processing the /etc/hosts file to setup necessary resolvers. This situation might only happen when the one filter instance is stuck setting up the hosts but this thread and then another instance got kill while waiting, making this tricky so reproduce with a living LS instance.
The condition could be reproduced independently of logstash using https://gist.github.com/purbon/106e80a5d85b3a6fbf1e5f10b5d0d643 code.
The exception reported is:
This problem is independant of LS code, it has been reported in several different environments see [1] and [2] as examples.
Plan of action: