logstash-plugins / logstash-filter-dns


Logstash Crash with ConcurrencyError: interrupted waiting for mutex: null #26

Open purbon opened 8 years ago

purbon commented 8 years ago

Under a specific set of circumstances there can be a concurrency problem while processing the /etc/hosts file to set up the necessary resolvers. This situation only seems to happen when one filter instance is stuck setting up the hosts resolver and another instance gets interrupted while waiting on it, which makes it tricky to reproduce with a live LS instance.

The condition can be reproduced independently of Logstash with the code at https://gist.github.com/purbon/106e80a5d85b3a6fbf1e5f10b5d0d643.
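For context, here is a minimal sketch of the same race (the thread structure and timing values are illustrative and not taken from the gist; on MRI the interrupt surfaces as Timeout::Error rather than ConcurrencyError):

    require 'thread'
    require 'timeout'

    # One thread holds the mutex that Resolv uses for lazy initialization;
    # another thread is interrupted by Timeout while it waits for that same
    # mutex. On JRuby this surfaces as "ConcurrencyError: interrupted
    # waiting for mutex: null"; on MRI it shows up as Timeout::Error.
    mutex = Mutex.new

    holder = Thread.new do
      mutex.synchronize { sleep 5 } # simulates a slow lazy_initialize
    end

    sleep 0.1 # let the holder grab the mutex first

    waiter = Thread.new do
      begin
        Timeout.timeout(0.5) do
          mutex.synchronize { :never_reached }
        end
      rescue Exception => e
        puts "#{e.class}: #{e.message}"
      end
    end

    waiter.join
    holder.join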

The exception reported is:

"exception"=>#<ConcurrencyError: interruptedwaitingformutex: null>,
    "backtrace"=>["org/jruby/ext/thread/Mutex.java:94:in `lock'",
    "org/jruby/ext/thread/Mutex.java:147:in `synchronize'",
    "/opt/logstash/vendor/jruby/lib/ruby/1.9/resolv.rb:190:in `lazy_initialize'",
    "/opt/logstash/vendor/jruby/lib/ruby/1.9/resolv.rb:268:in `each_name'",
    "/opt/logstash/vendor/jruby/lib/ruby/1.9/resolv.rb:151:in `each_name'",
    "org/jruby/RubyArray.java:1613:in `each'",
    "/opt/logstash/vendor/jruby/lib/ruby/1.9/resolv.rb:150:in `each_name'",
    "/opt/logstash/vendor/jruby/lib/ruby/1.9/resolv.rb:132:in `getname'",
    "/opt/logstash/vendor/bundle/jruby/1.9/gems/logstash-filter-dns-2.1.3/lib/logstash/filters/dns.rb:244:in `getname'",
    "/opt/logstash/vendor/bundle/jruby/1.9/gems/logstash-filter-dns-2.1.3/lib/logstash/filters/dns.rb:231:in `retriable_getname'",
    "org/jruby/RubyProc.java:281:in `call'",
    "/opt/logstash/vendor/bundle/jruby/1.9/gems/logstash-filter-dns-2.1.3/lib/logstash/filters/dns.rb:216:in `retriable_request'",
    "org/jruby/ext/timeout/Timeout.java:115:in `timeout'",
    "/opt/logstash/vendor/bundle/jruby/1.9/gems/logstash-filter-dns-2.1.3/lib/logstash/filters/dns.rb:215:in `retriable_request'",
    "/opt/logstash/vendor/bundle/jruby/1.9/gems/logstash-filter-dns-2.1.3/lib/logstash/filters/dns.rb:230:in `retriable_getname'",
    "/opt/logstash/vendor/bundle/jruby/1.9/gems/logstash-filter-dns-2.1.3/lib/logstash/filters/dns.rb:178:in `reverse'",
    "org/jruby/RubyArray.java:1613:in `each'",
    "/opt/logstash/vendor/bundle/jruby/1.9/gems/logstash-filter-dns-2.1.3/lib/logstash/filters/dns.rb:156:in `reverse'",

This problem is independent of the LS code; it has been reported in several different environments, see [1] and [2] as examples.

Plan of action:

jordansissel commented 8 years ago

I don't yet agree fully with the problem description.

Your sample code is clear, thank you for that! However, my main concern is that the dns filter is interrupted at all -- what causes this?

Some notes from reading the code: it seems Mutex#synchronize is being interrupted, and interrupting a thread in a critical section seems like bad practice. The frame "/opt/logstash/vendor/jruby/lib/ruby/1.9/resolv.rb:190:in 'lazy_initialize'" appears to be the resolv library waiting for ownership of the mutex. The frame "org/jruby/ext/timeout/Timeout.java:115:in 'timeout'" tells me we are using the timeout library to interrupt the DNS lookup after an expiration time. Indeed, the Timeout library is used in the dns filter.

The problem I see is that we are using the Timeout library to abort lookups that take too long. This timeout interrupts the lookup thread and, I believe, results in this "ConcurrencyError: interrupted waiting for mutex: null" error.
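To make the suspected mechanism concrete, here is a simplified sketch of that pattern (the method names mirror the backtrace frames, but the bodies are illustrative and not the plugin's actual source):

    require 'timeout'
    require 'resolv'

    # The lookup is wrapped in Timeout.timeout, so a slow lookup gets
    # interrupted even while it is blocked inside Resolv's own mutex
    # (lazy_initialize). Method names mirror the backtrace frames; the
    # bodies are illustrative only.
    def retriable_request(timeout_seconds)
      Timeout.timeout(timeout_seconds) { yield }
    end

    def retriable_getname(resolver, address, timeout_seconds)
      retriable_request(timeout_seconds) { resolver.getname(address) }
    end

    # If the deadline expires while Resolv is waiting on its internal
    # mutex, JRuby reports "ConcurrencyError: interrupted waiting for
    # mutex: null" instead of a plain Timeout::Error.
    resolver = Resolv.new
    puts retriable_getname(resolver, "127.0.0.1", 0.5)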

Some possible solutions I think of:

  1. Avoid using the Timeout library and signal timeouts with another mechanism.
  2. Or, rescue ConcurrencyError and have the code treat the exception as a timeout if the exception message includes "interrupted waiting for mutex" (a rough sketch follows below).
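A rough sketch of what option 2 could look like (the wrapper method is hypothetical and the rescue assumes JRuby's ConcurrencyError constant; this is not the plugin's actual code):

    require 'timeout'

    # Hypothetical wrapper illustrating option 2: re-raise a
    # ConcurrencyError whose message mentions the mutex wait as a lookup
    # timeout. ConcurrencyError is defined by JRuby, not by MRI.
    def lookup_with_timeout(resolver, address, seconds)
      Timeout.timeout(seconds) { resolver.getname(address) }
    rescue ConcurrencyError => e
      raise Timeout::Error, e.message if e.message.to_s.include?("interrupted waiting for mutex")
      raise
    end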
purbon commented 8 years ago

That is why I love discussion 👍. I agree that Timeout could be the source of this thread getting killed, but the problem is trickier to debug, I guess. Even with a very small timeout value we did not manage to reproduce this issue with a Logstash pipeline.

I agree your points are fair as solutions, but I also have a few open questions:

Re: avoiding the Timeout library:

Re: rescuing ConcurrencyError:

This is basically why I went for the mutex: to "protect" the resolve operation and keep it thread safe.

In case we decide to remove the mutex and go for one of your proposals, I would lean towards 2.; however, I would like to rule out possible masquerading of unrelated errors as timeouts first.

What do you think?

jordansissel commented 8 years ago

we still need some kind of timeout mechanism

The Resolv library appears to support timeouts for DNS lookups: http://ruby-doc.org/stdlib-2.1.0/libdoc/resolv/rdoc/Resolv/DNS.html#method-i-timeouts-3D
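For reference, a sketch of what that would look like where Resolv::DNS#timeouts= is available (it is part of the resolv library shipped with Ruby 2.0+, not the 1.9 copy bundled with Logstash at the time):

    require 'resolv'

    # Let Resolv itself enforce the deadline instead of wrapping the call
    # in Timeout.timeout. timeouts= accepts a number of seconds or an
    # array of per-retry timeouts; available in resolv from Ruby 2.0+.
    resolver = Resolv::DNS.new
    resolver.timeouts = 2

    begin
      puts resolver.getname("8.8.8.8")
    rescue Resolv::ResolvError => e
      puts "lookup failed: #{e.message}"
    end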

purbon commented 8 years ago

@jordansissel this is a very good idea; I will update my proposal with it.

purbon commented 8 years ago

I reviewed the code and I wonder where we stand on the move to the 2.0 codebase. I remember some discussion with @andrewvc about this, but I am not sure a direction was decided. I ask because this feature is only available in 2.0; however, we could patch 1.9 to change this value and see how it works.

My preference would be not to move to 2.0, as we would need time to test, but I am open to proposals for sure. I will update the PR with the patch solution for 1.9 and wait for your input.

What do you think?

purbon commented 8 years ago

@jordansissel I updated the PR with a proposal to remove the Timeout module; let me know what you think.

jordansissel commented 8 years ago

I wonder where we stand on the move to the 2.0 codebase

We won't be moving to 2.0 (and to JRuby 9k) until after 5.0.0.

purbon commented 8 years ago

@jordansissel then my current proposal in #27 should be good, as I patched resolv in 1.9 to introduce these timeouts. Let me know how it works for you. Does it make sense?

headius commented 8 years ago

FYI I looked at this a bit in jruby/jruby#4048. The example there does indeed produce the same error, but I'd expect it to produce that error, since the native thread is being interrupted while it waits on a mutex. The real problem here is finding out why that happens at runtime in Logstash. We need to trap the interrupt somehow.

fedelemantuano commented 7 years ago

Do you have any updates about this issue?

VuokkoVuorinnen commented 7 years ago

Same issue happening here, or at least something similar

Logstash 2.4.0, CentOS 6.8 x86_64, Puppet writing a whole lot of hosts to /etc/hosts, and no DNS server because it is used in an environment not connected to the big bad internet.

Error message:

ConcurrencyError: interrupted waiting for mutex: null
               lock at org/jruby/ext/thread/Mutex.java:94
        synchronize at org/jruby/ext/thread/Mutex.java:147
    lazy_initialize at /opt/logstash/vendor/jruby/lib/ruby/1.9/resolv.rb:190
          each_name at /opt/logstash/vendor/jruby/lib/ruby/1.9/resolv.rb:268
          each_name at /opt/logstash/vendor/jruby/lib/ruby/1.9/resolv.rb:151
               each at org/jruby/RubyArray.java:1613
          each_name at /opt/logstash/vendor/jruby/lib/ruby/1.9/resolv.rb:150
            getname at /opt/logstash/vendor/jruby/lib/ruby/1.9/resolv.rb:132
            getname at /opt/logstash/vendor/bundle/jruby/1.9/gems/logstash-filter-dns-2.1.3/lib/logstash/filters/dns.rb:244
  retriable_getname at /opt/logstash/vendor/bundle/jruby/1.9/gems/logstash-filter-dns-2.1.3/lib/logstash/filters/dns.rb:231
               call at org/jruby/RubyProc.java:281
  retriable_request at /opt/logstash/vendor/bundle/jruby/1.9/gems/logstash-filter-dns-2.1.3/lib/logstash/filters/dns.rb:216
            timeout at org/jruby/ext/timeout/Timeout.java:115
  retriable_request at /opt/logstash/vendor/bundle/jruby/1.9/gems/logstash-filter-dns-2.1.3/lib/logstash/filters/dns.rb:215
  retriable_getname at /opt/logstash/vendor/bundle/jruby/1.9/gems/logstash-filter-dns-2.1.3/lib/logstash/filters/dns.rb:230
            reverse at /opt/logstash/vendor/bundle/jruby/1.9/gems/logstash-filter-dns-2.1.3/lib/logstash/filters/dns.rb:176
             getset at /opt/logstash/vendor/bundle/jruby/1.9/gems/lru_redux-1.1.0/lib/lru_redux/cache.rb:34
             getset at /opt/logstash/vendor/bundle/jruby/1.9/gems/lru_redux-1.1.0/lib/lru_redux/util/safe_sync_jruby.rb:6
    mon_synchronize at /opt/logstash/vendor/jruby/lib/ruby/1.9/monitor.rb:211
    mon_synchronize at /opt/logstash/vendor/jruby/lib/ruby/1.9/monitor.rb:210
             getset at /opt/logstash/vendor/bundle/jruby/1.9/gems/lru_redux-1.1.0/lib/lru_redux/util/safe_sync_jruby.rb:5
            reverse at /opt/logstash/vendor/bundle/jruby/1.9/gems/logstash-filter-dns-2.1.3/lib/logstash/filters/dns.rb:176
               each at org/jruby/RubyArray.java:1613
            reverse at /opt/logstash/vendor/bundle/jruby/1.9/gems/logstash-filter-dns-2.1.3/lib/logstash/filters/dns.rb:156
             filter at /opt/logstash/vendor/bundle/jruby/1.9/gems/logstash-filter-dns-2.1.3/lib/logstash/filters/dns.rb:95
       multi_filter at /opt/logstash/vendor/bundle/jruby/1.9/gems/logstash-core-2.4.0-java/lib/logstash/filters/base.rb:151
               each at org/jruby/RubyArray.java:1613
       multi_filter at /opt/logstash/vendor/bundle/jruby/1.9/gems/logstash-core-2.4.0-java/lib/logstash/filters/base.rb:148
         initialize at (eval):41679
               each at org/jruby/RubyArray.java:1613
         initialize at (eval):41675
               call at org/jruby/RubyProc.java:281
        filter_func at (eval):5309
       filter_batch at /opt/logstash/vendor/bundle/jruby/1.9/gems/logstash-core-2.4.0-java/lib/logstash/pipeline.rb:267
               each at org/jruby/RubyArray.java:1613
             inject at org/jruby/RubyEnumerable.java:852
       filter_batch at /opt/logstash/vendor/bundle/jruby/1.9/gems/logstash-core-2.4.0-java/lib/logstash/pipeline.rb:265
        worker_loop at /opt/logstash/vendor/bundle/jruby/1.9/gems/logstash-core-2.4.0-java/lib/logstash/pipeline.rb:223
      start_workers at /opt/logstash/vendor/bundle/jruby/1.9/gems/logstash-core-2.4.0-java/lib/logstash/pipeline.rb:201
sumitkgaur commented 7 years ago

Has this issue been resolved in any version?