logstash-plugins / logstash-output-influxdb

Apache License 2.0

Logstash often stops sending data to InfluxDB after an InfluxDB shutdown/restart or network disconnect #18

Closed evgygor closed 7 years ago

evgygor commented 9 years ago

Hi, I have a cluster of 10 RHEL 6.4 servers: 9 servers with logstash-1.5.0.rc2-1 and 1 server with logstash-1.5.0.beta1-1. All servers have logstash-output-influxdb.

I checked Logstash's behavior when InfluxDB is crashed/killed/shut down for several minutes and then restarted, or when the network is disconnected for several seconds or minutes. The results are as follows:

Killing InfluxDB for several minutes: this message starts to appear in logstash.log on each server:

```
{:timestamp=>"2015-03-25T14:21:13.456000+0200", :message=>"Failed to flush outgoing items", :outgoing_count=>1, :exception=>#<Errno::EBADF: Bad file descriptor - Bad file descriptor>, :backtrace=>[
  "org/jruby/RubyIO.java:2099:in `close'",
  "/opt/logstash/vendor/bundle/jruby/1.9/gems/ftw-0.0.42/lib/ftw/connection.rb:176:in `connect'",
  "org/jruby/RubyArray.java:1613:in `each'",
  "/opt/logstash/vendor/bundle/jruby/1.9/gems/ftw-0.0.42/lib/ftw/connection.rb:142:in `connect'",
  "/opt/logstash/vendor/bundle/jruby/1.9/gems/ftw-0.0.42/lib/ftw/agent.rb:421:in `connect'",
  "org/jruby/RubyProc.java:271:in `call'",
  "/opt/logstash/vendor/bundle/jruby/1.9/gems/ftw-0.0.42/lib/ftw/pool.rb:49:in `fetch'",
  "/opt/logstash/vendor/bundle/jruby/1.9/gems/ftw-0.0.42/lib/ftw/agent.rb:418:in `connect'",
  "/opt/logstash/vendor/bundle/jruby/1.9/gems/ftw-0.0.42/lib/ftw/agent.rb:283:in `execute'",
  "/opt/logstash/vendor/bundle/jruby/1.9/gems/ftw-0.0.42/lib/ftw/agent.rb:181:in `post!'",
  "/opt/logstash/vendor/bundle/jruby/1.9/gems/logstash-output-influxdb-0.1.3/lib/logstash/outputs/influxdb.rb:237:in `post'",
  "/opt/logstash/vendor/bundle/jruby/1.9/gems/logstash-output-influxdb-0.1.3/lib/logstash/outputs/influxdb.rb:231:in `flush'",
  "/opt/logstash/vendor/bundle/jruby/1.9/gems/stud-0.0.19/lib/stud/buffer.rb:219:in `buffer_flush'",
  "org/jruby/RubyHash.java:1341:in `each'",
  "/opt/logstash/vendor/bundle/jruby/1.9/gems/stud-0.0.19/lib/stud/buffer.rb:216:in `buffer_flush'",
  "/opt/logstash/vendor/bundle/jruby/1.9/gems/stud-0.0.19/lib/stud/buffer.rb:193:in `buffer_flush'",
  "/opt/logstash/vendor/bundle/jruby/1.9/gems/stud-0.0.19/lib/stud/buffer.rb:159:in `buffer_receive'",
  "/opt/logstash/vendor/bundle/jruby/1.9/gems/logstash-output-influxdb-0.1.3/lib/logstash/outputs/influxdb.rb:196:in `receive'",
  "/opt/logstash/lib/logstash/outputs/base.rb:88:in `handle'",
  "(eval):4529:in `initialize'",
  "org/jruby/RubyProc.java:271:in `call'",
  "/opt/logstash/lib/logstash/pipeline.rb:279:in `output'",
  "/opt/logstash/lib/logstash/pipeline.rb:235:in `outputworker'",
  "/opt/logstash/lib/logstash/pipeline.rb:163:in `start_outputs'"
], :level=>:warn}
```

- Data isn't sent to InfluxDB.

Starting InfluxDB again:
- The message stops appearing on all servers.
- Several servers start sending data to InfluxDB again.
- Others never recover; only a `kill -9` and a fresh start solves the issue.
- Even `/etc/init.d/logstash stop` doesn't work, as in bug #2796.

I tried this test several times; each time some Logstash instances recovered and others got stuck. On the servers with a stuck Logstash, I checked the network status: port 8086 (the InfluxDB REST port) was in CLOSE_WAIT state.

In a second test I shut down the NIC for a few minutes and brought it back up. Logstash behaved the same way, with one exception: no errors appeared in the log file at all.

wblakecaldwell commented 9 years ago

There's another issue that I fixed here: https://github.com/logstash-plugins/logstash-output-influxdb/pull/15

It retries over and over to POST data to InfluxDB when it's down (good!), but duplicates the batch data on every attempt (bad!).
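A minimal sketch of that failure mode (the variable names here are illustrative, not the plugin's actual code): if the batch body is accumulated across flush attempts instead of being rebuilt per attempt, every retry re-sends everything already queued.

```ruby
# Sketch of the duplicate-batch bug: a body buffer that survives
# across retry attempts keeps growing with the same events.
events = [{ "value" => 1 }, { "value" => 2 }]

buggy_body = []
2.times do                             # two flush attempts (first one "failed")
  events.each { |e| buggy_body << e }  # retry appends the batch again
end
# buggy_body now holds 4 items: the batch was duplicated

fixed_body = nil
2.times do
  fixed_body = events.dup              # rebuilt per attempt, stays at 2 items
end
```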

Typositoire commented 9 years ago

I was able to isolate the issue.

In the snippet below from Stud::Buffer, you can see that if an exception is raised in buffer_flush(), it sleeps 1 second and retries, forever. In my case the InfluxDB host was sending me a malformed HTTP response, and I got an HTTP parse exception from FTW. All I did was put a begin/rescue block around post() (plus whatever logic you want to retry a couple of times). Otherwise, it freezes Logstash completely, requiring a kill -9.

```ruby
@buffer_state[:outgoing_items].each do |group, events|
  begin
    if group.nil?
      flush(events, final)
    else
      flush(events, group, final)
    end

    @buffer_state[:outgoing_items].delete(group)
    events_size = events.size
    @buffer_state[:outgoing_count] -= events_size
    items_flushed += events_size

  rescue => e
    @buffer_config[:logger].warn("Failed to flush outgoing items",
      :outgoing_count => @buffer_state[:outgoing_count],
      :exception => e,
      :backtrace => e.backtrace
    ) if @buffer_config[:logger]

    if @buffer_config[:has_on_flush_error]
      on_flush_error e
    end

    sleep 1
    retry
  end
  @buffer_state[:last_flush] = Time.now
end

# ... from the end of the enclosing buffer_flush method:
ensure
  @buffer_state[:flush_mutex].unlock
end
```
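Since the bare `sleep 1; retry` in Stud::Buffer never gives up, a bounded retry around the HTTP call keeps one bad batch from wedging the whole pipeline. A minimal self-contained sketch, where `post` is a stand-in for the plugin's HTTP POST (this is not the actual plugin fix):

```ruby
# Hypothetical bounded-retry wrapper: after max_retries failed attempts
# the batch is dropped instead of blocking buffer_flush forever.
$attempts = 0

# Stand-in for the plugin's post() -- here it always fails, the way a
# down InfluxDB would.
def post(body)
  $attempts += 1
  raise IOError, "connection refused"
end

def post_with_retries(body, max_retries = 3)
  tries = 0
  begin
    post(body)
  rescue StandardError => e
    tries += 1
    if tries <= max_retries
      sleep 0.01                 # short sleep for the sketch
      retry
    end
    warn "dropping batch after #{tries} failed attempts: #{e.class}"
  end
end

post_with_retries("fake batch")
# post() was called 1 + max_retries = 4 times, then the batch was dropped
```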
evgygor commented 9 years ago

This doesn't solve the issue with the InfluxDB plugin. The plugin already uses a begin/rescue block:

```ruby
begin
  @logger.debug("Post body: #{body}")
  response = @agent.post!(@url, :body => body)
rescue EOFError
  @logger.warn("EOF while writing request or reading response header from InfluxDB",
    :host => @host, :port => @port)
  return # abort this flush
end
```

evgygor commented 9 years ago

I suspect the issue is caused by FTW::Agent; it seems FTW cannot recreate the connection. So I disabled FTW and implemented Net::HTTP::Post instead. Now the plugin reconnects.
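A sketch of that Net::HTTP approach (host, port, and path here are assumed for illustration, not the exact patch): because each call opens a fresh TCP connection, a stale socket left over from an outage is never reused.

```ruby
require "net/http"
require "uri"

# Sketch of posting a batch body with Net::HTTP instead of FTW::Agent.
# The real plugin builds its URL from the configured host/port/db;
# these defaults are illustrative only.
def post_to_influxdb(body, host: "localhost", port: 8086, path: "/write")
  uri = URI::HTTP.build(host: host, port: port, path: path)
  request = Net::HTTP::Post.new(uri)
  request.body = body
  request["Content-Type"] = "application/x-www-form-urlencoded"
  # Net::HTTP.start opens a new TCP connection for each call, so there
  # is no pooled connection to go stale after an InfluxDB restart.
  Net::HTTP.start(uri.host, uri.port, open_timeout: 5, read_timeout: 10) do |http|
    http.request(request)
  end
end
```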

andrewvc commented 7 years ago

This should be fixed in v5.0.0 thanks to #55, so I'm going to close this issue. If you continue to encounter any issues, please open a new one. Thanks :)