Closed by robin-anil 8 years ago
@r-tock By default Kafka stores its offset metadata in ZK. We use the high-level consumer described here, which automatically stores offsets every 60s.
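For context, that 60s interval corresponds to the old high-level consumer's auto-commit settings. A sketch of the relevant consumer properties (Kafka 0.8.x; the values shown are the documented defaults, not values taken from this setup):

```properties
# High-level consumer offset auto-commit (Kafka 0.8.x defaults)
auto.commit.enable=true
# Commit consumed offsets to ZooKeeper every 60 seconds
auto.commit.interval.ms=60000
```

With these defaults, up to 60s of consumed-but-uncommitted offsets can be replayed (or, with an unclean shutdown, skipped) across a restart.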
Do you have a reproducible case?
Also, was this a hard restart, like kill -9?
This was the default restart, i.e. sudo service logstash restart
Maybe the shutdown bug is affecting the saving of offsets? https://github.com/logstash-plugins/logstash-input-kafka/issues/49
I am able to reproduce this any time: just stop Logstash and start it again, and all the logs in between are lost.
Can you please check your consumer offsets?
bin/kafka-run-class.sh kafka.tools.ConsumerOffsetChecker
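A minimal sketch of how that check might look; the group id and ZooKeeper quorum below are assumptions taken from this thread's config, so substitute your own. The echo prints the command rather than running it; remove it to actually invoke the tool against a live cluster.

```shell
# Assumed consumer group and ZooKeeper quorum -- substitute your own.
GROUP=logstash
ZK=zookeeper-1:2181,zookeeper-2:2181,zookeeper-3:2181

# ConsumerOffsetChecker (Kafka 0.8.x) prints one row per partition:
#   Group  Topic  Pid  Offset  logSize  Lag  Owner
# Remove "echo" to run it for real.
echo bin/kafka-run-class.sh kafka.tools.ConsumerOffsetChecker \
  --group "$GROUP" --zookeeper "$ZK"
```

A growing Lag column with a stable Offset column would point to consumers that have stopped committing, which is what this thread is trying to distinguish from offsets never being written at all.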
Also is this the only Logstash/Kafka consumer you are using with this topic?
This discussion is probably better suited for the Logstash discussion group than an issue here, as issues generally have steps to reproduce a problem, and I'm unsure if this is really a bug.
I currently have a case where the Logstash Kafka consumer is lagging behind. This happened twice this week, and every time I restarted the Logstash Kafka input, the plugin continued from the latest point, losing all the logs. Since I have the problem in my grip at the moment, I would like the folks here to tell me how I can inspect Logstash (without restarting) to help you debug what is going on.
@joekiller @suyograo
Ping. I would like to restart the Logstash instance, but I am waiting on the devs to tell me how to debug this. We are losing logs because of this bug.
Can you check that ZooKeeper has offsets written for your group id? By default this group is called logstash. Also, is the original config the one you are using now? Are you setting reset_beginning?
@r-tock are the LSs with 0 lag experiencing the loss of messages? Also, can you check your ZK? You'd have to use zkCli.sh, navigate to the consumer group and partition, and see if the offsets are written.
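As a sketch of that navigation, the high-level consumer stores per-partition offsets under a fixed znode layout. The group, topic, and partition values below are illustrative (taken from this thread), not a claim about your cluster; the zkCli.sh invocation is left commented since it needs a live quorum.

```shell
# Znode path where the Kafka 0.8.x high-level consumer commits offsets.
# GROUP/TOPIC/PARTITION are illustrative -- use your own values.
GROUP=logstash
TOPIC=demo_logs
PARTITION=0
ZNODE="/consumers/$GROUP/offsets/$TOPIC/$PARTITION"
echo "$ZNODE"

# To read the committed offset with the ZooKeeper CLI:
#   bin/zkCli.sh -server zookeeper-1:2181 get "$ZNODE"
```

If `get` on that path returns no node or a stale value, offsets were never committed for that partition, which would explain losing the in-between logs on restart.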
@r-tock Also, if you can paste some of the LS logs from when you restarted, we'd like to take a look.
@talevy The picture above shows the offsets from ZooKeeper for group logstash, as viewed in kafka-manager.
My logstash configuration is noted above in the very first post.
This is the Logstash log from the last restart 4 days ago. If I restart now, I know that Logstash will continue from the current position; that is what happened the last two times. But it will lose all the logs in the process. At some point in the lifetime of Logstash, some of the partition readers seem to get wedged. The logs below don't show any issue.
{:timestamp=>"2016-02-03T23:03:10.959000+0000", :message=>"SIGTERM received. Shutting down the pipeline.", :level=>:warn}
{:timestamp=>"2016-02-03T23:03:15.997000+0000", :level=>:warn, "INFLIGHT_EVENT_COUNT"=>{"total"=>0}, "STALLING_THREADS"=>{["LogStash::Inputs::Kafka", {"zk_connect"=>"zookeeper-1:2181,zookeeper-2:2181,zookeeper-3:2181", "topic_id"=>"demo_logs", "consumer_threads"=>1, "consumer_restart_on_error"=>"true", "consumer_restart_sleep_ms"=>100, "decorate_events"=>"true", "type"=>"demo_logs"}]=>[{"thread_id"=>16, "name"=>"<kafka", "current_call"=>"[...]/vendor/bundle/jruby/1.9/gems/logstash-input-kafka-2.0.3/lib/logstash/inputs/kafka.rb:139:in `pop'"}], ["LogStash::Inputs::Kafka", {"zk_connect"=>"zookeeper-1:2181,zookeeper-2:2181,zookeeper-3:2181", "topic_id"=>"demo_exchange_logs", "consumer_threads"=>1, "consumer_restart_on_error"=>"true", "consumer_restart_sleep_ms"=>100, "decorate_events"=>"true", "type"=>"demo_ex_logs"}]=>[{"thread_id"=>17, "name"=>"<kafka", "current_call"=>"[...]/vendor/bundle/jruby/1.9/gems/logstash-input-kafka-2.0.3/lib/logstash/inputs/kafka.rb:139:in `pop'"}], ["LogStash::Inputs::Kafka", {"zk_connect"=>"zookeeper-1:2181,zookeeper-2:2181,zookeeper-3:2181", "topic_id"=>"local_exchange_logs", "reset_beginning"=>"true", "consumer_threads"=>1, "consumer_restart_on_error"=>"true", "consumer_restart_sleep_ms"=>100, "decorate_events"=>"true", "type"=>"local_ex_logs"}]=>[{"thread_id"=>18, "name"=>"<kafka", "current_call"=>"[...]/vendor/bundle/jruby/1.9/gems/logstash-input-kafka-2.0.3/lib/logstash/inputs/kafka.rb:139:in `pop'"}], ["LogStash::Inputs::Kafka", {"zk_connect"=>"zookeeper-1:2181,zookeeper-2:2181,zookeeper-3:2181", "topic_id"=>"prod_logs", "consumer_threads"=>1, "consumer_restart_on_error"=>"true", "consumer_restart_sleep_ms"=>100, "decorate_events"=>"true", "type"=>"prod_logs"}]=>[{"thread_id"=>19, "name"=>"<kafka", "current_call"=>"[...]/vendor/bundle/jruby/1.9/gems/logstash-input-kafka-2.0.3/lib/logstash/inputs/kafka.rb:139:in `pop'"}], ["LogStash::Inputs::Kafka", {"zk_connect"=>"zookeeper-1:2181,zookeeper-2:2181,zookeeper-3:2181", 
"topic_id"=>"prod_exchange_logs", "consumer_threads"=>1, "consumer_restart_on_error"=>"true", "consumer_restart_sleep_ms"=>100, "decorate_events"=>"true", "type"=>"prod_ex_logs"}]=>[{"thread_id"=>20, "name"=>"<kafka", "current_call"=>"[...]/vendor/bundle/jruby/1.9/gems/logstash-input-kafka-2.0.3/lib/logstash/inputs/kafka.rb:139:in `pop'"}], ["LogStash::Inputs::Kafka", {"zk_connect"=>"zookeeper-1:2181,zookeeper-2:2181,zookeeper-3:2181", "topic_id"=>"staging_logs", "consumer_threads"=>1, "consumer_restart_on_error"=>"true", "consumer_restart_sleep_ms"=>100, "decorate_events"=>"true", "type"=>"staging_logs"}]=>[{"thread_id"=>21, "name"=>"<kafka", "current_call"=>"[...]/vendor/bundle/jruby/1.9/gems/logstash-input-kafka-2.0.3/lib/logstash/inputs/kafka.rb:139:in `pop'"}], ["LogStash::Inputs::Kafka", {"zk_connect"=>"zookeeper-1:2181,zookeeper-2:2181,zookeeper-3:2181", "topic_id"=>"staging_exchange_logs", "consumer_threads"=>1, "consumer_restart_on_error"=>"true", "consumer_restart_sleep_ms"=>100, "decorate_events"=>"true", "type"=>"staging_ex_logs"}]=>[{"thread_id"=>22, "name"=>"<kafka", "current_call"=>"[...]/vendor/bundle/jruby/1.9/gems/logstash-input-kafka-2.0.3/lib/logstash/inputs/kafka.rb:139:in `pop'"}], ["LogStash::Inputs::Syslog", {"type"=>"syslog", "port"=>10000}]=>[{"thread_id"=>24, "name"=>"<syslog", "current_call"=>"[...]/vendor/bundle/jruby/1.9/gems/logstash-input-syslog-2.0.2/lib/logstash/inputs/syslog.rb:106:in `join'"}]}}
{:timestamp=>"2016-02-03T23:03:16.010000+0000", :message=>"The shutdown process appears to be stalled due to busy or blocked plugins. Check the logs for more information.", :level=>:error}
{:timestamp=>"2016-02-03T23:03:20.973000+0000", :level=>:warn, "INFLIGHT_EVENT_COUNT"=>{"total"=>0}, "STALLING_THREADS"=>{["LogStash::Inputs::Kafka", {"zk_connect"=>"zookeeper-1:2181,zookeeper-2:2181,zookeeper-3:2181", "topic_id"=>"demo_logs", "consumer_threads"=>1, "consumer_restart_on_error"=>"true", "consumer_restart_sleep_ms"=>100, "decorate_events"=>"true", "type"=>"demo_logs"}]=>[{"thread_id"=>16, "name"=>"<kafka", "current_call"=>"[...]/vendor/bundle/jruby/1.9/gems/logstash-input-kafka-2.0.3/lib/logstash/inputs/kafka.rb:139:in `pop'"}], ["LogStash::Inputs::Kafka", {"zk_connect"=>"zookeeper-1:2181,zookeeper-2:2181,zookeeper-3:2181", "topic_id"=>"demo_exchange_logs", "consumer_threads"=>1, "consumer_restart_on_error"=>"true", "consumer_restart_sleep_ms"=>100, "decorate_events"=>"true", "type"=>"demo_ex_logs"}]=>[{"thread_id"=>17, "name"=>"<kafka", "current_call"=>"[...]/vendor/bundle/jruby/1.9/gems/logstash-input-kafka-2.0.3/lib/logstash/inputs/kafka.rb:139:in `pop'"}], ["LogStash::Inputs::Kafka", {"zk_connect"=>"zookeeper-1:2181,zookeeper-2:2181,zookeeper-3:2181", "topic_id"=>"local_exchange_logs", "reset_beginning"=>"true", "consumer_threads"=>1, "consumer_restart_on_error"=>"true", "consumer_restart_sleep_ms"=>100, "decorate_events"=>"true", "type"=>"local_ex_logs"}]=>[{"thread_id"=>18, "name"=>"<kafka", "current_call"=>"[...]/vendor/bundle/jruby/1.9/gems/logstash-input-kafka-2.0.3/lib/logstash/inputs/kafka.rb:139:in `pop'"}], ["LogStash::Inputs::Kafka", {"zk_connect"=>"zookeeper-1:2181,zookeeper-2:2181,zookeeper-3:2181", "topic_id"=>"prod_logs", "consumer_threads"=>1, "consumer_restart_on_error"=>"true", "consumer_restart_sleep_ms"=>100, "decorate_events"=>"true", "type"=>"prod_logs"}]=>[{"thread_id"=>19, "name"=>"<kafka", "current_call"=>"[...]/vendor/bundle/jruby/1.9/gems/logstash-input-kafka-2.0.3/lib/logstash/inputs/kafka.rb:139:in `pop'"}], ["LogStash::Inputs::Kafka", {"zk_connect"=>"zookeeper-1:2181,zookeeper-2:2181,zookeeper-3:2181", 
"topic_id"=>"prod_exchange_logs", "consumer_threads"=>1, "consumer_restart_on_error"=>"true", "consumer_restart_sleep_ms"=>100, "decorate_events"=>"true", "type"=>"prod_ex_logs"}]=>[{"thread_id"=>20, "name"=>"<kafka", "current_call"=>"[...]/vendor/bundle/jruby/1.9/gems/logstash-input-kafka-2.0.3/lib/logstash/inputs/kafka.rb:139:in `pop'"}], ["LogStash::Inputs::Kafka", {"zk_connect"=>"zookeeper-1:2181,zookeeper-2:2181,zookeeper-3:2181", "topic_id"=>"staging_logs", "consumer_threads"=>1, "consumer_restart_on_error"=>"true", "consumer_restart_sleep_ms"=>100, "decorate_events"=>"true", "type"=>"staging_logs"}]=>[{"thread_id"=>21, "name"=>"<kafka", "current_call"=>"[...]/vendor/bundle/jruby/1.9/gems/logstash-input-kafka-2.0.3/lib/logstash/inputs/kafka.rb:139:in `pop'"}], ["LogStash::Inputs::Kafka", {"zk_connect"=>"zookeeper-1:2181,zookeeper-2:2181,zookeeper-3:2181", "topic_id"=>"staging_exchange_logs", "consumer_threads"=>1, "consumer_restart_on_error"=>"true", "consumer_restart_sleep_ms"=>100, "decorate_events"=>"true", "type"=>"staging_ex_logs"}]=>[{"thread_id"=>22, "name"=>"<kafka", "current_call"=>"[...]/vendor/bundle/jruby/1.9/gems/logstash-input-kafka-2.0.3/lib/logstash/inputs/kafka.rb:139:in `pop'"}], ["LogStash::Inputs::Syslog", {"type"=>"syslog", "port"=>10000}]=>[{"thread_id"=>24, "name"=>"<syslog", "current_call"=>"[...]/vendor/bundle/jruby/1.9/gems/logstash-input-syslog-2.0.2/lib/logstash/inputs/syslog.rb:106:in `join'"}]}}
{:timestamp=>"2016-02-03T23:03:25.982000+0000", :level=>:warn, "INFLIGHT_EVENT_COUNT"=>{"total"=>0}, "STALLING_THREADS"=>{["LogStash::Inputs::Kafka", {"zk_connect"=>"zookeeper-1:2181,zookeeper-2:2181,zookeeper-3:2181", "topic_id"=>"demo_logs", "consumer_threads"=>1, "consumer_restart_on_error"=>"true", "consumer_restart_sleep_ms"=>100, "decorate_events"=>"true", "type"=>"demo_logs"}]=>[{"thread_id"=>16, "name"=>"<kafka", "current_call"=>"[...]/vendor/bundle/jruby/1.9/gems/logstash-input-kafka-2.0.3/lib/logstash/inputs/kafka.rb:139:in `pop'"}], ["LogStash::Inputs::Kafka", {"zk_connect"=>"zookeeper-1:2181,zookeeper-2:2181,zookeeper-3:2181", "topic_id"=>"demo_exchange_logs", "consumer_threads"=>1, "consumer_restart_on_error"=>"true", "consumer_restart_sleep_ms"=>100, "decorate_events"=>"true", "type"=>"demo_ex_logs"}]=>[{"thread_id"=>17, "name"=>"<kafka", "current_call"=>"[...]/vendor/bundle/jruby/1.9/gems/logstash-input-kafka-2.0.3/lib/logstash/inputs/kafka.rb:139:in `pop'"}], ["LogStash::Inputs::Kafka", {"zk_connect"=>"zookeeper-1:2181,zookeeper-2:2181,zookeeper-3:2181", "topic_id"=>"local_exchange_logs", "reset_beginning"=>"true", "consumer_threads"=>1, "consumer_restart_on_error"=>"true", "consumer_restart_sleep_ms"=>100, "decorate_events"=>"true", "type"=>"local_ex_logs"}]=>[{"thread_id"=>18, "name"=>"<kafka", "current_call"=>"[...]/vendor/bundle/jruby/1.9/gems/logstash-input-kafka-2.0.3/lib/logstash/inputs/kafka.rb:139:in `pop'"}], ["LogStash::Inputs::Kafka", {"zk_connect"=>"zookeeper-1:2181,zookeeper-2:2181,zookeeper-3:2181", "topic_id"=>"prod_logs", "consumer_threads"=>1, "consumer_restart_on_error"=>"true", "consumer_restart_sleep_ms"=>100, "decorate_events"=>"true", "type"=>"prod_logs"}]=>[{"thread_id"=>19, "name"=>"<kafka", "current_call"=>"[...]/vendor/bundle/jruby/1.9/gems/logstash-input-kafka-2.0.3/lib/logstash/inputs/kafka.rb:139:in `pop'"}], ["LogStash::Inputs::Kafka", {"zk_connect"=>"zookeeper-1:2181,zookeeper-2:2181,zookeeper-3:2181", 
"topic_id"=>"prod_exchange_logs", "consumer_threads"=>1, "consumer_restart_on_error"=>"true", "consumer_restart_sleep_ms"=>100, "decorate_events"=>"true", "type"=>"prod_ex_logs"}]=>[{"thread_id"=>20, "name"=>"<kafka", "current_call"=>"[...]/vendor/bundle/jruby/1.9/gems/logstash-input-kafka-2.0.3/lib/logstash/inputs/kafka.rb:139:in `pop'"}], ["LogStash::Inputs::Kafka", {"zk_connect"=>"zookeeper-1:2181,zookeeper-2:2181,zookeeper-3:2181", "topic_id"=>"staging_logs", "consumer_threads"=>1, "consumer_restart_on_error"=>"true", "consumer_restart_sleep_ms"=>100, "decorate_events"=>"true", "type"=>"staging_logs"}]=>[{"thread_id"=>21, "name"=>"<kafka", "current_call"=>"[...]/vendor/bundle/jruby/1.9/gems/logstash-input-kafka-2.0.3/lib/logstash/inputs/kafka.rb:139:in `pop'"}], ["LogStash::Inputs::Kafka", {"zk_connect"=>"zookeeper-1:2181,zookeeper-2:2181,zookeeper-3:2181", "topic_id"=>"staging_exchange_logs", "consumer_threads"=>1, "consumer_restart_on_error"=>"true", "consumer_restart_sleep_ms"=>100, "decorate_events"=>"true", "type"=>"staging_ex_logs"}]=>[{"thread_id"=>22, "name"=>"<kafka", "current_call"=>"[...]/vendor/bundle/jruby/1.9/gems/logstash-input-kafka-2.0.3/lib/logstash/inputs/kafka.rb:139:in `pop'"}], ["LogStash::Inputs::Syslog", {"type"=>"syslog", "port"=>10000}]=>[{"thread_id"=>24, "name"=>"<syslog", "current_call"=>"[...]/vendor/bundle/jruby/1.9/gems/logstash-input-syslog-2.0.2/lib/logstash/inputs/syslog.rb:106:in `join'"}]}}
{:timestamp=>"2016-02-03T23:03:30.982000+0000", :level=>:warn, "INFLIGHT_EVENT_COUNT"=>{"total"=>0}, "STALLING_THREADS"=>{["LogStash::Inputs::Kafka", {"zk_connect"=>"zookeeper-1:2181,zookeeper-2:2181,zookeeper-3:2181", "topic_id"=>"demo_logs", "consumer_threads"=>1, "consumer_restart_on_error"=>"true", "consumer_restart_sleep_ms"=>100, "decorate_events"=>"true", "type"=>"demo_logs"}]=>[{"thread_id"=>16, "name"=>"<kafka", "current_call"=>"[...]/vendor/bundle/jruby/1.9/gems/logstash-input-kafka-2.0.3/lib/logstash/inputs/kafka.rb:139:in `pop'"}], ["LogStash::Inputs::Kafka", {"zk_connect"=>"zookeeper-1:2181,zookeeper-2:2181,zookeeper-3:2181", "topic_id"=>"demo_exchange_logs", "consumer_threads"=>1, "consumer_restart_on_error"=>"true", "consumer_restart_sleep_ms"=>100, "decorate_events"=>"true", "type"=>"demo_ex_logs"}]=>[{"thread_id"=>17, "name"=>"<kafka", "current_call"=>"[...]/vendor/bundle/jruby/1.9/gems/logstash-input-kafka-2.0.3/lib/logstash/inputs/kafka.rb:139:in `pop'"}], ["LogStash::Inputs::Kafka", {"zk_connect"=>"zookeeper-1:2181,zookeeper-2:2181,zookeeper-3:2181", "topic_id"=>"local_exchange_logs", "reset_beginning"=>"true", "consumer_threads"=>1, "consumer_restart_on_error"=>"true", "consumer_restart_sleep_ms"=>100, "decorate_events"=>"true", "type"=>"local_ex_logs"}]=>[{"thread_id"=>18, "name"=>"<kafka", "current_call"=>"[...]/vendor/bundle/jruby/1.9/gems/logstash-input-kafka-2.0.3/lib/logstash/inputs/kafka.rb:139:in `pop'"}], ["LogStash::Inputs::Kafka", {"zk_connect"=>"zookeeper-1:2181,zookeeper-2:2181,zookeeper-3:2181", "topic_id"=>"prod_logs", "consumer_threads"=>1, "consumer_restart_on_error"=>"true", "consumer_restart_sleep_ms"=>100, "decorate_events"=>"true", "type"=>"prod_logs"}]=>[{"thread_id"=>19, "name"=>"<kafka", "current_call"=>"[...]/vendor/bundle/jruby/1.9/gems/logstash-input-kafka-2.0.3/lib/logstash/inputs/kafka.rb:139:in `pop'"}], ["LogStash::Inputs::Kafka", {"zk_connect"=>"zookeeper-1:2181,zookeeper-2:2181,zookeeper-3:2181", 
"topic_id"=>"prod_exchange_logs", "consumer_threads"=>1, "consumer_restart_on_error"=>"true", "consumer_restart_sleep_ms"=>100, "decorate_events"=>"true", "type"=>"prod_ex_logs"}]=>[{"thread_id"=>20, "name"=>"<kafka", "current_call"=>"[...]/vendor/bundle/jruby/1.9/gems/logstash-input-kafka-2.0.3/lib/logstash/inputs/kafka.rb:139:in `pop'"}], ["LogStash::Inputs::Kafka", {"zk_connect"=>"zookeeper-1:2181,zookeeper-2:2181,zookeeper-3:2181", "topic_id"=>"staging_logs", "consumer_threads"=>1, "consumer_restart_on_error"=>"true", "consumer_restart_sleep_ms"=>100, "decorate_events"=>"true", "type"=>"staging_logs"}]=>[{"thread_id"=>21, "name"=>"<kafka", "current_call"=>"[...]/vendor/bundle/jruby/1.9/gems/logstash-input-kafka-2.0.3/lib/logstash/inputs/kafka.rb:139:in `pop'"}], ["LogStash::Inputs::Kafka", {"zk_connect"=>"zookeeper-1:2181,zookeeper-2:2181,zookeeper-3:2181", "topic_id"=>"staging_exchange_logs", "consumer_threads"=>1, "consumer_restart_on_error"=>"true", "consumer_restart_sleep_ms"=>100, "decorate_events"=>"true", "type"=>"staging_ex_logs"}]=>[{"thread_id"=>22, "name"=>"<kafka", "current_call"=>"[...]/vendor/bundle/jruby/1.9/gems/logstash-input-kafka-2.0.3/lib/logstash/inputs/kafka.rb:139:in `pop'"}], ["LogStash::Inputs::Syslog", {"type"=>"syslog", "port"=>10000}]=>[{"thread_id"=>24, "name"=>"<syslog", "current_call"=>"[...]/vendor/bundle/jruby/1.9/gems/logstash-input-syslog-2.0.2/lib/logstash/inputs/syslog.rb:106:in `join'"}]}}
{:timestamp=>"2016-02-03T23:03:32.621000+0000", :message=>"SIGTERM received. Shutting down the pipeline.", :level=>:warn}
{:timestamp=>"2016-02-03T23:03:58.335000+0000", :message=>"kafka client threw exception, restarting", :exception=>kafka.common.ConsumerRebalanceFailedException: logstash_logstash.c.rapid-depot-817.internal-1454540628335-cc55ed4a can't rebalance after 4 retries, :level=>:warn}
{:timestamp=>"2016-02-03T23:03:58.334000+0000", :message=>"kafka client threw exception, restarting", :exception=>kafka.common.ConsumerRebalanceFailedException: logstash_logstash.c.rapid-depot-817.internal-1454540628334-80251e75 can't rebalance after 4 retries, :level=>:warn}
{:timestamp=>"2016-02-03T23:03:58.359000+0000", :message=>"kafka client threw exception, restarting", :exception=>kafka.common.ConsumerRebalanceFailedException: logstash_logstash.c.rapid-depot-817.internal-1454540628410-23644bc2 can't rebalance after 4 retries, :level=>:warn}
{:timestamp=>"2016-02-03T23:03:58.380000+0000", :message=>"kafka client threw exception, restarting", :exception=>kafka.common.ConsumerRebalanceFailedException: logstash_logstash.c.rapid-depot-817.internal-1454540628336-ed1539b2 can't rebalance after 4 retries, :level=>:warn}
{:timestamp=>"2016-02-03T23:03:58.382000+0000", :message=>"kafka client threw exception, restarting", :exception=>kafka.common.ConsumerRebalanceFailedException: logstash_logstash.c.rapid-depot-817.internal-1454540628330-fdd175c0 can't rebalance after 4 retries, :level=>:warn}
{:timestamp=>"2016-02-03T23:03:58.392000+0000", :message=>"kafka client threw exception, restarting", :exception=>kafka.common.ConsumerRebalanceFailedException: logstash_logstash.c.rapid-depot-817.internal-1454540628336-f50f699f can't rebalance after 4 retries, :level=>:warn}
This is the current state of all our consumers for group logstash. We have only one Logstash consumer instance, by the way, and all of the consumers are reading from 20 partitions over about 10 topics.
@r-tock can you remove consumer_restart_on_error? I'm not sure if this is related to the issue; I want to eliminate the possibility.
I say this because of logs:
{:timestamp=>"2016-02-03T23:03:58.335000+0000", :message=>"kafka client threw exception, restarting", :exception=>kafka.common.ConsumerRebalanceFailedException: logstash_logstash.c.rapid-depot-817.internal-1454540628335-cc55ed4a can't rebalance after 4 retries, :level=>:warn}
{:timestamp=>"2016-02-03T23:03:58.334000+0000", :message=>"kafka client threw exception, restarting", :exception=>kafka.common.ConsumerRebalanceFailedException: logstash_logstash.c.rapid-depot-817.internal-1454540628334-80251e75 can't rebalance after 4 retries, :level=>:warn}
{:timestamp=>"2016-02-03T23:03:58.359000+0000", :message=>"kafka client threw exception, restarting", :exception=>kafka.common.ConsumerRebalanceFailedException: logstash_logstash.c.rapid-depot-817.internal-1454540628410-23644bc2 can't rebalance after 4 retries, :level=>:warn}
{:timestamp=>"2016-02-03T23:03:58.380000+0000", :message=>"kafka client threw exception, restarting", :exception=>kafka.common.ConsumerRebalanceFailedException: logstash_logstash.c.rapid-depot-817.internal-1454540628336-ed1539b2 can't rebalance after 4 retries, :level=>:warn}
{:timestamp=>"2016-02-03T23:03:58.382000+0000", :message=>"kafka client threw exception, restarting", :exception=>kafka.common.ConsumerRebalanceFailedException: logstash_logstash.c.rapid-depot-817.internal-1454540628330-fdd175c0 can't rebalance after 4 retries, :level=>:warn}
{:timestamp=>"2016-02-03T23:03:58.392000+0000", :message=>"kafka client threw exception, restarting", :exception=>kafka.common.ConsumerRebalanceFailedException: logstash_logstash.c.rapid-depot-817.internal-1454540628336-f50f699f can't rebalance after 4 retries, :level=>:warn}
Ok. Do you want me to restart the instance now after removing consumer_restart_on_error, or is there anything else I can help with before the instance is restarted? The problem usually does not surface for a few days, so I want to take advantage while I have it on a tight leash.
Please confirm you want me to restart logstash with
input {
  kafka {
    zk_connect => "zookeeper-1:2181,zookeeper-2:2181,zookeeper-3:2181"
    topic_id => "logs"
    consumer_threads => 1
    consumer_restart_sleep_ms => 100
    decorate_events => true
    type => "logs"
  }
}
@r-tock hang on, debugging
Regarding the unclean shutdown logs you are seeing: we have fixed this in the latest version of the plugin, so you should not have to force a shutdown as you are doing anymore; the pipeline shouldn't stall.
bin/plugin update logstash-input-kafka
will update you to version 2.0.4.
@talevy I tried that before, 2 weeks ago I believe, and got a Ruby gem install error. So I rolled back to 2.0.3. In hindsight I should have reported that; I just didn't have the time to follow up on it.
@r-tock Can you try bin/plugin install --version 2.0.4 logstash-input-kafka
Here is what I get
Validating logstash-input-kafka-2.0.4
Installing logstash-input-kafka
Plugin version conflict, aborting
ERROR: Installation Aborted, message: Bundler could not find compatible versions for gem "jruby-kafka":
In snapshot (Gemfile.lock):
jruby-kafka (= 1.4.0)
In Gemfile:
logstash-input-kafka (= 2.0.4) java depends on
jruby-kafka (= 1.5.0) java
Running `bundle update` will rebuild your snapshot from scratch, using only
the gems in your Gemfile, which may resolve the conflict.
It seems I cannot run bundle update because this is within the binary distribution.
@r-tock can you live debug with me and @talevy ?
https://gist.github.com/r-tock/f97b695d21574b4a69cf Please find the jstack output for the process in the gist above
@suyograo @talevy Did you guys make any progress? Was the stack trace helpful? Let me know if there is anything else I can do before I upgrade and restart the instance with consumer_restart_on_error removed.
Alright, at this point I am going to restart the server, upgrading the plugin and removing consumer_restart_on_error.
I can't see anything specific to the issue in the jstack output. That plan sounds good.
Post-upgrade, the restart is working better: Logstash does not stall as long, but it still stalls.
However, it is dropping events after a restart. See the picture: the first big empty block is the upgrade and restart, and the second one is a restart of the new instance.
The logs warn about stalling threads and still show the consumer rebalance exception. I also issued a partition reassignment request on Kafka yesterday, and I cannot see anything weird on the Kafka side.
Also note that the rebalance exception is for consumers that no longer exist. When I look at the consumer info on Kafka, these instance owner ids don't match what is currently running.
{:timestamp=>"2016-02-10T18:16:02.928000+0000", :message=>"SIGTERM received. Shutting down the pipeline.", :level=>:warn}
{:timestamp=>"2016-02-10T18:16:07.953000+0000", :level=>:warn, "INFLIGHT_EVENT_COUNT"=>{"total"=>0}, "STALLING_THREADS"=>{["LogStash::Inputs::Kafka", {"zk_connect"=>"zookeeper-1:2181,zookeeper-2:2181,zookeeper-3:2181", "topic_id"=>"prod_logs", "consumer_threads"=>1, "consumer_restart_sleep_ms"=>100, "decorate_events"=>"true", "type"=>"prod_logs"}]=>[{"thread_id"=>18, "name"=>"<kafka", "current_call"=>"[...]/vendor/bundle/jruby/1.9/gems/logstash-input-kafka-2.0.4/lib/logstash/inputs/kafka.rb:144:in `pop'"}], ["LogStash::Inputs::Kafka", {"zk_connect"=>"zookeeper-1:2181,zookeeper-2:2181,zookeeper-3:2181", "topic_id"=>"prod_exchange_logs", "consumer_threads"=>1, "consumer_restart_on_error"=>"true", "decorate_events"=>"true", "type"=>"prod_ex_logs"}]=>[{"thread_id"=>19, "name"=>"<kafka", "current_call"=>"[...]/vendor/bundle/jruby/1.9/gems/logstash-input-kafka-2.0.4/lib/logstash/inputs/kafka.rb:144:in `pop'"}], ["LogStash::Inputs::Kafka", {"zk_connect"=>"zookeeper-1:2181,zookeeper-2:2181,zookeeper-3:2181", "topic_id"=>"staging_logs", "consumer_threads"=>1, "consumer_restart_sleep_ms"=>100, "decorate_events"=>"true", "type"=>"staging_logs"}]=>[{"thread_id"=>20, "name"=>"<kafka", "current_call"=>"[...]/vendor/bundle/jruby/1.9/gems/logstash-input-kafka-2.0.4/lib/logstash/inputs/kafka.rb:144:in `pop'"}], ["LogStash::Inputs::Kafka", {"zk_connect"=>"zookeeper-1:2181,zookeeper-2:2181,zookeeper-3:2181", "topic_id"=>"staging_exchange_logs", "consumer_threads"=>1, "consumer_restart_sleep_ms"=>100, "decorate_events"=>"true", "type"=>"staging_ex_logs"}]=>[{"thread_id"=>21, "name"=>"<kafka", "current_call"=>"[...]/vendor/bundle/jruby/1.9/gems/logstash-input-kafka-2.0.4/lib/logstash/inputs/kafka.rb:144:in `pop'"}], ["LogStash::Inputs::Syslog", {"type"=>"syslog", "port"=>10000}]=>[{"thread_id"=>22, "name"=>"<syslog", "current_call"=>"[...]/vendor/bundle/jruby/1.9/gems/logstash-input-syslog-2.0.2/lib/logstash/inputs/syslog.rb:106:in `join'"}]}}
{:timestamp=>"2016-02-10T18:16:07.958000+0000", :message=>"The shutdown process appears to be stalled due to busy or blocked plugins. Check the logs for more information.", :level=>:error}
{:timestamp=>"2016-02-10T18:16:12.934000+0000", :level=>:warn, "INFLIGHT_EVENT_COUNT"=>{"total"=>0}, "STALLING_THREADS"=>{["LogStash::Inputs::Kafka", {"zk_connect"=>"zookeeper-1:2181,zookeeper-2:2181,zookeeper-3:2181", "topic_id"=>"staging_logs", "consumer_threads"=>1, "consumer_restart_sleep_ms"=>100, "decorate_events"=>"true", "type"=>"staging_logs"}]=>[{"thread_id"=>20, "name"=>"<kafka", "current_call"=>"[...]/vendor/bundle/jruby/1.9/gems/logstash-input-kafka-2.0.4/lib/logstash/inputs/kafka.rb:144:in `pop'"}], ["LogStash::Inputs::Kafka", {"zk_connect"=>"zookeeper-1:2181,zookeeper-2:2181,zookeeper-3:2181", "topic_id"=>"staging_exchange_logs", "consumer_threads"=>1, "consumer_restart_sleep_ms"=>100, "decorate_events"=>"true", "type"=>"staging_ex_logs"}]=>[{"thread_id"=>21, "name"=>"<kafka", "current_call"=>"[...]/vendor/bundle/jruby/1.9/gems/logstash-input-kafka-2.0.4/lib/logstash/inputs/kafka.rb:144:in `pop'"}], ["LogStash::Inputs::Syslog", {"type"=>"syslog", "port"=>10000}]=>[{"thread_id"=>22, "name"=>"<syslog", "current_call"=>"[...]/vendor/bundle/jruby/1.9/gems/logstash-input-syslog-2.0.2/lib/logstash/inputs/syslog.rb:106:in `join'"}]}}
{:timestamp=>"2016-02-10T18:16:17.935000+0000", :level=>:warn, "INFLIGHT_EVENT_COUNT"=>{"total"=>0}, "STALLING_THREADS"=>{["LogStash::Inputs::Kafka", {"zk_connect"=>"zookeeper-1:2181,zookeeper-2:2181,zookeeper-3:2181", "topic_id"=>"staging_logs", "consumer_threads"=>1, "consumer_restart_sleep_ms"=>100, "decorate_events"=>"true", "type"=>"staging_logs"}]=>[{"thread_id"=>20, "name"=>"<kafka", "current_call"=>"[...]/vendor/bundle/jruby/1.9/gems/logstash-input-kafka-2.0.4/lib/logstash/inputs/kafka.rb:144:in `pop'"}], ["LogStash::Inputs::Kafka", {"zk_connect"=>"zookeeper-1:2181,zookeeper-2:2181,zookeeper-3:2181", "topic_id"=>"staging_exchange_logs", "consumer_threads"=>1, "consumer_restart_sleep_ms"=>100, "decorate_events"=>"true", "type"=>"staging_ex_logs"}]=>[{"thread_id"=>21, "name"=>"<kafka", "current_call"=>"[...]/vendor/bundle/jruby/1.9/gems/logstash-input-kafka-2.0.4/lib/logstash/inputs/kafka.rb:144:in `pop'"}], ["LogStash::Inputs::Syslog", {"type"=>"syslog", "port"=>10000}]=>[{"thread_id"=>22, "name"=>"<syslog", "current_call"=>"[...]/vendor/bundle/jruby/1.9/gems/logstash-input-syslog-2.0.2/lib/logstash/inputs/syslog.rb:106:in `join'"}]}}
{:timestamp=>"2016-02-10T18:16:17.937000+0000", :message=>"The shutdown process appears to be stalled due to busy or blocked plugins. Check the logs for more information.", :level=>:error}
{:timestamp=>"2016-02-10T18:16:22.934000+0000", :level=>:warn, "INFLIGHT_EVENT_COUNT"=>{"total"=>0}, "STALLING_THREADS"=>{["LogStash::Inputs::Syslog", {"type"=>"syslog", "port"=>10000}]=>[{"thread_id"=>22, "name"=>"<syslog", "current_call"=>"[...]/vendor/bundle/jruby/1.9/gems/logstash-input-syslog-2.0.2/lib/logstash/inputs/syslog.rb:106:in `join'"}]}}
{:timestamp=>"2016-02-10T18:16:27.936000+0000", :level=>:warn, "INFLIGHT_EVENT_COUNT"=>{"input_to_filter"=>1, "total"=>1}, "STALLING_THREADS"=>{"other"=>[{"thread_id"=>80, "name"=>">output", "current_call"=>"[...]/vendor/bundle/jruby/1.9/gems/logstash-output-elasticsearch-2.4.1-java/lib/logstash/outputs/elasticsearch/buffer.rb:51:in `join'"}]}}
{:timestamp=>"2016-02-10T18:17:30.465000+0000", :message=>"kafka client threw exception, restarting", :exception=>kafka.common.ConsumerRebalanceFailedException: logstash_logstash.c.rapid-depot-817.internal-1455128240517-d3980226 can't rebalance after 4 retries, :level=>:warn}
{:timestamp=>"2016-02-10T18:17:30.488000+0000", :message=>"kafka client threw exception, restarting", :exception=>kafka.common.ConsumerRebalanceFailedException: logstash_logstash.c.rapid-depot-817.internal-1455128240515-62c925a7 can't rebalance after 4 retries, :level=>:warn}
{:timestamp=>"2016-02-10T18:17:30.498000+0000", :message=>"kafka client threw exception, restarting", :exception=>kafka.common.ConsumerRebalanceFailedException: logstash_logstash.c.rapid-depot-817.internal-1455128240517-a4fec7e7 can't rebalance after 4 retries, :level=>:warn}
{:timestamp=>"2016-02-10T18:17:30.501000+0000", :message=>"kafka client threw exception, restarting", :exception=>kafka.common.ConsumerRebalanceFailedException: logstash_logstash.c.rapid-depot-817.internal-1455128240517-183843e1 can't rebalance after 4 retries, :level=>:warn}
{:timestamp=>"2016-02-10T18:17:30.601000+0000", :message=>"kafka client threw exception, restarting", :exception=>kafka.common.ConsumerRebalanceFailedException: logstash_logstash.c.rapid-depot-817.internal-1455128240598-6c90460f can't rebalance after 4 retries, :level=>:warn}
{:timestamp=>"2016-02-10T18:17:30.603000+0000", :message=>"kafka client threw exception, restarting", :exception=>kafka.common.ConsumerRebalanceFailedException: logstash_logstash.c.rapid-depot-817.internal-1455128240492-69a5fb05 can't rebalance after 4 retries, :level=>:warn}
Depending on the number of partitions you have and where the data resides in the Kafka queue you may see different log times appear in elasticsearch has the processes catch up. Check your group offsets. On Feb 10, 2016 1:22 PM, "Robin Anil" notifications@github.com wrote:
post upgrade the restart is working better logstash does not stall as long but it still does.
However it is dropping events post a restart See the picture. The first big empty block is the upgrade and restart and the second one is a restart with the new instance. [image: screen shot 2016-02-10 at 12 17 55 pm] https://cloud.githubusercontent.com/assets/11711723/12957019/6fa65344-cff0-11e5-9abb-17c8d454cd15.png
Logs are warning about stalling threads and still have consumer rebalance exception. I also issued a partition reassignment request on kafka yesterday, so I cannot see anything weird on the kafka side.
Also note that the rebalance exception is for consumers that no longer exist. When I look at the consumer info on kafka, these instance owner ids don't match what is running currently.
{:timestamp=>"2016-02-10T18:16:02.928000+0000", :message=>"SIGTERM received. Shutting down the pipeline.", :level=>:warn}
{:timestamp=>"2016-02-10T18:16:07.953000+0000", :level=>:warn, "INFLIGHT_EVENT_COUNT"=>{"total"=>0}, "STALLING_THREADS"=>{["LogStash::Inputs::Kafka", {"zk_connect"=>"zookeeper-1:2181,zookeeper-2:2181,zookeeper-3:2181", "topic_id"=>"prod_logs", "consumer_threads"=>1, "consumer_restart_sleep_ms"=>100, "decorate_events"=>"true", "type"=>"prod_logs"}]=>[{"thread_id"=>18, "name"=>"<kafka", "current_call"=>"[...]/vendor/bundle/jruby/1.9/gems/logstash-input-kafka-2.0.4/lib/logstash/inputs/kafka.rb:144:in `pop'"}], ["LogStash::Inputs::Kafka", {"zk_connect"=>"zookeeper-1:2181,zookeeper-2:2181,zookeeper-3:2181", "topic_id"=>"prod_exchange_logs", "consumer_threads"=>1, "consumer_restart_on_error"=>"true", "decorate_events"=>"true", "type"=>"prod_ex_logs"}]=>[{"thread_id"=>19, "name"=>"<kafka", "current_call"=>"[...]/vendor/bundle/jruby/1.9/gems/logstash-input-kafka-2.0.4/lib/logstash/inputs/kafka.rb:144:in `pop'"}], ["LogStash::Inputs::Kafka", {"zk_connect"=>"zookeeper-1:2181,zookeeper-2:2181,zookeeper-3:2181", "topic_id"=>"staging_logs", "consumer_threads"=>1, "consumer_restart_sleep_ms"=>100, "decorate_events"=>"true", "type"=>"staging_logs"}]=>[{"thread_id"=>20, "name"=>"<kafka", "current_call"=>"[...]/vendor/bundle/jruby/1.9/gems/logstash-input-kafka-2.0.4/lib/logstash/inputs/kafka.rb:144:in `pop'"}], ["LogStash::Inputs::Kafka", {"zk_connect"=>"zookeeper-1:2181,zookeeper-2:2181,zookeeper-3:2181", "topic_id"=>"staging_exchange_logs", "consumer_threads"=>1, "consumer_restart_sleep_ms"=>100, "decorate_events"=>"true", "type"=>"staging_ex_logs"}]=>[{"thread_id"=>21, "name"=>"<kafka", "current_call"=>"[...]/vendor/bundle/jruby/1.9/gems/logstash-input-kafka-2.0.4/lib/logstash/inputs/kafka.rb:144:in `pop'"}], ["LogStash::Inputs::Syslog", {"type"=>"syslog", "port"=>10000}]=>[{"thread_id"=>22, "name"=>"<syslog", "current_call"=>"[...]/vendor/bundle/jruby/1.9/gems/logstash-input-syslog-2.0.2/lib/logstash/inputs/syslog.rb:106:in `join'"}]}}
{:timestamp=>"2016-02-10T18:16:07.958000+0000", :message=>"The shutdown process appears to be stalled due to busy or blocked plugins. Check the logs for more information.", :level=>:error}
{:timestamp=>"2016-02-10T18:16:12.934000+0000", :level=>:warn, "INFLIGHT_EVENT_COUNT"=>{"total"=>0}, "STALLING_THREADS"=>{["LogStash::Inputs::Kafka", {"zk_connect"=>"zookeeper-1:2181,zookeeper-2:2181,zookeeper-3:2181", "topic_id"=>"staging_logs", "consumer_threads"=>1, "consumer_restart_sleep_ms"=>100, "decorate_events"=>"true", "type"=>"staging_logs"}]=>[{"thread_id"=>20, "name"=>"<kafka", "current_call"=>"[...]/vendor/bundle/jruby/1.9/gems/logstash-input-kafka-2.0.4/lib/logstash/inputs/kafka.rb:144:in `pop'"}], ["LogStash::Inputs::Kafka", {"zk_connect"=>"zookeeper-1:2181,zookeeper-2:2181,zookeeper-3:2181", "topic_id"=>"staging_exchange_logs", "consumer_threads"=>1, "consumer_restart_sleep_ms"=>100, "decorate_events"=>"true", "type"=>"staging_ex_logs"}]=>[{"thread_id"=>21, "name"=>"<kafka", "current_call"=>"[...]/vendor/bundle/jruby/1.9/gems/logstash-input-kafka-2.0.4/lib/logstash/inputs/kafka.rb:144:in `pop'"}], ["LogStash::Inputs::Syslog", {"type"=>"syslog", "port"=>10000}]=>[{"thread_id"=>22, "name"=>"<syslog", "current_call"=>"[...]/vendor/bundle/jruby/1.9/gems/logstash-input-syslog-2.0.2/lib/logstash/inputs/syslog.rb:106:in `join'"}]}}
{:timestamp=>"2016-02-10T18:16:17.935000+0000", :level=>:warn, "INFLIGHT_EVENT_COUNT"=>{"total"=>0}, "STALLING_THREADS"=>{["LogStash::Inputs::Kafka", {"zk_connect"=>"zookeeper-1:2181,zookeeper-2:2181,zookeeper-3:2181", "topic_id"=>"staging_logs", "consumer_threads"=>1, "consumer_restart_sleep_ms"=>100, "decorate_events"=>"true", "type"=>"staging_logs"}]=>[{"thread_id"=>20, "name"=>"<kafka", "current_call"=>"[...]/vendor/bundle/jruby/1.9/gems/logstash-input-kafka-2.0.4/lib/logstash/inputs/kafka.rb:144:in `pop'"}], ["LogStash::Inputs::Kafka", {"zk_connect"=>"zookeeper-1:2181,zookeeper-2:2181,zookeeper-3:2181", "topic_id"=>"staging_exchange_logs", "consumer_threads"=>1, "consumer_restart_sleep_ms"=>100, "decorate_events"=>"true", "type"=>"staging_ex_logs"}]=>[{"thread_id"=>21, "name"=>"<kafka", "current_call"=>"[...]/vendor/bundle/jruby/1.9/gems/logstash-input-kafka-2.0.4/lib/logstash/inputs/kafka.rb:144:in `pop'"}], ["LogStash::Inputs::Syslog", {"type"=>"syslog", "port"=>10000}]=>[{"thread_id"=>22, "name"=>"<syslog", "current_call"=>"[...]/vendor/bundle/jruby/1.9/gems/logstash-input-syslog-2.0.2/lib/logstash/inputs/syslog.rb:106:in `join'"}]}}
{:timestamp=>"2016-02-10T18:16:17.937000+0000", :message=>"The shutdown process appears to be stalled due to busy or blocked plugins. Check the logs for more information.", :level=>:error}
{:timestamp=>"2016-02-10T18:16:22.934000+0000", :level=>:warn, "INFLIGHT_EVENT_COUNT"=>{"total"=>0}, "STALLING_THREADS"=>{["LogStash::Inputs::Syslog", {"type"=>"syslog", "port"=>10000}]=>[{"thread_id"=>22, "name"=>"<syslog", "current_call"=>"[...]/vendor/bundle/jruby/1.9/gems/logstash-input-syslog-2.0.2/lib/logstash/inputs/syslog.rb:106:in `join'"}]}}
{:timestamp=>"2016-02-10T18:16:27.936000+0000", :level=>:warn, "INFLIGHT_EVENT_COUNT"=>{"input_to_filter"=>1, "total"=>1}, "STALLING_THREADS"=>{"other"=>[{"thread_id"=>80, "name"=>">output", "current_call"=>"[...]/vendor/bundle/jruby/1.9/gems/logstash-output-elasticsearch-2.4.1-java/lib/logstash/outputs/elasticsearch/buffer.rb:51:in `join'"}]}}
{:timestamp=>"2016-02-10T18:17:30.465000+0000", :message=>"kafka client threw exception, restarting", :exception=>kafka.common.ConsumerRebalanceFailedException: logstash_logstash.c.rapid-depot-817.internal-1455128240517-d3980226 can't rebalance after 4 retries, :level=>:warn}
{:timestamp=>"2016-02-10T18:17:30.488000+0000", :message=>"kafka client threw exception, restarting", :exception=>kafka.common.ConsumerRebalanceFailedException: logstash_logstash.c.rapid-depot-817.internal-1455128240515-62c925a7 can't rebalance after 4 retries, :level=>:warn}
{:timestamp=>"2016-02-10T18:17:30.498000+0000", :message=>"kafka client threw exception, restarting", :exception=>kafka.common.ConsumerRebalanceFailedException: logstash_logstash.c.rapid-depot-817.internal-1455128240517-a4fec7e7 can't rebalance after 4 retries, :level=>:warn}
{:timestamp=>"2016-02-10T18:17:30.501000+0000", :message=>"kafka client threw exception, restarting", :exception=>kafka.common.ConsumerRebalanceFailedException: logstash_logstash.c.rapid-depot-817.internal-1455128240517-183843e1 can't rebalance after 4 retries, :level=>:warn}
{:timestamp=>"2016-02-10T18:17:30.601000+0000", :message=>"kafka client threw exception, restarting", :exception=>kafka.common.ConsumerRebalanceFailedException: logstash_logstash.c.rapid-depot-817.internal-1455128240598-6c90460f can't rebalance after 4 retries, :level=>:warn}
{:timestamp=>"2016-02-10T18:17:30.603000+0000", :message=>"kafka client threw exception, restarting", :exception=>kafka.common.ConsumerRebalanceFailedException: logstash_logstash.c.rapid-depot-817.internal-1455128240492-69a5fb05 can't rebalance after 4 retries, :level=>:warn}
Log times are actually in the JSON payload, so when a catch-up happens we will create insertions with the old timestamps; I did that deliberately so I can re-index and re-process at any time. I can assure you that this used to work while I had logstash 1.5.
filter {
  date {
    match => [ "timestampMs", "UNIX_MS" ]
  }
}
Looking at your consumer offsets from above, it looks like you have two different logstash-input-kafka inputs running on the same group.
0f3 appears to be fine; it has no offset lag. 9b2 is the non-working one. Now to explain your gap, or non-resumption, problem: it appears that you have two independent consumers running on the same system (check your processes), each running with the default group id of logstash. One of the processes may be consuming without you realizing it, advancing the last offset, so when you stop and start it looks like stuff is missing. The stuck process has the missing logs in its lag, so once you kill it you should see the rest of the logs. The other was working on its half, so there is a chance it simply had the up-to-date logs in that partition. Either way, kill those other consumers and you should see the right behavior. BTW, I think restart on error is fine to use.
I have exactly one logstash process and when I restart it I have exactly one running.
The Occam's razor explanation is that the logstash Kafka input is opening two inputs for the same group.
Could you check your consumer instance owners again? If just one group is acting on the topic, you should see just one owner listed. If there are still two, you might have a hung or background process. Try `ps -ef | grep java` to look for others.
# ps -Aef |grep java
logstash 5487 1 4 18:17 ? 00:05:56 /usr/bin/java -XX:+UseParNewGC -XX:+UseConcMarkSweepGC -Djava.awt.headless=true -XX:CMSInitiatingOccupancyFraction=75 -XX:+UseCMSInitiatingOccupancyOnly -XX:+HeapDumpOnOutOfMemoryError -Djava.io.tmpdir=/var/lib/logstash -Xmx2g -Xss2048k -Djffi.boot.library.path=/opt/logstash/vendor/jruby/lib/jni -XX:+UseParNewGC -XX:+UseConcMarkSweepGC -Djava.awt.headless=true -XX:CMSInitiatingOccupancyFraction=75 -XX:+UseCMSInitiatingOccupancyOnly -XX:+HeapDumpOnOutOfMemoryError -Djava.io.tmpdir=/var/lib/logstash -XX:HeapDumpPath=/opt/logstash/heapdump.hprof -Xbootclasspath/a:/opt/logstash/vendor/jruby/lib/jruby.jar -classpath : -Djruby.home=/opt/logstash/vendor/jruby -Djruby.lib=/opt/logstash/vendor/jruby/lib -Djruby.script=jruby -Djruby.shell=/bin/sh org.jruby.Main --1.9 /opt/logstash/lib/bootstrap/environment.rb logstash/runner.rb agent -f /etc/logstash/conf.d -l /var/log/logstash/logstash.log
root 8600 4043 0 20:18 pts/0 00:00:00 grep java
I would still check the consumer instance owners again; there isn't much else to explain what you are seeing. If there is just one owner now, you should see the lag count decreasing and the logs backfilling.
Here are the instance owners for two different topics read by the same logstash.
Are you expecting instance owners to be different across topics?
That looks better than the first one and is what I would expect. Because you have two different Kafka inputs, one per topic, the owner threads will be named differently. You could use a whitelist filter to have one thread handle them all; `.*_logs` would catch both.
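As a sketch, a single input consuming both topics via a topic whitelist might look like this. The `white_list` option (a regex topic filter in logstash-input-kafka 2.x) and the single shared `type` are assumptions here; the ZooKeeper string is taken from the logs in this thread:

```
input {
  kafka {
    zk_connect => "zookeeper-1:2181,zookeeper-2:2181,zookeeper-3:2181"
    # Instead of one topic_id per input, match every *_logs topic
    # with a single consumer.
    white_list => ".*_logs"
    decorate_events => true
  }
}
```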
To ensure that everything is working as you expect (i.e., resuming correctly), I'd shut down logstash, check the offsets again to make sure lag is accumulating, and then start logstash back up. You should see it drain the accumulated lag.
Alright I am going to do that now.
sudo service logstash stop
{:timestamp=>"2016-02-10T21:20:47.316000+0000", :message=>"SIGTERM received. Shutting down the pipeline.", :level=>:warn}
{:timestamp=>"2016-02-10T21:20:52.333000+0000", :level=>:warn, "INFLIGHT_EVENT_COUNT"=>{"total"=>0}, "STALLING_THREADS"=>{["LogStash::Inputs::Kafka", {"zk_connect"=>"zookeeper-1:2181,zookeeper-2:2181,zookeeper-3:2181", "topic_id"=>"prod_logs", "consumer_threads"=>1, "consumer_restart_sleep_ms"=>100, "decorate_events"=>"true", "type"=>"prod_logs"}]=>[{"thread_id"=>18, "name"=>"<kafka", "current_call"=>"[...]/vendor/bundle/jruby/1.9/gems/logstash-input-kafka-2.0.4/lib/logstash/inputs/kafka.rb:144:in `pop'"}], ["LogStash::Inputs::Kafka", {"zk_connect"=>"zookeeper-1:2181,zookeeper-2:2181,zookeeper-3:2181", "topic_id"=>"prod_exchange_logs", "consumer_threads"=>1, "consumer_restart_on_error"=>"true", "decorate_events"=>"true", "type"=>"prod_ex_logs"}]=>[{"thread_id"=>19, "name"=>"<kafka", "current_call"=>"[...]/vendor/bundle/jruby/1.9/gems/logstash-input-kafka-2.0.4/lib/logstash/inputs/kafka.rb:144:in `pop'"}], ["LogStash::Inputs::Kafka", {"zk_connect"=>"zookeeper-1:2181,zookeeper-2:2181,zookeeper-3:2181", "topic_id"=>"staging_logs", "consumer_threads"=>1, "consumer_restart_sleep_ms"=>100, "decorate_events"=>"true", "type"=>"staging_logs"}]=>[{"thread_id"=>20, "name"=>"<kafka", "current_call"=>"[...]/vendor/bundle/jruby/1.9/gems/logstash-input-kafka-2.0.4/lib/logstash/inputs/kafka.rb:144:in `pop'"}], ["LogStash::Inputs::Kafka", {"zk_connect"=>"zookeeper-1:2181,zookeeper-2:2181,zookeeper-3:2181", "topic_id"=>"staging_exchange_logs", "consumer_threads"=>1, "consumer_restart_sleep_ms"=>100, "decorate_events"=>"true", "type"=>"staging_ex_logs"}]=>[{"thread_id"=>21, "name"=>"<kafka", "current_call"=>"[...]/vendor/bundle/jruby/1.9/gems/logstash-input-kafka-2.0.4/lib/logstash/inputs/kafka.rb:144:in `pop'"}], ["LogStash::Inputs::Syslog", {"type"=>"syslog", "port"=>10000}]=>[{"thread_id"=>22, "name"=>"<syslog", "current_call"=>"[...]/vendor/bundle/jruby/1.9/gems/logstash-input-syslog-2.0.2/lib/logstash/inputs/syslog.rb:106:in `join'"}]}}
{:timestamp=>"2016-02-10T21:20:52.339000+0000", :message=>"The shutdown process appears to be stalled due to busy or blocked plugins. Check the logs for more information.", :level=>:error}
{:timestamp=>"2016-02-10T21:20:57.328000+0000", :level=>:warn, "INFLIGHT_EVENT_COUNT"=>{"total"=>0}, "STALLING_THREADS"=>{["LogStash::Inputs::Kafka", {"zk_connect"=>"zookeeper-1:2181,zookeeper-2:2181,zookeeper-3:2181", "topic_id"=>"staging_logs", "consumer_threads"=>1, "consumer_restart_sleep_ms"=>100, "decorate_events"=>"true", "type"=>"staging_logs"}]=>[{"thread_id"=>20, "name"=>"<kafka", "current_call"=>"[...]/vendor/bundle/jruby/1.9/gems/logstash-input-kafka-2.0.4/lib/logstash/inputs/kafka.rb:144:in `pop'"}], ["LogStash::Inputs::Kafka", {"zk_connect"=>"zookeeper-1:2181,zookeeper-2:2181,zookeeper-3:2181", "topic_id"=>"staging_exchange_logs", "consumer_threads"=>1, "consumer_restart_sleep_ms"=>100, "decorate_events"=>"true", "type"=>"staging_ex_logs"}]=>[{"thread_id"=>21, "name"=>"<kafka", "current_call"=>"[...]/vendor/bundle/jruby/1.9/gems/logstash-input-kafka-2.0.4/lib/logstash/inputs/kafka.rb:144:in `pop'"}], ["LogStash::Inputs::Syslog", {"type"=>"syslog", "port"=>10000}]=>[{"thread_id"=>22, "name"=>"<syslog", "current_call"=>"[...]/vendor/bundle/jruby/1.9/gems/logstash-input-syslog-2.0.2/lib/logstash/inputs/syslog.rb:106:in `join'"}]}}
{:timestamp=>"2016-02-10T21:21:02.328000+0000", :level=>:warn, "INFLIGHT_EVENT_COUNT"=>{"total"=>0}, "STALLING_THREADS"=>{["LogStash::Inputs::Kafka", {"zk_connect"=>"zookeeper-1:2181,zookeeper-2:2181,zookeeper-3:2181", "topic_id"=>"staging_logs", "consumer_threads"=>1, "consumer_restart_sleep_ms"=>100, "decorate_events"=>"true", "type"=>"staging_logs"}]=>[{"thread_id"=>20, "name"=>"<kafka", "current_call"=>"[...]/vendor/bundle/jruby/1.9/gems/logstash-input-kafka-2.0.4/lib/logstash/inputs/kafka.rb:144:in `pop'"}], ["LogStash::Inputs::Kafka", {"zk_connect"=>"zookeeper-1:2181,zookeeper-2:2181,zookeeper-3:2181", "topic_id"=>"staging_exchange_logs", "consumer_threads"=>1, "consumer_restart_sleep_ms"=>100, "decorate_events"=>"true", "type"=>"staging_ex_logs"}]=>[{"thread_id"=>21, "name"=>"<kafka", "current_call"=>"[...]/vendor/bundle/jruby/1.9/gems/logstash-input-kafka-2.0.4/lib/logstash/inputs/kafka.rb:144:in `pop'"}], ["LogStash::Inputs::Syslog", {"type"=>"syslog", "port"=>10000}]=>[{"thread_id"=>22, "name"=>"<syslog", "current_call"=>"[...]/vendor/bundle/jruby/1.9/gems/logstash-input-syslog-2.0.2/lib/logstash/inputs/syslog.rb:106:in `join'"}]}}
{:timestamp=>"2016-02-10T21:21:02.338000+0000", :message=>"The shutdown process appears to be stalled due to busy or blocked plugins. Check the logs for more information.", :level=>:error}
{:timestamp=>"2016-02-10T21:21:07.326000+0000", :level=>:warn, "INFLIGHT_EVENT_COUNT"=>{"total"=>0}, "STALLING_THREADS"=>{["LogStash::Inputs::Syslog", {"type"=>"syslog", "port"=>10000}]=>[{"thread_id"=>22, "name"=>"<syslog", "current_call"=>"[...]/vendor/bundle/jruby/1.9/gems/logstash-input-syslog-2.0.2/lib/logstash/inputs/syslog.rb:106:in `join'"}]}}
# ps -Aef | grep java
root 10282 4043 0 21:21 pts/0 00:00:00 grep java
Verified that lag had accumulated.
Now when I restart the instance, the lag keeps increasing (this is new behavior), and I can only see 5 of the topics.
sudo service logstash start
ps -Aef |grep java
logstash 10431 1 65 21:25 ? 00:00:20 /usr/bin/java -XX:+UseParNewGC -XX:+UseConcMarkSweepGC -Djava.awt.headless=true -XX:CMSInitiatingOccupancyFraction=75 -XX:+UseCMSInitiatingOccupancyOnly -XX:+HeapDumpOnOutOfMemoryError -Djava.io.tmpdir=/var/lib/logstash -Xmx2g -Xss2048k -Djffi.boot.library.path=/opt/logstash/vendor/jruby/lib/jni -XX:+UseParNewGC -XX:+UseConcMarkSweepGC -Djava.awt.headless=true -XX:CMSInitiatingOccupancyFraction=75 -XX:+UseCMSInitiatingOccupancyOnly -XX:+HeapDumpOnOutOfMemoryError -Djava.io.tmpdir=/var/lib/logstash -XX:HeapDumpPath=/opt/logstash/heapdump.hprof -Xbootclasspath/a:/opt/logstash/vendor/jruby/lib/jruby.jar -classpath : -Djruby.home=/opt/logstash/vendor/jruby -Djruby.lib=/opt/logstash/vendor/jruby/lib -Djruby.script=jruby -Djruby.shell=/bin/sh org.jruby.Main --1.9 /opt/logstash/lib/bootstrap/environment.rb logstash/runner.rb agent -f /etc/logstash/conf.d -l /var/log/logstash/logstash.log
root 10568 4043 0 21:26 pts/0 00:00:00 grep java
Now I did a
sudo service logstash restart
Logstash exited but didn't come back up.
{:timestamp=>"2016-02-10T21:31:49.546000+0000", :message=>"SIGTERM received. Shutting down the pipeline.", :level=>:warn}
{:timestamp=>"2016-02-10T21:31:54.562000+0000", :level=>:warn, "INFLIGHT_EVENT_COUNT"=>{"total"=>0}, "STALLING_THREADS"=>{["LogStash::Inputs::Kafka", {"zk_connect"=>"zookeeper-1:2181,zookeeper-2:2181,zookeeper-3:2181", "topic_id"=>"staging_exchange_logs", "consumer_threads"=>1, "consumer_restart_sleep_ms"=>100, "decorate_events"=>"true", "type"=>"staging_ex_logs"}]=>[{"thread_id"=>21, "name"=>"<kafka", "current_call"=>"[...]/vendor/bundle/jruby/1.9/gems/logstash-input-kafka-2.0.4/lib/logstash/inputs/kafka.rb:144:in `pop'"}], ["LogStash::Inputs::Syslog", {"type"=>"syslog", "port"=>10000}]=>[{"thread_id"=>22, "name"=>"<syslog", "current_call"=>"[...]/vendor/bundle/jruby/1.9/gems/logstash-input-syslog-2.0.2/lib/logstash/inputs/syslog.rb:106:in `join'"}]}}
I ran it again
sudo service logstash restart
{:timestamp=>"2016-02-10T21:32:55.478000+0000", :message=>"kafka client threw exception, restarting", :exception=>kafka.common.ConsumerRebalanceFailedException: logstash_logstash.c.rapid-depot-817.internal-1455139966180-37df52d4 can't rebalance after 4 retries, :level=>:warn}
{:timestamp=>"2016-02-10T21:32:55.481000+0000", :message=>"kafka client threw exception, restarting", :exception=>kafka.common.ConsumerRebalanceFailedException: logstash_logstash.c.rapid-depot-817.internal-1455139966137-3db7c519 can't rebalance after 4 retries, :level=>:warn}
{:timestamp=>"2016-02-10T21:32:55.485000+0000", :message=>"kafka client threw exception, restarting", :exception=>kafka.common.ConsumerRebalanceFailedException: logstash_logstash.c.rapid-depot-817.internal-1455139966117-f0fef61f can't rebalance after 4 retries, :level=>:warn}
{:timestamp=>"2016-02-10T21:32:55.497000+0000", :message=>"kafka client threw exception, restarting", :exception=>kafka.common.ConsumerRebalanceFailedException: logstash_logstash.c.rapid-depot-817.internal-1455139966128-1058a8de can't rebalance after 4 retries, :level=>:warn}
{:timestamp=>"2016-02-10T21:32:55.508000+0000", :message=>"kafka client threw exception, restarting", :exception=>kafka.common.ConsumerRebalanceFailedException: logstash_logstash.c.rapid-depot-817.internal-1455139966127-29be3184 can't rebalance after 4 retries, :level=>:warn}
{:timestamp=>"2016-02-10T21:32:55.501000+0000", :message=>"kafka client threw exception, restarting", :exception=>kafka.common.ConsumerRebalanceFailedException: logstash_logstash.c.rapid-depot-817.internal-1455139966137-ac6d391f can't rebalance after 4 retries, :level=>:warn}
It's back up, but it has lost all the logs.
I guess there is some behavioral difference between stop & start versus restart; stop and start don't seem to allow the consumer to continue.
Comparing the offsets, the logs are being consumed: as you can see, all the consumer offsets have increased. "Lag" even in the low 100s isn't that bad, and you will generally always see some lag, because a single thread is working across many partitions and you are always adding logs.
If you aren't seeing the logs in Elasticsearch in the timeline where you expect them, I'd guess the date filter isn't firing correctly. Take a close look at those logs, because that big bump on your time series sure looks like a backfilling bump, where all the logs are being stamped as they are hit in logstash and hence only appear when you run logstash.
You could try `tag_on_failure => true` to see if the match is failing.
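A sketch of the date filter from earlier in the thread with failure tagging enabled; the `_dateparsefailure` tag name is the date filter's conventional default, used here as an assumption, and you could search for that tag to find events whose timestamp failed to parse:

```
filter {
  date {
    match => [ "timestampMs", "UNIX_MS" ]
    # Tag events whose timestamp fails to parse so they are easy to find.
    tag_on_failure => [ "_dateparsefailure" ]
  }
}
```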
That would be my guess, because the consumer is pulling down the logs per what your metrics report.
I looked at the logs; the date matches the timestampMs field in the JSON. That bump is part of exchange-log traffic noise, so it is not actually the catch-up.
I can assure you that all the logs in between have been lost. We have a backup in a Docker container, and I can see the logs with those timestamps there but not in Elasticsearch.
If there were an indexing failure, the logstash logs would usually have that info, and I cannot see anything there.
This is highly undesirable; the whole reason we publish data to Kafka is so that consumers can be taken down and brought back up asynchronously. Is this a limitation of this plugin, or simply a configuration issue?
Also, where does the Kafka input save its position per partition?
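For what it's worth, the old high-level consumer that this plugin uses keeps committed offsets in ZooKeeper, not in Logstash itself, under `/consumers/<group>/offsets/<topic>/<partition>`. A way to inspect them against a live ZooKeeper — the group, topic, and partition number below are examples from this thread's config:

```
# Connect to the ensemble, then browse the offset nodes.
bin/zkCli.sh -server zookeeper-1:2181
ls /consumers/logstash/offsets/prod_logs
get /consumers/logstash/offsets/prod_logs/0
```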
I am using the latest version of the plugin with the following kafka config