Closed OPhamster closed 5 years ago
@OPhamster wow that is a high quality bug report! Kudos!
First things first: can you try 1.3.0?
Let me reproduce that myself today.
Ah sure thing - will try it out and get back to you on this.
@OPhamster I created a simple POC based on the example app. Here's the controller and the logger. Please be aware that the heartbeat info is logged only in debug mode.
```ruby
class App < Karafka::App
  setup do |config|
    # Karafka will auto-discover kafka_hosts based on Zookeeper but we need it set manually
    # to run tests without running kafka and zookeeper
    config.kafka.seed_brokers = [ENV['KAFKA_HOST'] || 'kafka://127.0.0.1:9092']
    config.client_id = 'example_app'
    logger = ::Karafka::Instrumentation::Logger.new
    logger.level = 0
    config.logger = logger
  end

  monitor.subscribe('app.initialized') do
    WaterDrop.setup { |config| config.deliver = !Karafka.env.test? }
  end
end
```
```ruby
# frozen_string_literal: true

# Namespace for everything related to our small ping-pong game
module Pong
  # Catches the ping and uses PingResponder to respond on a pong topic
  class PingConsumer < ApplicationConsumer
    # We increase the pings counter and respond
    def consume
      # Capture the client in a local so the background thread can reference it
      cclient = client
      @heartbeat_thread ||= Thread.new do
        loop do
          begin
            cclient.trigger_heartbeat!
          rescue StandardError => e
            p e
          end
          p 'aaaaaaaaaaaaaa'
          sleep(1)
        end
      end
      # Simulate expensive processing that exceeds the session timeout
      sleep(60)
      counter = params_batch.last.payload['counter'] + 1
      # The initial ping needs to be triggered via the rake task
      respond_with(counter: counter)
    end
  end
end
```
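The POC above spawns a heartbeat thread that is guarded only by memoization and never stopped. A plain-Ruby sketch of the same idea as a stoppable periodic runner (the class name and wiring are ours, not part of the POC; in a consumer the callable would be something like `client.method(:trigger_heartbeat!)`):

```ruby
# A minimal, stoppable periodic task runner (plain Ruby, no Karafka
# dependency). The block is invoked once per interval until #stop.
class PeriodicTask
  def initialize(interval:, &block)
    @interval = interval
    @block    = block
    @running  = false
  end

  def start
    return if @running

    @running = true
    @thread = Thread.new do
      while @running
        begin
          @block.call
        rescue StandardError => e
          warn "periodic task failed: #{e.message}"
        end
        sleep(@interval)
      end
    end
  end

  def stop
    @running = false
    @thread&.join
  end
end
```

Unlike the `@a ||= Thread.new` guard, this lets the consumer shut the thread down cleanly on server stop instead of leaking it.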
Example output where you can see the heartbeat being sent every second:
D, [2019-09-12T13:41:59.862854 #32669] DEBUG -- : [heartbeat] Received response 149 from 127.0.0.1:9092
"aaaaaaaaaaaaaa"
D, [2019-09-12T13:41:59.909285 #32669] DEBUG -- : [[example_app_batched_group] {xml_data: 0; inline_batch_data: 0; callbacked_data: 0}:] [fetch] Received response 144 from 127.0.0.1:9092
D, [2019-09-12T13:41:59.909785 #32669] DEBUG -- : [[example_app_batched_group] {xml_data: 0; inline_batch_data: 0; callbacked_data: 0}:] Fetching batches
D, [2019-09-12T13:41:59.910410 #32669] DEBUG -- : [[example_app_batched_group] {xml_data: 0; inline_batch_data: 0; callbacked_data: 0}:] [fetch] Sending fetch API request 145 to 127.0.0.1:9092
D, [2019-09-12T13:41:59.911038 #32669] DEBUG -- : [[example_app_batched_group] {xml_data: 0; inline_batch_data: 0; callbacked_data: 0}:] [fetch] Waiting for response 145 from 127.0.0.1:9092
^CI, [2019-09-12T13:42:00.484529 #32669] INFO -- : Received SIGINT system signal
I, [2019-09-12T13:42:00.484676 #32669] INFO -- : Stopping Karafka server 32669
W, [2019-09-12T13:42:00.708872 #32669] WARN -- : [[example_app_async_pong] {ping: 0; pong: 0}:] Reached max fetcher queue size (10), sleeping 1s
D, [2019-09-12T13:42:00.863116 #32669] DEBUG -- : Sending heartbeat...
D, [2019-09-12T13:42:00.863201 #32669] DEBUG -- : [heartbeat] Sending heartbeat API request 150 to 127.0.0.1:9092
D, [2019-09-12T13:42:00.863290 #32669] DEBUG -- : [heartbeat] Waiting for response 150 from 127.0.0.1:9092
D, [2019-09-12T13:42:00.863755 #32669] DEBUG -- : [heartbeat] Received response 150 from 127.0.0.1:9092
"aaaaaaaaaaaaaa"
D, [2019-09-12T13:42:00.914092 #32669] DEBUG -- : [[example_app_batched_group] {xml_data: 0; inline_batch_data: 0; callbacked_data: 0}:] [fetch] Received response 145 from 127.0.0.1:9092
D, [2019-09-12T13:42:00.914570 #32669] DEBUG -- : [[example_app_batched_group] {xml_data: 0; inline_batch_data: 0; callbacked_data: 0}:] Handling fetcher command: stop
I, [2019-09-12T13:42:00.914700 #32669] INFO -- : [example_app_batched_group] {xml_data: 0; inline_batch_data: 0; callbacked_data: 0}: Fetcher thread exited.
I, [2019-09-12T13:42:00.914898 #32669] INFO -- : [[example_app_batched_group] {}:] Disconnecting broker 1001
D, [2019-09-12T13:42:00.914990 #32669] DEBUG -- : [[example_app_batched_group] {}:] Closing socket to 127.0.0.1:9092
I, [2019-09-12T13:42:00.915126 #32669] INFO -- : [[example_app_batched_group] {}:] Leaving group `example_app_batched_group`
D, [2019-09-12T13:42:00.915207 #32669] DEBUG -- : [[example_app_batched_group] {}:] [leave_group] Opening connection to 127.0.0.1:9092 with client id example_app...
D, [2019-09-12T13:42:00.915356 #32669] DEBUG -- : [[example_app_batched_group] {}:] [leave_group] Sending leave_group API request 1 to 127.0.0.1:9092
D, [2019-09-12T13:42:00.915456 #32669] DEBUG -- : [[example_app_batched_group] {}:] [leave_group] Waiting for response 1 from 127.0.0.1:9092
D, [2019-09-12T13:42:00.916666 #32669] DEBUG -- : [[example_app_batched_group] {}:] [leave_group] Received response 1 from 127.0.0.1:9092
W, [2019-09-12T13:42:01.709037 #32669] WARN -- : [[example_app_async_pong] {ping: 0; pong: 0}:] Reached max fetcher queue size (10), sleeping 1s
Ah - probably something wrong on my end. To be clear - this is using 1.3.0, right? I don't think I recognize `payload` as a callable method. If so, I'll update the code to use this version and get back on it. Thanks :+1:
@OPhamster yes. The example app is 1.3.0. The upgrade is fairly simple. Here you have the changelog + upgrade notes: https://mensfeld.pl/2019/09/karafka-framework-1-3-0-release-notes-ruby-kafka/
Please close once it works. If the problem still persists, ping me and I will do my best to help you out!
Closing due to lack of activity (and it seems to work).
Background
We use Karafka to process messages in consumers that interact with 2 separate databases. In certain cases, due to the way we handle the operations defined in the messages, a consumer gets stuck waiting for the database(s) to complete an expensive update. Ultimately the problem lies with each message not translating to exactly one operation, but changing this would be extremely time consuming. It's difficult to estimate the cost of a single message, as this changes over time. So as an alternative we decided to use the exposed consumer heartbeat to send heartbeats periodically, to signify that the consumer is alive even though it's stuck processing the same batch.
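The pattern described above, keeping the group coordinator satisfied while a long database operation blocks the consumer, can be sketched in plain Ruby as a block wrapper (the method name and `heartbeat:` callable are hypothetical; in Karafka the callable would wrap the client's manual heartbeat trigger):

```ruby
# Run a long operation while a helper thread signals liveness every
# `interval` seconds. `heartbeat` is any callable; returns the block's value.
def with_periodic_heartbeat(interval: 5, heartbeat:)
  stop = false
  keeper = Thread.new do
    until stop
      heartbeat.call
      sleep(interval)
    end
  end
  yield
ensure
  stop = true
  keeper&.join
end
```

The `ensure` guarantees the keeper thread is stopped even when the wrapped database operation raises.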
Expected behavior
Expected the consumers to send heartbeats periodically in spite of the consumer not having finished processing the params_batch within the session_timeout, and not to be rebalanced every so often when an expensive message comes along.
Actual behavior
The consumer doesn't seem to send heartbeats when we trigger them manually (they do not show up in the logs). This causes the consumers to respond late and commit an offset + send a heartbeat after Kafka has already decided that the consumer is dead and triggered a rebalance, leaving the cluster stuck in a barrage of rebalance operations for some time.
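As a complementary mitigation, giving slow batches more headroom before the broker declares the consumer dead can reduce the rebalance storms. A sketch, assuming Karafka 1.3's ruby-kafka-backed `kafka` settings; the exact keys and the values shown are illustrative, so please verify them against your Karafka version:

```ruby
class App < Karafka::App
  setup do |config|
    config.kafka.seed_brokers = ['kafka://127.0.0.1:9092']
    config.client_id = 'example_app'
    # Allow longer processing before a rebalance is triggered (seconds).
    # These values are illustrative, not recommendations.
    config.kafka.session_timeout = 120
    config.kafka.heartbeat_interval = 10
  end
end
```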
Steps to reproduce the problem
The code given below is the asynchronous heartbeat controller that we've implemented to handle sending the heartbeats in a separate timed thread, ignoring whatever the state of the consumer is. We have only one instance of this controller start up when the Karafka server starts.
Your setup details