godber opened this issue 4 years ago
The current working hypothesis on this issue is that a rebalance comes in while line 62 (below) is trying to disconnect, resulting in an error that is lost and prevents the disconnect from completing.
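As a rough sketch of that failure mode (not the actual asset code, and assuming node-rdkafka's callback-style consumer.disconnect()): if the disconnect callback fires with an error, or never fires at all because of the rebalance, nothing surfaces it and the worker hangs until the shutdown timeout.

```typescript
import { KafkaConsumer } from 'node-rdkafka';

// Hypothetical illustration of the hypothesis above: wrap the callback-style
// disconnect in a promise. If a rebalance is in flight when disconnect() is
// called, the callback can return an error (or never fire); if that rejection
// is then swallowed by a caller, the failure is invisible and the worker sits
// there until the shutdown timeout expires.
function disconnect(consumer: KafkaConsumer): Promise<void> {
    return new Promise((resolve, reject) => {
        consumer.disconnect((err) => {
            if (err) {
                reject(err);
                return;
            }
            resolve();
        });
    });
}
```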
So how should we retry this? There's a _try in that client there. I don't see pRetry in use here.
We should probably just log the error and not propagate it.
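A minimal sketch of both suggestions, assuming a promisified disconnect like the one sketched above, a bunyan-style logger, and the p-retry package (which may or may not be what that _try helper wraps): retry the disconnect a few times, and if it still fails, log the error and move on rather than propagating it.

```typescript
import pRetry from 'p-retry';

// Sketch only: retry the disconnect a few times, then log and swallow any
// remaining failure so it cannot block the rest of the worker shutdown.
async function safeDisconnect(
    disconnect: () => Promise<void>,
    logger: { warn: (err: unknown, msg: string) => void }
): Promise<void> {
    try {
        await pRetry(disconnect, { retries: 3 });
    } catch (err) {
        logger.warn(err, 'kafka consumer failed to disconnect cleanly');
    }
}
```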
It's possible this node-rdkafka issue is causing our issue here:
https://github.com/edenhill/librdkafka/issues/2615
We're going to build a new release of Teraslice that uses
https://github.com/edenhill/librdkafka/releases/tag/v1.4.0
and also updates the Node 12 version (for a memory-leak issue not yet reported on GitHub).
@peterdemartini, the upgraded node-rdkafka version still has the 5m shutdown problem AND has stuck workers right off the bat. Here's the individual partition lag on a high-volume topic.
Can you build a v0.69.3 that reverts the node-rdkafka change?
I am still seeing this issue, but in cases where kafka isn't involved. This is happening in very simple jobs that are data_generator to noop. If I have 200 workers, somewhere from 2 to 5 of the pods will sit in Terminating for several minutes after attempting to shut down safely, well after slice generation has completed. Reviewing the logs, we can see the termination logs and the slice completion, then it sits there doing nothing.
This issue (kind of) replaces:
https://github.com/terascope/teraslice/issues/942
We have jobs (mostly, if not all, kafka reader jobs) which take 5 minutes to shut down. I attempted to fix that with a controlled job shutdown here:
https://github.com/terascope/teraslice/pull/2074
But @macgyver603 has confirmed that the problem still exists. We're going to have to look at the kafka asset now. I don't have any ideas beyond that.