terascope / teraslice

Scalable data processing pipelines in JavaScript
https://terascope.github.io/teraslice/
Apache License 2.0

Fix long shutdown times for Teraslice jobs #2106

Open godber opened 4 years ago

godber commented 4 years ago

This issue (kind of) replaces:

https://github.com/terascope/teraslice/issues/942

We have jobs (mostly, if not all, kafka reader jobs) which take 5 minutes to shut down. I attempted to fix that with a controlled job shutdown here:

https://github.com/terascope/teraslice/pull/2074

But @macgyver603 has confirmed that the problem still exists. We're going to have to look at the kafka asset now. I don't have any ideas beyond that.

godber commented 4 years ago

The current working hypothesis is that a rebalance comes in while line 62 (linked below) is trying to disconnect, resulting in an error that is lost and prevents the disconnect from completing:

https://github.com/terascope/kafka-assets/blob/3ea53de6d6c08c799b279904f203ce40da9ac6a4/asset/src/_kafka_clients/base-client.ts#L59-L66

So how should we retry this? There's a `_try` helper in that client, but I don't see pRetry in use here.
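
For illustration, a retried disconnect could look roughly like this (a sketch only, wrapping node-rdkafka's callback-style `disconnect()` with p-retry; the wrapper names here are hypothetical, not the actual base-client code):

```typescript
import pRetry from 'p-retry';

// Hypothetical shapes -- the real base-client wraps node-rdkafka differently.
type DisconnectableClient = { disconnect: (cb: (err: Error | null) => void) => void };
type Logger = { warn: (...args: unknown[]) => void };

// Promisify node-rdkafka's callback-style disconnect().
function disconnectOnce(client: DisconnectableClient): Promise<void> {
    return new Promise((resolve, reject) => {
        client.disconnect((err) => (err ? reject(err) : resolve()));
    });
}

// Retry the disconnect a few times so an error raised by a rebalance
// mid-shutdown doesn't leave the consumer connected (and the worker hanging).
async function disconnectWithRetry(client: DisconnectableClient, logger: Logger): Promise<void> {
    await pRetry(() => disconnectOnce(client), {
        retries: 3,
        onFailedAttempt(err) {
            logger.warn(err, 'kafka disconnect failed, retrying...');
        },
    });
}
```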

peterdemartini commented 4 years ago

We should probably just log the error and not propagate it.
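
A rough sketch of that approach (hypothetical wrapper, not the actual base-client code): catch the error from the disconnect, log it, and let shutdown continue.

```typescript
// Sketch of the log-and-continue approach: a disconnect failure during
// shutdown is logged but no longer propagated, so the worker can finish
// shutting down instead of hanging on the error.
type Logger = { error: (...args: unknown[]) => void };

async function safeDisconnect(disconnect: () => Promise<void>, logger: Logger): Promise<void> {
    try {
        await disconnect();
    } catch (err) {
        logger.error(err, 'error during kafka client disconnect, ignoring so shutdown can proceed');
    }
}
```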

godber commented 4 years ago

It's possible this node-rdkafka issue is causing our issue here:

https://github.com/edenhill/librdkafka/issues/2615

We're going to build a new release of Teraslice that uses

https://github.com/edenhill/librdkafka/releases/tag/v1.4.0

and also updates the Node 12 version (for a memory leak issue not yet reported on GitHub).
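
As a sanity check after the rebuild, the bundled librdkafka version can be read off the node-rdkafka module at runtime (a small sketch, assuming it is run inside the new Teraslice image):

```typescript
// Sanity check (sketch): confirm the rebuilt image bundles the expected
// librdkafka -- node-rdkafka exposes the compiled-in version as a string.
const Kafka = require('node-rdkafka');

console.log('librdkafka version:', Kafka.librdkafkaVersion); // expect 1.4.0 after the upgrade
console.log('node version:', process.version);               // confirms the Node 12 bump as well
```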

peterdemartini commented 4 years ago

Ref: https://github.com/terascope/teraslice/pull/2107

godber commented 4 years ago

@peterdemartini, the upgraded node-rdkafka version still has the 5 minute shutdown problem AND has stuck workers right off the bat. Here's the individual partition lag on a high-volume topic:

[Screenshot: per-partition lag on a high-volume topic, 2020-08-06 7:03 PM]

Can you build a v0.69.3 that reverts the node-rdkafka change?

godber commented 7 months ago

I am still seeing this issue, but in cases where kafka isn't involved. It is happening in very simple jobs that go from data_generator to noop. If I have 200 workers, somewhere from 2 to 5 of the pods will sit in Terminating for several minutes after attempting to shut down safely, well after slice generation has completed. Reviewing the logs, we can see the termination messages and the slice completion, and then it sits there doing nothing.
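
For reference, the kind of job described here, data_generator feeding noop with 200 workers, looks roughly like this (a sketch; exact asset and operation names depend on the deployed standard assets):

```json
{
    "name": "data-generator-to-noop",
    "lifecycle": "persistent",
    "workers": 200,
    "assets": ["standard"],
    "operations": [
        { "_op": "data_generator", "size": 5000 },
        { "_op": "noop" }
    ]
}
```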