tulios / kafkajs

A modern Apache Kafka client for node.js
https://kafka.js.org
MIT License

Unexpected consumer crash upon SASL authentication timeout #1550

Open pmalouin opened 1 year ago

pmalouin commented 1 year ago

Describe the bug

When our Kafka cluster is under load, we sometimes see this error from our kafka.js consumer:

KafkaJSSASLAuthenticationError: SASL SCRAM SHA512 authentication failed: Request SaslAuthenticate(key: 36, version: 1) timed out
    at SCRAM.authenticate (/app/node_modules/kafkajs/src/broker/saslAuthenticator/scram.js:158:21)
    at async Object.authenticate (/app/node_modules/kafkajs/src/broker/saslAuthenticator/scram512.js:6:31)
    at async SASLAuthenticator.authenticate (/app/node_modules/kafkajs/src/broker/saslAuthenticator/index.js:73:5)
    at async /app/node_modules/kafkajs/src/network/connection.js:139:9
    at async Connection.authenticate (/app/node_modules/kafkajs/src/network/connection.js:315:5)
    at async Broker.connect (/app/node_modules/kafkajs/src/broker/index.js:111:7)
    at async /app/node_modules/kafkajs/src/cluster/brokerPool.js:319:9
    at async BrokerPool.findBroker (/app/node_modules/kafkajs/src/cluster/brokerPool.js:257:5)
    at async BrokerPool.withBroker (/app/node_modules/kafkajs/src/cluster/brokerPool.js:274:22)
    at async Cluster.findGroupCoordinatorMetadata (/app/node_modules/kafkajs/src/cluster/index.js:385:28)
    at async /app/node_modules/kafkajs/src/cluster/index.js:346:33
    at async ConsumerGroup.[private:ConsumerGroup:join] (/app/node_modules/kafkajs/src/consumer/consumerGroup.js:167:24)
    at async /app/node_modules/kafkajs/src/consumer/consumerGroup.js:335:9
    at async Runner.start (/app/node_modules/kafkajs/src/consumer/runner.js:84:7)
    at async start (/app/node_modules/kafkajs/src/consumer/index.js:243:7)
    at async Object.run (/app/node_modules/kafkajs/src/consumer/index.js:304:5)

This error appears to be classified as non-retriable: the consumer emits the CRASH event with the restart attribute set to false in the event payload. This is unexpected, because this kind of error should be recoverable by retrying the connection, in the hope that the load spike on the broker has subsided and authentication will not time out again.

My guess is that this issue stems from the fact that instances of KafkaJSSASLAuthenticationError are always classified as KafkaJSNonRetriableError, since the former inherits from the latter. I suspect that timeouts during SASL authentication should be classified differently: they are retriable by nature, unlike cases where the credentials are rejected.
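To illustrate the distinction suggested above, here is a minimal sketch of a check that tells an authentication timeout apart from other SASL failures. The helper name isSaslAuthTimeout is hypothetical and not part of kafkajs. It assumes that the error's name property is 'KafkaJSSASLAuthenticationError' and that a timeout can only be recognized by the "timed out" text in the message, as in the stack trace above. Matching on message text is brittle, so treat this as an illustration rather than a proposed fix.

```javascript
// Hypothetical helper, not part of kafkajs.
// Assumption: a SASL authentication timeout is identifiable only by the
// error name plus the "timed out" text seen in the stack trace above.
function isSaslAuthTimeout(error) {
  // Anything that is not a SASL authentication error is out of scope.
  if (!error || error.name !== 'KafkaJSSASLAuthenticationError') {
    return false
  }
  // Rejected credentials produce a different message and stay non-retriable.
  return /timed out/i.test(String(error.message))
}
```

A fix inside the library would more likely carry this information on the error itself instead of parsing its message.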

To Reproduce

Our Aiven Kafka cluster comes under high load during rebalancing events or when the message rate is high. The issue is most frequent when CPU usage is at or close to 100% on all brokers.

When load is high and we instantiate a new consumer client, we sometimes see the above error logged, and the consumer is unable to start consuming.

Expected behavior

The consumer should retry connecting to the broker when it hits a timeout during SASL authentication. I understand that kafka.js has retry logic with a maximum number of attempts, and it makes sense for the client to give up reconnecting once that maximum is reached.

Observed behavior

The consumer emits a crash event with the restart attribute set to false. The consumer does not try to reconnect automatically.
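For reference, the observed behavior can be handled from application code by listening for the crash event. The sketch below is one possible approach, not something the maintainers have recommended. The function restartOnSaslTimeout, its options maxRestarts and delayMs, and the caller-supplied restart callback are all hypothetical. It assumes the event payload exposes error and restart as described above, and it leaves open what restarting actually requires (reconnecting, re-subscribing, calling run again).

```javascript
// Hypothetical workaround sketch, not an official kafkajs recommendation.
// `consumer` is expected to be a kafkajs consumer; `restart` is a
// caller-supplied async function that brings the consumer back up.
function restartOnSaslTimeout(consumer, restart, { maxRestarts = 5, delayMs = 5000 } = {}) {
  let restarts = 0

  // consumer.on returns a function that removes the listener again.
  return consumer.on(consumer.events.CRASH, event => {
    const { error, restart: willRestart } = event.payload

    // When `restart` is true the consumer restarts on its own.
    if (willRestart) return

    // Assumption: a SASL timeout is recognizable by error name plus message text.
    const isSaslTimeout =
      Boolean(error) &&
      error.name === 'KafkaJSSASLAuthenticationError' &&
      /timed out/i.test(String(error.message))

    // Give up on other errors, or once the restart budget is spent.
    if (!isSaslTimeout || restarts >= maxRestarts) return

    restarts += 1
    setTimeout(() => {
      restart().catch(err => console.error('Manual consumer restart failed', err))
    }, delayMs)
  })
}
```

The restart budget mirrors the expectation stated earlier that the client should eventually give up instead of retrying forever.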

Environment:


ShradhaKhard commented 3 months ago

Do we have any recommendation on how to avoid this issue?

RotemDoar commented 3 months ago

Is there any workaround until this issue is solved?