Describe the bug
When our Kafka cluster is under load, we sometimes see this error from our kafka.js consumer:
KafkaJSSASLAuthenticationError: SASL SCRAM SHA512 authentication failed: Request SaslAuthenticate(key: 36, version: 1) timed out
at SCRAM.authenticate (/app/node_modules/kafkajs/src/broker/saslAuthenticator/scram.js:158:21)
at async Object.authenticate (/app/node_modules/kafkajs/src/broker/saslAuthenticator/scram512.js:6:31)
at async SASLAuthenticator.authenticate (/app/node_modules/kafkajs/src/broker/saslAuthenticator/index.js:73:5)
at async /app/node_modules/kafkajs/src/network/connection.js:139:9
at async Connection.authenticate (/app/node_modules/kafkajs/src/network/connection.js:315:5)
at async Broker.connect (/app/node_modules/kafkajs/src/broker/index.js:111:7)
at async /app/node_modules/kafkajs/src/cluster/brokerPool.js:319:9
at async BrokerPool.findBroker (/app/node_modules/kafkajs/src/cluster/brokerPool.js:257:5)
at async BrokerPool.withBroker (/app/node_modules/kafkajs/src/cluster/brokerPool.js:274:22)
at async Cluster.findGroupCoordinatorMetadata (/app/node_modules/kafkajs/src/cluster/index.js:385:28)
at async /app/node_modules/kafkajs/src/cluster/index.js:346:33
at async ConsumerGroup.[private:ConsumerGroup:join] (/app/node_modules/kafkajs/src/consumer/consumerGroup.js:167:24)
at async /app/node_modules/kafkajs/src/consumer/consumerGroup.js:335:9
at async Runner.start (/app/node_modules/kafkajs/src/consumer/runner.js:84:7)
at async start (/app/node_modules/kafkajs/src/consumer/index.js:243:7)
at async Object.run (/app/node_modules/kafkajs/src/consumer/index.js:304:5)
This error appears to be classified as a non-retriable error: the CRASH event is emitted from the consumer with the restart attribute set to false in the event payload. This is unexpected, because this kind of error should be recoverable by retrying the connection (in the hope that the load spike on the broker has subsided and that authentication won't time out again).
My guess is that this stems from the fact that instances of KafkaJSSASLAuthenticationError are always classified as non-retriable (the class inherits from KafkaJSNonRetriableError here). I suspect that timeouts during SASL authentication should be classified differently: they are retriable by nature, as opposed to cases where the credentials are rejected.
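To illustrate what I mean, here is a reduced sketch of the hierarchy in question (class names match kafkajs, but the bodies are simplified reproductions for illustration, not the library source):

```javascript
// Reduced sketch of the relevant kafkajs error classes (not the library source).
// KafkaJSNonRetriableError marks itself as non-retriable, and
// KafkaJSSASLAuthenticationError inherits that flag unconditionally --
// even when the underlying failure is a timeout rather than bad credentials.
class KafkaJSError extends Error {
  constructor(message, { retriable = true } = {}) {
    super(message);
    this.name = 'KafkaJSError';
    this.retriable = retriable;
  }
}

class KafkaJSNonRetriableError extends KafkaJSError {
  constructor(message) {
    super(message, { retriable: false });
    this.name = 'KafkaJSNonRetriableError';
  }
}

class KafkaJSSASLAuthenticationError extends KafkaJSNonRetriableError {
  constructor(message) {
    super(message);
    this.name = 'KafkaJSSASLAuthenticationError';
  }
}

const err = new KafkaJSSASLAuthenticationError('SASL authentication timed out');
console.log(err.retriable); // false, regardless of the failure cause
```

Because the retriable flag is fixed by the class hierarchy, there is no way for the retry machinery to distinguish a transient timeout from a hard credential rejection.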
To Reproduce
Our Aiven Kafka cluster hits high load during rebalancing events or sustained high message rates; the issue is most frequent when CPU is at or close to 100% on all brokers.
When load is high and we instantiate a new consumer client, we sometimes see the above error logged and the consumer is unable to start consuming.
Expected behavior
The Kafka consumer client should retry connecting to the broker when it hits a timeout during SASL authentication. I understand that kafka.js has retry logic with a maximum number of attempts; it makes sense for the client to give up reconnecting after that maximum is exhausted.
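For reference, this is the kind of retry policy we would expect to govern those reconnection attempts (the option names match the kafkajs `retry` configuration; the values, client id, and broker address below are illustrative, not our production settings):

```javascript
const { Kafka } = require('kafkajs');

// Illustrative client configuration. We would expect an authentication
// timeout to be retried under this policy until `retries` is exhausted,
// the same way other transient connection errors are.
const kafka = new Kafka({
  clientId: 'example-consumer',   // hypothetical client id
  brokers: ['broker-1:9092'],     // hypothetical broker address
  ssl: true,
  sasl: {
    mechanism: 'scram-sha-512',
    username: process.env.KAFKA_USERNAME,
    password: process.env.KAFKA_PASSWORD,
  },
  retry: {
    initialRetryTime: 300, // ms before the first retry
    multiplier: 2,         // exponential backoff between attempts
    retries: 8,            // give up after this many attempts
  },
});
```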
Observed behavior
The consumer emits a crash event with the restart attribute set to false. The consumer does not try to reconnect automatically.
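As a stopgap we are considering handling the crash event ourselves and restarting the consumer when the error looks like an authentication timeout. A rough sketch (the `isRetriableAuthTimeout` helper is our own heuristic based on a string match of the error message, not a kafkajs API, so it is fragile by design):

```javascript
// Our own heuristic to decide whether a crash was a transient SASL timeout
// rather than a credential rejection (string match on the message --
// fragile, illustration only).
function isRetriableAuthTimeout(error) {
  return (
    error.name === 'KafkaJSSASLAuthenticationError' &&
    /timed out/i.test(error.message)
  );
}

// Wiring it up: kafkajs will not restart the consumer itself when the
// CRASH payload has restart === false, so we restart manually.
// (handleMessage stands in for our message handler.)
//
// consumer.on(consumer.events.CRASH, async ({ payload: { error, restart } }) => {
//   if (!restart && isRetriableAuthTimeout(error)) {
//     await consumer.disconnect();
//     await consumer.connect();
//     await consumer.run({ eachMessage: handleMessage });
//   }
// });

const sample = new Error(
  'SASL SCRAM SHA512 authentication failed: Request SaslAuthenticate(key: 36, version: 1) timed out'
);
sample.name = 'KafkaJSSASLAuthenticationError';
console.log(isRetriableAuthTimeout(sample)); // true
```

This works, but it duplicates retry logic that the library already has, which is why we think the classification should be fixed upstream.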
Environment:
OS: Debian 11 (bullseye)
KafkaJS version: 2.2.3
Kafka version: 3.2 (using Aiven Kafka with SASL authentication enabled)
NodeJS version: 16.19.1