tulios / kafkajs

A modern Apache Kafka client for node.js
https://kafka.js.org
MIT License

Timeout while acquiring lock (15 waiting locks) #1196

Closed · yuvital-render closed this issue 2 years ago

yuvital-render commented 3 years ago

Hi (-:

kafkajs@1.15.0

We are using KafkaJS; it's solid and works great, but sometimes it doesn't. It seems that during small traffic spikes (though not always), these errors are thrown when we try to send (produce) messages:

"KafkaJSLockTimeout Timeout while acquiring lock (15 waiting locks)"

When this happens, it lasts about 30 seconds, and many messages are dropped and never persisted:

[Screenshot: CleanShot 2021-09-29 at 09 03 10@2x]

The full error looks like this:

[Screenshot: CleanShot 2021-09-29 at 09 05 01@2x]

We run on Heroku with 3 Large dynos (in a Private Space), use Confluent's fully managed Kafka on a type:standard production cluster, and our app is Node with Express.

This is the code that connects to the cluster. It runs once at the start of each dyno, and only after connect() succeeds do we start listening for requests:

await kafkaProducer.connect();

// only after a successful connection do we start serving requests
app.listen(app.get("port"), function () {
  sendLog(`heroku dyno running ${process.env.HEROKU_RELEASE_VERSION} ${process.env.HEROKU_SLUG_COMMIT}`);
  console.log("Node app is running on port", app.get("port"));
});

This is the configuration of the producer:

import { Kafka } from 'kafkajs';

const kafka = new Kafka({
    brokers: ['pkc-lq8v7.eu-central-1.aws.confluent.cloud:9092'],
    clientId: 'lkc-9dgq0',
    ssl: true,
    sasl: { mechanism: 'plain', password: 'xxxxxx', username: 'xxxxxx' },
});

const producer = kafka.producer();

export const kafkaProducer = producer;

This is the code that sends the messages:

try {
    // serialize each message object into a JSON string payload
    messages = messages.map(m => ({ value: JSON.stringify(m) }));
    const res = await kafkaProducer.send({ topic, messages });
} catch (e) {
    logException(e, context);
}

The error shown at the top is caught in this catch block.

These errors happen on this topic in Confluent:

[Screenshot: CleanShot 2021-09-29 at 09 26 19@2x]

This topic has 2 partitions. I'm raising this because it may be related: Confluent's default is 6 partitions, but we chose 2.

So, are these errors related to a high burst of traffic, and would increasing the number of partitions help? If not, could it be the code that connects to the cluster? Maybe we should add some retry or other configuration we are missing here?
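For reference, the only retry-related knob we've found is the client-level retry option from the KafkaJS configuration docs; a sketch with illustrative values (we haven't tuned these, and they happen to match the documented defaults):

import { Kafka } from 'kafkajs';

const kafka = new Kafka({
    brokers: ['pkc-lq8v7.eu-central-1.aws.confluent.cloud:9092'],
    clientId: 'lkc-9dgq0',
    ssl: true,
    sasl: { mechanism: 'plain', password: 'xxxxxx', username: 'xxxxxx' },
    retry: {
        initialRetryTime: 300, // ms before the first retry (documented default)
        retries: 5,            // maximum retries per call (documented default)
    },
});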

Any help on the issue and directions for further investigation would be much appreciated! 🙏

yaronlevi commented 3 years ago

This is my post. Posted it by mistake from another account.

tulios commented 3 years ago

Hi @yaronlevi, connections are centrally managed, and there is a lock around connecting since authentication can require multiple requests. From your screenshot, it seems that connecting to a particular broker took a while, which caused the timeout. You can try to make the connection timeout a bit more forgiving, but this is an error that you ultimately have to handle. The lock timeout follows this formula:

const lockTimeout = 2 * this.connection.connectionTimeout + this.authenticationTimeout

https://github.com/tulios/kafkajs/blob/9b5ef7f7961fdb6716e2e90634b36abb81d24af9/src/broker/index.js#L64-L66

The default values are optimized for quick connections and smaller payloads, so you might have to adjust them for your setup. From what I've seen, you need higher connection and auth timeouts when using Kafka clusters outside of your cloud provider or network.
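For example, something along these lines, reusing the config from earlier in the thread (the 5000 ms values are illustrative, not a recommendation):

import { Kafka } from 'kafkajs';

const kafka = new Kafka({
    brokers: ['pkc-lq8v7.eu-central-1.aws.confluent.cloud:9092'],
    clientId: 'lkc-9dgq0',
    ssl: true,
    sasl: { mechanism: 'plain', password: 'xxxxxx', username: 'xxxxxx' },
    connectionTimeout: 5000,     // default is 1000 ms
    authenticationTimeout: 5000, // gives SASL authentication more headroom
});

// With the formula above: lockTimeout = 2 * 5000 + 5000 = 15000 ms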

yaronlevi commented 3 years ago

@tulios very interesting!

So I've increased both timeout parameters to 5000:

[Screenshot: CleanShot 2021-09-29 at 12 02 37@2x]

I'll keep you updated if it works. Thanks

femi-paystack commented 3 years ago

Did this work?

yaronlevi commented 3 years ago

@femi-paystack Yes! It seems to work. The last error was 2 weeks ago, which is exactly when I made the change:

[Screenshot: CleanShot 2021-10-14 at 18 51 18@2x]

Nevon commented 2 years ago

Closing since this seems to have just been an issue with connection timeouts. The next release will increase the default authentication timeout.

younsl commented 1 month ago

> @tulios very interesting!
>
> So I've increased both timeout parameters to 5000:
>
> [Screenshot: CleanShot 2021-09-29 at 12 02 37@2x]
>
> I'll keep you updated if it works. Thanks

It works well! Thanks a lot, @yaronlevi.

In kafkajs 2.2.3, try adding the connectionTimeout: 5000 and authenticationTimeout: 5000 settings.

The Kafka client pod's error message for the timeout:

KafkaJSLockTimeout: Timeout while acquiring lock (1865 waiting locks): "connect to broker <REDACTED>:9094"

The error KafkaJSLockTimeout: Timeout while acquiring lock (1865 waiting locks) indicates that the KafkaJS client took too long to acquire the lock needed for an operation (here, connecting to a broker) and timed out, with 1865 other requests waiting on the same lock.

lockTimeout calculation formula in kafkajs 2.2.3:

https://github.com/tulios/kafkajs/blob/dcee6971c4a739ebb02d9279f68155e3945c50f7/src/broker/index.js#L52-L54
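Assuming the linked formula is unchanged, setting both values to 5000 ms yields:

lockTimeout = 2 * connectionTimeout + authenticationTimeout
            = 2 * 5000 + 5000
            = 15000 ms

so each broker connection attempt gets up to 15 seconds to acquire the lock before KafkaJSLockTimeout is thrown.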

Solution:

  constructor() {
    const kafka = new Kafka({
      clientId: kafkaClientId,
      brokers: kafkaBrokers,
      ssl: kafkaSSL,
      logLevel: 2,
      connectionTimeout: 5000,     // newly added
      authenticationTimeout: 5000, // newly added
    });
    // ... redacted ...
  }

Increase these client-side timeout settings to give broker connection and authentication more time to complete.