tulios / kafkajs

A modern Apache Kafka client for node.js
https://kafka.js.org
MIT License
3.73k stars 526 forks source link

Intermittent Connection Timeouts with Lambda Producer using IAM Auth #1435

Open Borduhh opened 2 years ago

Borduhh commented 2 years ago

Describe the bug We see inconsistent connection/authentication timeouts when using AWS Lambda Producers via KafkaJS and MSK.

To Reproduce

NOTE: We can consistently produce the error in our technical stack but cannot produce the error using a minimum repo

  1. Deploy a Lambda Producer triggered by API Gateway and MSK via CDK
  2. Invoke the route a few times

Example Lambda Code:

export const handler = async (
  event: Event,
  context: Context
): Promise<AWSLambda.APIGatewayProxyResultV2> => {

  await context.kafkaProducer.send({
    topic: process.env.TOPIC || '',
    messages: [{ value: buffer }],
  });

  return SuccessResponse({ statusCode: 204 });
};

export const main = middy(handler);

main
  .use(
    kafkaProducerMiddleware({
      clientId: 'network-admin.location.usage.requested',
      brokers: process.env.BROKER_CONNECTION_STRING?.split(',') || [],
      ssl: true,
      connectionTimeout: 10000,
      sasl: {
        mechanism: Type,
        authenticationProvider: awsIamAuthenticator(process.env.REGION || '', process.env.TTL),
      },
    })
  );

Example Kafka Middleware:

import middy from '@middy/core';
import { Kafka, KafkaConfig, Producer } from 'kafkajs';

type Options = KafkaConfig;

export type KafkaProducerContext = AWSLambda.Context & { kafkaProducer: Producer };

export const kafkaProducerMiddleware = (opts: Options): middy.MiddlewareObj => {
  const kafka = new Kafka(opts);
  const producer = kafka.producer();

  const before: middy.MiddlewareFn = async (request): Promise<void> => {
    /**
     * Kafkajs manages connections under the hood
     * so calling connect here doesn't hurt i.e creating endless connections
     */
    await producer.connect();

    Object.assign(request.context, { kafkaProducer: producer });
  };

  return {
    before,
    after,
  };
};

Expected behavior The producer should connect to the MSK cluster without an issue.

Observed behavior Lambda Logs:

START   RequestId: 77f245e6-58f6-4755-a4b2-829f9d52aa6a Version: $LATEST

2022-08-12T18:01:44.460Z    77f245e6-58f6-4755-a4b2-829f9d52aa6a    WARN    {"level":"WARN","timestamp":"2022-08-12T18:01:44.460Z","logger":"kafkajs","message":"KafkaJS   v2.0.0 switched default partitioner. To retain the same partitioning behavior   as in previous versions, create the producer with the option   \"createPartitioner: Partitioners.LegacyPartitioner\". See the   migration guide at https://kafka.js.org/docs/migration-guide-v2.0.0#producer-new-default-partitioner   for details. Silence this warning by setting the environment variable   \"KAFKAJS_NO_PARTITIONER_WARNING=1\""}

2022-08-12T18:01:54.531Z    77f245e6-58f6-4755-a4b2-829f9d52aa6a    ERROR   {"level":"ERROR","timestamp":"2022-08-12T18:01:54.531Z","logger":"kafkajs","message":"[Connection]   Connection timeout","broker":"b-1.niacincluster.9neyhl.c24.kafka.us-east-1.amazonaws.com:9098","clientId":"kafkajs"}

2022-08-12T18:01:54.533Z    77f245e6-58f6-4755-a4b2-829f9d52aa6a    ERROR   {"level":"ERROR","timestamp":"2022-08-12T18:01:54.533Z","logger":"kafkajs","message":"[BrokerPool]   Failed to connect to seed broker, trying another broker from the list:   Connection   timeout","retryCount":0,"retryTime":336}

2022-08-12T18:02:04.887Z    77f245e6-58f6-4755-a4b2-829f9d52aa6a    ERROR   {"level":"ERROR","timestamp":"2022-08-12T18:02:04.887Z","logger":"kafkajs","message":"[Connection]   Connection timeout","broker":"b-2.niacincluster.9neyhl.c24.kafka.us-east-1.amazonaws.com:9098","clientId":"kafkajs"}

2022-08-12T18:02:04.888Z    77f245e6-58f6-4755-a4b2-829f9d52aa6a    ERROR   {"level":"ERROR","timestamp":"2022-08-12T18:02:04.888Z","logger":"kafkajs","message":"[BrokerPool]   Failed to connect to seed broker, trying another broker from the list:   Connection   timeout","retryCount":1,"retryTime":764}

2022-08-12T18:02:06.404Z    77f245e6-58f6-4755-a4b2-829f9d52aa6a    ERROR   {"level":"ERROR","timestamp":"2022-08-12T18:02:06.404Z","logger":"kafkajs","message":"[Connection]   Connection error: Client network socket disconnected before secure TLS   connection was   established","broker":"b-2.niacincluster.9neyhl.c24.kafka.us-east-1.amazonaws.com:9098","clientId":"kafkajs","stack":"Error:   Client network socket disconnected before secure TLS connection was   established         at connResetException   (node:internal/errors:692:14)         at TLSSocket.onConnectEnd   (node:_tls_wrap:1587:19)         at TLSSocket.emit   (node:events:539:35)         at endReadableNT   (node:internal/streams/readable:1345:12)         at processTicksAndRejections   (node:internal/process/task_queues:83:21)"}

2022-08-12T18:02:06.558Z    77f245e6-58f6-4755-a4b2-829f9d52aa6a    ERROR   {"level":"ERROR","timestamp":"2022-08-12T18:02:06.558Z","logger":"kafkajs","message":"[Connection]   Connection error: Client network socket disconnected before secure TLS   connection was   established","broker":"b-1.niacincluster.9neyhl.c24.kafka.us-east-1.amazonaws.com:9098","clientId":"kafkajs","stack":"Error:   Client network socket disconnected before secure TLS connection was   established         at connResetException   (node:internal/errors:692:14)         at TLSSocket.onConnectEnd   (node:_tls_wrap:1587:19)         at TLSSocket.emit   (node:events:539:35)         at endReadableNT   (node:internal/streams/readable:1345:12)         at processTicksAndRejections   (node:internal/process/task_queues:83:21)"}

END   RequestId: 77f245e6-58f6-4755-a4b2-829f9d52aa6a

REPORT   RequestId: 77f245e6-58f6-4755-a4b2-829f9d52aa6a    Duration: 24035.68 ms   Billed   Duration: 24000 ms Memory Size: 128 MB Max Memory Used: 33 MB

2022-08-12T18:02:07.660Z   77f245e6-58f6-4755-a4b2-829f9d52aa6a Task timed out after 24.04 seconds

MSK Broker Logs:

[2022-08-12   18:30:01,759] INFO [SocketServer listenerType=ZK_BROKER, nodeId=1] Failed   authentication with /INTERNAL_IP (channelId=INTERNAL_IP-INTERNAL_IP-884)   ([71959561-3f92-4970-80ef-62bae25d6d80]: Internal error)   (org.apache.kafka.common.network.Selector)
--
[2022-08-12 18:30:03,300] INFO Kafka   version: 3.2.0 (org.apache.kafka.common.utils.AppInfoParser)
[2022-08-12 18:30:03,300] INFO Kafka   commitId: unknown (org.apache.kafka.common.utils.AppInfoParser)
[2022-08-12 18:30:03,300] INFO Kafka   startTimeMs: 1660329003300 (org.apache.kafka.common.utils.AppInfoParser)
[2022-08-12 18:30:03,705] INFO Kafka   version: 3.2.0 (org.apache.kafka.common.utils.AppInfoParser)
[2022-08-12 18:30:03,705] INFO Kafka   commitId: unknown (org.apache.kafka.common.utils.AppInfoParser)
[2022-08-12 18:30:03,705] INFO Kafka   startTimeMs: 1660329003705 (org.apache.kafka.common.utils.AppInfoParser)
[2022-08-12 18:30:08,193] INFO App info   kafka.admin.client for adminclient-246 unregistered   (org.apache.kafka.common.utils.AppInfoParser)
[2022-08-12 18:30:08,193] INFO Metrics   scheduler closed (org.apache.kafka.common.metrics.Metrics)
[2022-08-12 18:30:08,193] INFO Closing   reporter org.apache.kafka.common.metrics.JmxReporter   (org.apache.kafka.common.metrics.Metrics)
[2022-08-12 18:30:08,194] INFO Metrics   reporters closed (org.apache.kafka.common.metrics.Metrics)
[2022-08-12 18:30:08,196] INFO Metrics   scheduler closed (org.apache.kafka.common.metrics.Metrics)
[2022-08-12 18:30:08,196] INFO Closing   reporter org.apache.kafka.common.metrics.JmxReporter   (org.apache.kafka.common.metrics.Metrics)
[2022-08-12 18:30:08,196] INFO Metrics   reporters closed (org.apache.kafka.common.metrics.Metrics)
[2022-08-12 18:30:08,196] INFO App info   kafka.consumer for consumer-consumer-lag-246 unregistered   (org.apache.kafka.common.utils.AppInfoParser)
[2022-08-12 18:30:17,770] INFO Kafka   version: 3.2.0 (org.apache.kafka.common.utils.AppInfoParser)
[2022-08-12 18:30:17,770] INFO Kafka   commitId: unknown (org.apache.kafka.common.utils.AppInfoParser)
[2022-08-12 18:30:17,770] INFO Kafka   startTimeMs: 1660329017770 (org.apache.kafka.common.utils.AppInfoParser)
[2022-08-12 18:30:17,777] INFO Kafka   version: 3.2.0 (org.apache.kafka.common.utils.AppInfoParser)
[2022-08-12 18:30:17,778] INFO Kafka   commitId: unknown (org.apache.kafka.common.utils.AppInfoParser)
[2022-08-12 18:30:17,778] INFO Kafka   startTimeMs: 1660329017777 (org.apache.kafka.common.utils.AppInfoParser)
[2022-08-12 18:30:21,861] INFO App info   kafka.admin.client for adminclient-246 unregistered   (org.apache.kafka.common.utils.AppInfoParser)
[2022-08-12 18:30:21,861] INFO Metrics   scheduler closed (org.apache.kafka.common.metrics.Metrics)
[2022-08-12 18:30:21,862] INFO Closing   reporter org.apache.kafka.common.metrics.JmxReporter   (org.apache.kafka.common.metrics.Metrics)
[2022-08-12 18:30:21,862] INFO Metrics   reporters closed (org.apache.kafka.common.metrics.Metrics)
[2022-08-12 18:30:21,862] INFO Metrics   scheduler closed (org.apache.kafka.common.metrics.Metrics)
[2022-08-12 18:30:21,862] INFO Closing   reporter org.apache.kafka.common.metrics.JmxReporter   (org.apache.kafka.common.metrics.Metrics)
[2022-08-12 18:30:21,862] INFO Metrics   reporters closed (org.apache.kafka.common.metrics.Metrics)
[2022-08-12 18:30:21,863] INFO App info   kafka.consumer for consumer-consumer-lag-246 unregistered   (org.apache.kafka.common.utils.AppInfoParser)
[2022-08-12 18:30:30,696] INFO   [GroupCoordinator 2]: Dynamic member with unknown member id joins group   amazon.msk.canary.group.broker-2 in Empty state. Created a new member id   consumer-amazon.msk.canary.group.broker-2-250-7d92cb9c-d61b-429a-bd70-f5e3ddcd8bc5   and request the member to rejoin with this id.   (kafka.coordinator.group.GroupCoordinator)
[2022-08-12 18:30:30,697] INFO   [GroupCoordinator 2]: Preparing to rebalance group   amazon.msk.canary.group.broker-2 in state PreparingRebalance with old   generation 496 (__consumer_offsets-20) (reason: Adding new member   consumer-amazon.msk.canary.group.broker-2-250-7d92cb9c-d61b-429a-bd70-f5e3ddcd8bc5   with group instance id None; client reason: not provided)   (kafka.coordinator.group.GroupCoordinator)
[2022-08-12 18:30:33,359] INFO   [GroupCoordinator 1]: Dynamic member with unknown member id joins group   amazon.msk.canary.group.broker-1 in Empty state. Created a new member id   consumer-amazon.msk.canary.group.broker-1-249-67fa4890-cbf1-416f-b7be-b956392889ec   and request the member to rejoin with this id.   (kafka.coordinator.group.GroupCoordinator)
[2022-08-12 18:30:33,362] INFO   [GroupCoordinator 1]: Preparing to rebalance group   amazon.msk.canary.group.broker-1 in state PreparingRebalance with old   generation 496 (__consumer_offsets-21) (reason: Adding new member   consumer-amazon.msk.canary.group.broker-1-249-67fa4890-cbf1-416f-b7be-b956392889ec   with group instance id None; client reason: not provided)   (kafka.coordinator.group.GroupCoordinator)
[2022-08-12 18:30:33,697] INFO   [GroupCoordinator 2]: Stabilized group amazon.msk.canary.group.broker-2   generation 497 (__consumer_offsets-20) with 1 members   (kafka.coordinator.group.GroupCoordinator)
[2022-08-12 18:30:33,790] INFO   [GroupCoordinator 2]: Assignment received from leader   consumer-amazon.msk.canary.group.broker-2-250-7d92cb9c-d61b-429a-bd70-f5e3ddcd8bc5   for group amazon.msk.canary.group.broker-2 for generation 497. The group has   1 members, 0 of which are static. (kafka.coordinator.group.GroupCoordinator)
[2022-08-12 18:30:35,094] INFO   [GroupCoordinator 2]: Preparing to rebalance group   amazon.msk.canary.group.broker-2 in state PreparingRebalance with old   generation 497 (__consumer_offsets-20) (reason: Removing member   consumer-amazon.msk.canary.group.broker-2-250-7d92cb9c-d61b-429a-bd70-f5e3ddcd8bc5   on LeaveGroup; client reason: not provided)   (kafka.coordinator.group.GroupCoordinator)
[2022-08-12 18:30:35,094] INFO   [GroupCoordinator 2]: Group amazon.msk.canary.group.broker-2 with generation   498 is now empty (__consumer_offsets-20)   (kafka.coordinator.group.GroupCoordinator)
[2022-08-12 18:30:35,094] INFO   [GroupCoordinator 2]: Member   MemberMetadata(memberId=consumer-amazon.msk.canary.group.broker-2-250-7d92cb9c-d61b-429a-bd70-f5e3ddcd8bc5,   groupInstanceId=None, clientId=consumer-amazon.msk.canary.group.broker-2-250,   clientHost=/INTERNAL_IP, sessionTimeoutMs=10000, rebalanceTimeoutMs=300000,   supportedProtocols=List(range)) has left group   amazon.msk.canary.group.broker-2 through explicit `LeaveGroup`; client   reason: not provided (kafka.coordinator.group.GroupCoordinator)
[2022-08-12 18:30:36,363] INFO   [GroupCoordinator 1]: Stabilized group amazon.msk.canary.group.broker-1   generation 497 (__consumer_offsets-21) with 1 members   (kafka.coordinator.group.GroupCoordinator)
[2022-08-12 18:30:36,374] INFO   [GroupCoordinator 1]: Assignment received from leader   consumer-amazon.msk.canary.group.broker-1-249-67fa4890-cbf1-416f-b7be-b956392889ec   for group amazon.msk.canary.group.broker-1 for generation 497. The group has   1 members, 0 of which are static. (kafka.coordinator.group.GroupCoordinator)
[2022-08-12 18:30:39,068] INFO   [GroupCoordinator 1]: Preparing to rebalance group   amazon.msk.canary.group.broker-1 in state PreparingRebalance with old   generation 497 (__consumer_offsets-21) (reason: Removing member   consumer-amazon.msk.canary.group.broker-1-249-67fa4890-cbf1-416f-b7be-b956392889ec   on LeaveGroup; client reason: not provided)   (kafka.coordinator.group.GroupCoordinator)
[2022-08-12 18:30:39,068] INFO   [GroupCoordinator 1]: Group amazon.msk.canary.group.broker-1 with generation   498 is now empty (__consumer_offsets-21)   (kafka.coordinator.group.GroupCoordinator)
[2022-08-12 18:30:39,069] INFO   [GroupCoordinator 1]: Member   MemberMetadata(memberId=consumer-amazon.msk.canary.group.broker-1-249-67fa4890-cbf1-416f-b7be-b956392889ec,   groupInstanceId=None, clientId=consumer-amazon.msk.canary.group.broker-1-249,   clientHost=/INTERNAL_IP, sessionTimeoutMs=10000, rebalanceTimeoutMs=300000,   supportedProtocols=List(range)) has left group   amazon.msk.canary.group.broker-1 through explicit `LeaveGroup`; client   reason: not provided (kafka.coordinator.group.GroupCoordinator)
[2022-08-12 18:30:54,461] INFO   [SocketServer listenerType=ZK_BROKER, nodeId=1] Failed authentication with   /INTERNAL_IP (channelId=INTERNAL_IP-INTERNAL_IP-891)   ([81178052-a07f-489c-8f3a-eb9e5054543a]: Internal error)   (org.apache.kafka.common.network.Selector)
[2022-08-12 18:31:08,296] INFO Kafka   version: 3.2.0 (org.apache.kafka.common.utils.AppInfoParser)
[2022-08-12 18:31:08,296] INFO Kafka   commitId: unknown (org.apache.kafka.common.utils.AppInfoParser)
[2022-08-12 18:31:08,296] INFO Kafka   startTimeMs: 1660329068296 (org.apache.kafka.common.utils.AppInfoParser)
[2022-08-12 18:31:08,305] INFO Kafka   version: 3.2.0 (org.apache.kafka.common.utils.AppInfoParser)
[2022-08-12 18:31:08,305] INFO Kafka   commitId: unknown (org.apache.kafka.common.utils.AppInfoParser)
[2022-08-12 18:31:08,305] INFO Kafka   startTimeMs: 1660329068305 (org.apache.kafka.common.utils.AppInfoParser)
[2022-08-12 18:31:11,898] INFO App info   kafka.admin.client for adminclient-247 unregistered   (org.apache.kafka.common.utils.AppInfoParser)
[2022-08-12 18:31:11,899] INFO Metrics   scheduler closed (org.apache.kafka.common.metrics.Metrics)
[2022-08-12 18:31:11,899] INFO Closing   reporter org.apache.kafka.common.metrics.JmxReporter   (org.apache.kafka.common.metrics.Metrics)
[2022-08-12 18:31:11,899] INFO Metrics   reporters closed (org.apache.kafka.common.metrics.Metrics)
[2022-08-12 18:31:11,900] INFO Metrics   scheduler closed (org.apache.kafka.common.metrics.Metrics)
[2022-08-12 18:31:11,900] INFO Closing   reporter org.apache.kafka.common.metrics.JmxReporter   (org.apache.kafka.common.metrics.Metrics)
[2022-08-12 18:31:11,900] INFO Metrics   reporters closed (org.apache.kafka.common.metrics.Metrics)
[2022-08-12 18:31:11,901] INFO App info   kafka.consumer for consumer-consumer-lag-247 unregistered   (org.apache.kafka.common.utils.AppInfoParser)
[2022-08-12 18:31:22,065] INFO Kafka   version: 3.2.0 (org.apache.kafka.common.utils.AppInfoParser)
[2022-08-12 18:31:22,065] INFO Kafka   commitId: unknown (org.apache.kafka.common.utils.AppInfoParser)
[2022-08-12 18:31:22,065] INFO Kafka   startTimeMs: 1660329082065 (org.apache.kafka.common.utils.AppInfoParser)
[2022-08-12 18:31:22,075] INFO Kafka   version: 3.2.0 (org.apache.kafka.common.utils.AppInfoParser)
[2022-08-12 18:31:22,075] INFO Kafka   commitId: unknown (org.apache.kafka.common.utils.AppInfoParser)
[2022-08-12 18:31:22,075] INFO Kafka   startTimeMs: 1660329082075 (org.apache.kafka.common.utils.AppInfoParser)
[2022-08-12 18:31:25,369] INFO App info   kafka.admin.client for adminclient-247 unregistered   (org.apache.kafka.common.utils.AppInfoParser)
[2022-08-12 18:31:25,370] INFO Metrics   scheduler closed (org.apache.kafka.common.metrics.Metrics)
[2022-08-12 18:31:25,370] INFO Closing   reporter org.apache.kafka.common.metrics.JmxReporter   (org.apache.kafka.common.metrics.Metrics)
[2022-08-12 18:31:25,370] INFO Metrics   reporters closed (org.apache.kafka.common.metrics.Metrics)
[2022-08-12 18:31:25,371] INFO Metrics   scheduler closed (org.apache.kafka.common.metrics.Metrics)
[2022-08-12 18:31:25,371] INFO Closing   reporter org.apache.kafka.common.metrics.JmxReporter   (org.apache.kafka.common.metrics.Metrics)
[2022-08-12 18:31:25,371] INFO Metrics   reporters closed (org.apache.kafka.common.metrics.Metrics)
[2022-08-12 18:31:25,372] INFO App info   kafka.consumer for consumer-consumer-lag-247 unregistered   (org.apache.kafka.common.utils.AppInfoParser)
[2022-08-12 18:31:31,062] INFO   [GroupCoordinator 1]: Dynamic member with unknown member id joins group   amazon.msk.canary.group.broker-1 in Empty state. Created a new member id   consumer-amazon.msk.canary.group.broker-1-250-209fd432-abc8-4f32-8950-6340b4cd6674   and request the member to rejoin with this id.   (kafka.coordinator.group.GroupCoordinator)
[2022-08-12 18:31:31,070] INFO   [GroupCoordinator 1]: Preparing to rebalance group   amazon.msk.canary.group.broker-1 in state PreparingRebalance with old   generation 498 (__consumer_offsets-21) (reason: Adding new member   consumer-amazon.msk.canary.group.broker-1-250-209fd432-abc8-4f32-8950-6340b4cd6674   with group instance id None; client reason: not provided)   (kafka.coordinator.group.GroupCoordinator)
[2022-08-12 18:31:31,896] INFO   [GroupCoordinator 2]: Dynamic member with unknown member id joins group   amazon.msk.canary.group.broker-2 in Empty state. Created a new member id   consumer-amazon.msk.canary.group.broker-2-251-8c49f628-20e8-4655-80be-2c8b69667e4c   and request the member to rejoin with this id.   (kafka.coordinator.group.GroupCoordinator)
[2022-08-12 18:31:31,899] INFO   [GroupCoordinator 2]: Preparing to rebalance group   amazon.msk.canary.group.broker-2 in state PreparingRebalance with old   generation 498 (__consumer_offsets-20) (reason: Adding new member   consumer-amazon.msk.canary.group.broker-2-251-8c49f628-20e8-4655-80be-2c8b69667e4c   with group instance id None; client reason: not provided)   (kafka.coordinator.group.GroupCoordinator)
[2022-08-12 18:31:34,070] INFO   [GroupCoordinator 1]: Stabilized group amazon.msk.canary.group.broker-1   generation 499 (__consumer_offsets-21) with 1 members   (kafka.coordinator.group.GroupCoordinator)
[2022-08-12 18:31:34,159] INFO   [GroupCoordinator 1]: Assignment received from leader   consumer-amazon.msk.canary.group.broker-1-250-209fd432-abc8-4f32-8950-6340b4cd6674   for group amazon.msk.canary.group.broker-1 for generation 499. The group has   1 members, 0 of which are static. (kafka.coordinator.group.GroupCoordinator)
[2022-08-12 18:31:35,364] INFO   [GroupCoordinator 1]: Preparing to rebalance group   amazon.msk.canary.group.broker-1 in state PreparingRebalance with old   generation 499 (__consumer_offsets-21) (reason: Removing member   consumer-amazon.msk.canary.group.broker-1-250-209fd432-abc8-4f32-8950-6340b4cd6674   on LeaveGroup; client reason: not provided)   (kafka.coordinator.group.GroupCoordinator)
[2022-08-12 18:31:35,364] INFO   [GroupCoordinator 1]: Group amazon.msk.canary.group.broker-1 with generation   500 is now empty (__consumer_offsets-21)   (kafka.coordinator.group.GroupCoordinator)
[2022-08-12 18:31:35,364] INFO   [GroupCoordinator 1]: Member   MemberMetadata(memberId=consumer-amazon.msk.canary.group.broker-1-250-209fd432-abc8-4f32-8950-6340b4cd6674,   groupInstanceId=None, clientId=consumer-amazon.msk.canary.group.broker-1-250,   clientHost=/INTERNAL_IP, sessionTimeoutMs=10000, rebalanceTimeoutMs=300000,   supportedProtocols=List(range)) has left group   amazon.msk.canary.group.broker-1 through explicit `LeaveGroup`; client   reason: not provided (kafka.coordinator.group.GroupCoordinator)
[2022-08-12 18:31:34,899] INFO   [GroupCoordinator 2]: Stabilized group amazon.msk.canary.group.broker-2   generation 499 (__consumer_offsets-20) with 1 members   (kafka.coordinator.group.GroupCoordinator)
[2022-08-12 18:31:34,906] INFO   [GroupCoordinator 2]: Assignment received from leader   consumer-amazon.msk.canary.group.broker-2-251-8c49f628-20e8-4655-80be-2c8b69667e4c   for group amazon.msk.canary.group.broker-2 for generation 499. The group has   1 members, 0 of which are static. (kafka.coordinator.group.GroupCoordinator)
[2022-08-12 18:31:36,700] INFO   [GroupCoordinator 2]: Preparing to rebalance group   amazon.msk.canary.group.broker-2 in state PreparingRebalance with old   generation 499 (__consumer_offsets-20) (reason: Removing member   consumer-amazon.msk.canary.group.broker-2-251-8c49f628-20e8-4655-80be-2c8b69667e4c   on LeaveGroup; client reason: not provided)   (kafka.coordinator.group.GroupCoordinator)
[2022-08-12 18:31:36,700] INFO   [GroupCoordinator 2]: Group amazon.msk.canary.group.broker-2 with generation   500 is now empty (__consumer_offsets-20)   (kafka.coordinator.group.GroupCoordinator)
[2022-08-12 18:31:36,701] INFO   [GroupCoordinator 2]: Member   MemberMetadata(memberId=consumer-amazon.msk.canary.group.broker-2-251-8c49f628-20e8-4655-80be-2c8b69667e4c,   groupInstanceId=None, clientId=consumer-amazon.msk.canary.group.broker-2-251,   clientHost=/INTERNAL_IP, sessionTimeoutMs=10000, rebalanceTimeoutMs=300000,   supportedProtocols=List(range)) has left group   amazon.msk.canary.group.broker-2 through explicit `LeaveGroup`; client   reason: not provided (kafka.coordinator.group.GroupCoordinator)
[2022-08-12 18:32:12,008] INFO Kafka   version: 3.2.0 (org.apache.kafka.common.utils.AppInfoParser)
[2022-08-12 18:32:12,008] INFO Kafka   commitId: unknown (org.apache.kafka.common.utils.AppInfoParser)
[2022-08-12 18:32:12,008] INFO Kafka   startTimeMs: 1660329132008 (org.apache.kafka.common.utils.AppInfoParser)
[2022-08-12 18:32:12,094] INFO Kafka   version: 3.2.0 (org.apache.kafka.common.utils.AppInfoParser)
[2022-08-12 18:32:12,094] INFO Kafka   commitId: unknown (org.apache.kafka.common.utils.AppInfoParser)
[2022-08-12 18:32:12,094] INFO Kafka   startTimeMs: 1660329132094 (org.apache.kafka.common.utils.AppInfoParser)
[2022-08-12 18:32:16,290] INFO App info   kafka.admin.client for adminclient-248 unregistered   (org.apache.kafka.common.utils.AppInfoParser)
[2022-08-12 18:32:16,291] INFO Metrics   scheduler closed (org.apache.kafka.common.metrics.Metrics)
[2022-08-12 18:32:16,291] INFO Closing   reporter org.apache.kafka.common.metrics.JmxReporter   (org.apache.kafka.common.metrics.Metrics)
[2022-08-12 18:32:16,291] INFO Metrics   reporters closed (org.apache.kafka.common.metrics.Metrics)
[2022-08-12 18:32:16,292] INFO Metrics   scheduler closed (org.apache.kafka.common.metrics.Metrics)
[2022-08-12 18:32:16,292] INFO Closing   reporter org.apache.kafka.common.metrics.JmxReporter   (org.apache.kafka.common.metrics.Metrics)
[2022-08-12 18:32:16,292] INFO Metrics   reporters closed (org.apache.kafka.common.metrics.Metrics)
[2022-08-12 18:32:16,293] INFO App info   kafka.consumer for consumer-consumer-lag-248 unregistered   (org.apache.kafka.common.utils.AppInfoParser)
[2022-08-12 18:32:25,464] INFO Kafka   version: 3.2.0 (org.apache.kafka.common.utils.AppInfoParser)
[2022-08-12 18:32:25,464] INFO Kafka   commitId: unknown (org.apache.kafka.common.utils.AppInfoParser)
[2022-08-12 18:32:25,464] INFO Kafka   startTimeMs: 1660329145464 (org.apache.kafka.common.utils.AppInfoParser)
[2022-08-12 18:32:25,470] INFO Kafka   version: 3.2.0 (org.apache.kafka.common.utils.AppInfoParser)
[2022-08-12 18:32:25,470] INFO Kafka   commitId: unknown (org.apache.kafka.common.utils.AppInfoParser)
[2022-08-12 18:32:25,470] INFO Kafka   startTimeMs: 1660329145470 (org.apache.kafka.common.utils.AppInfoParser)
[2022-08-12 18:32:30,161] INFO App info   kafka.admin.client for adminclient-248 unregistered   (org.apache.kafka.common.utils.AppInfoParser)
[2022-08-12 18:32:30,162] INFO Metrics   scheduler closed (org.apache.kafka.common.metrics.Metrics)
[2022-08-12 18:32:30,162] INFO Closing   reporter org.apache.kafka.common.metrics.JmxReporter   (org.apache.kafka.common.metrics.Metrics)
[2022-08-12 18:32:30,162] INFO Metrics   reporters closed (org.apache.kafka.common.metrics.Metrics)
[2022-08-12 18:32:30,162] INFO Metrics   scheduler closed (org.apache.kafka.common.metrics.Metrics)
[2022-08-12 18:32:30,162] INFO Closing   reporter org.apache.kafka.common.metrics.JmxReporter   (org.apache.kafka.common.metrics.Metrics)
[2022-08-12 18:32:30,162] INFO Metrics   reporters closed (org.apache.kafka.common.metrics.Metrics)
[2022-08-12 18:32:30,163] INFO App info   kafka.consumer for consumer-consumer-lag-248 unregistered   (org.apache.kafka.common.utils.AppInfoParser)
[2022-08-12 18:32:31,093] INFO   [GroupCoordinator 2]: Dynamic member with unknown member id joins group   amazon.msk.canary.group.broker-2 in Empty state. Created a new member id   consumer-amazon.msk.canary.group.broker-2-252-74ea29cd-7118-4701-9525-ca3932c1c019   and request the member to rejoin with this id.   (kafka.coordinator.group.GroupCoordinator)
[2022-08-12 18:32:31,099] INFO   [GroupCoordinator 2]: Preparing to rebalance group   amazon.msk.canary.group.broker-2 in state PreparingRebalance with old   generation 500 (__consumer_offsets-20) (reason: Adding new member   consumer-amazon.msk.canary.group.broker-2-252-74ea29cd-7118-4701-9525-ca3932c1c019   with group instance id None; client reason: not provided)   (kafka.coordinator.group.GroupCoordinator)
[2022-08-12 18:32:31,873] INFO   [GroupCoordinator 1]: Dynamic member with unknown member id joins group   amazon.msk.canary.group.broker-1 in Empty state. Created a new member id   consumer-amazon.msk.canary.group.broker-1-251-4e197de4-9adc-4f12-91d4-046b15efec6d   and request the member to rejoin with this id.   (kafka.coordinator.group.GroupCoordinator)
[2022-08-12 18:32:31,962] INFO   [GroupCoordinator 1]: Preparing to rebalance group   amazon.msk.canary.group.broker-1 in state PreparingRebalance with old   generation 500 (__consumer_offsets-21) (reason: Adding new member   consumer-amazon.msk.canary.group.broker-1-251-4e197de4-9adc-4f12-91d4-046b15efec6d   with group instance id None; client reason: not provided)   (kafka.coordinator.group.GroupCoordinator)
[2022-08-12 18:32:34,099] INFO   [GroupCoordinator 2]: Stabilized group amazon.msk.canary.group.broker-2   generation 501 (__consumer_offsets-20) with 1 members   (kafka.coordinator.group.GroupCoordinator)
[2022-08-12 18:32:34,103] INFO   [GroupCoordinator 2]: Assignment received from leader   consumer-amazon.msk.canary.group.broker-2-252-74ea29cd-7118-4701-9525-ca3932c1c019   for group amazon.msk.canary.group.broker-2 for generation 501. The group has   1 members, 0 of which are static. (kafka.coordinator.group.GroupCoordinator)
[2022-08-12 18:32:34,961] INFO   [GroupCoordinator 1]: Stabilized group amazon.msk.canary.group.broker-1   generation 501 (__consumer_offsets-21) with 1 members   (kafka.coordinator.group.GroupCoordinator)
[2022-08-12 18:32:35,058] INFO   [GroupCoordinator 1]: Assignment received from leader   consumer-amazon.msk.canary.group.broker-1-251-4e197de4-9adc-4f12-91d4-046b15efec6d   for group amazon.msk.canary.group.broker-1 for generation 501. The group has   1 members, 0 of which are static. (kafka.coordinator.group.GroupCoordinator)
[2022-08-12 18:32:36,564] INFO   [GroupCoordinator 1]: Preparing to rebalance group   amazon.msk.canary.group.broker-1 in state PreparingRebalance with old   generation 501 (__consumer_offsets-21) (reason: Removing member   consumer-amazon.msk.canary.group.broker-1-251-4e197de4-9adc-4f12-91d4-046b15efec6d   on LeaveGroup; client reason: not provided)   (kafka.coordinator.group.GroupCoordinator)
[2022-08-12 18:32:36,564] INFO   [GroupCoordinator 1]: Group amazon.msk.canary.group.broker-1 with generation   502 is now empty (__consumer_offsets-21)   (kafka.coordinator.group.GroupCoordinator)
[2022-08-12 18:32:36,564] INFO   [GroupCoordinator 1]: Member   MemberMetadata(memberId=consumer-amazon.msk.canary.group.broker-1-251-4e197de4-9adc-4f12-91d4-046b15efec6d,   groupInstanceId=None, clientId=consumer-amazon.msk.canary.group.broker-1-251,   clientHost=/INTERNAL_IP, sessionTimeoutMs=10000, rebalanceTimeoutMs=300000,   supportedProtocols=List(range)) has left group   amazon.msk.canary.group.broker-1 through explicit `LeaveGroup`; client   reason: not provided (kafka.coordinator.group.GroupCoordinator)
[2022-08-12 18:32:36,194] INFO   [GroupCoordinator 2]: Preparing to rebalance group   amazon.msk.canary.group.broker-2 in state PreparingRebalance with old   generation 501 (__consumer_offsets-20) (reason: Removing member   consumer-amazon.msk.canary.group.broker-2-252-74ea29cd-7118-4701-9525-ca3932c1c019   on LeaveGroup; client reason: not provided)   (kafka.coordinator.group.GroupCoordinator)
[2022-08-12 18:32:36,194] INFO   [GroupCoordinator 2]: Group amazon.msk.canary.group.broker-2 with generation   502 is now empty (__consumer_offsets-20)   (kafka.coordinator.group.GroupCoordinator)
[2022-08-12 18:32:36,194] INFO   [GroupCoordinator 2]: Member   MemberMetadata(memberId=consumer-amazon.msk.canary.group.broker-2-252-74ea29cd-7118-4701-9525-ca3932c1c019,   groupInstanceId=None, clientId=consumer-amazon.msk.canary.group.broker-2-252,   clientHost=/INTERNAL_IP, sessionTimeoutMs=10000, rebalanceTimeoutMs=300000,   supportedProtocols=List(range)) has left group   amazon.msk.canary.group.broker-2 through explicit `LeaveGroup`; client   reason: not provided (kafka.coordinator.group.GroupCoordinator)
[2022-08-12 18:32:48,790] INFO   [SocketServer listenerType=ZK_BROKER, nodeId=2] Failed authentication with   /INTERNAL_IP (channelId=INTERNAL_IP-INTERNAL_IP-785)   ([8ae54768-2c4b-43c1-b840-600f68ad2648]: Internal error)   (org.apache.kafka.common.network.Selector)
[2022-08-12 18:33:16,396] INFO Kafka   version: 3.2.0 (org.apache.kafka.common.utils.AppInfoParser)
[2022-08-12 18:33:16,396] INFO Kafka   commitId: unknown (org.apache.kafka.common.utils.AppInfoParser)
[2022-08-12 18:33:16,396] INFO Kafka   startTimeMs: 1660329196396 (org.apache.kafka.common.utils.AppInfoParser)
[2022-08-12 18:33:16,492] INFO Kafka   version: 3.2.0 (org.apache.kafka.common.utils.AppInfoParser)
[2022-08-12 18:33:16,492] INFO Kafka   commitId: unknown (org.apache.kafka.common.utils.AppInfoParser)
[2022-08-12 18:33:16,492] INFO Kafka   startTimeMs: 1660329196492 (org.apache.kafka.common.utils.AppInfoParser)
[2022-08-12 18:33:20,791] INFO App info   kafka.admin.client for adminclient-249 unregistered   (org.apache.kafka.common.utils.AppInfoParser)
[2022-08-12 18:33:20,792] INFO Metrics   scheduler closed (org.apache.kafka.common.metrics.Metrics)
[2022-08-12 18:33:20,792] INFO Closing   reporter org.apache.kafka.common.metrics.JmxReporter   (org.apache.kafka.common.metrics.Metrics)
[2022-08-12 18:33:20,792] INFO Metrics   reporters closed (org.apache.kafka.common.metrics.Metrics)
[2022-08-12 18:33:20,794] INFO Metrics   scheduler closed (org.apache.kafka.common.metrics.Metrics)
[2022-08-12 18:33:20,794] INFO Closing   reporter org.apache.kafka.common.metrics.JmxReporter   (org.apache.kafka.common.metrics.Metrics)
[2022-08-12 18:33:20,794] INFO Metrics   reporters closed (org.apache.kafka.common.metrics.Metrics)
[2022-08-12 18:33:20,795] INFO App info   kafka.consumer for consumer-consumer-lag-249 unregistered   (org.apache.kafka.common.utils.AppInfoParser)
[2022-08-12 18:33:30,460] INFO Kafka   version: 3.2.0 (org.apache.kafka.common.utils.AppInfoParser)
[2022-08-12 18:33:30,460] INFO Kafka   commitId: unknown (org.apache.kafka.common.utils.AppInfoParser)
[2022-08-12 18:33:30,460] INFO Kafka   startTimeMs: 1660329210460 (org.apache.kafka.common.utils.AppInfoParser)
[2022-08-12 18:33:30,475] INFO Kafka   version: 3.2.0 (org.apache.kafka.common.utils.AppInfoParser)
[2022-08-12 18:33:30,475] INFO Kafka   commitId: unknown (org.apache.kafka.common.utils.AppInfoParser)
[2022-08-12 18:33:30,475] INFO Kafka   startTimeMs: 1660329210475 (org.apache.kafka.common.utils.AppInfoParser)
[2022-08-12 18:33:31,465] INFO   [GroupCoordinator 1]: Dynamic member with unknown member id joins group   amazon.msk.canary.group.broker-1 in Empty state. Created a new member id   consumer-amazon.msk.canary.group.broker-1-252-e0e8aa18-a06b-4130-8bbb-4fa4f40a3577   and request the member to rejoin with this id.   (kafka.coordinator.group.GroupCoordinator)
[2022-08-12 18:33:31,470] INFO   [GroupCoordinator 1]: Preparing to rebalance group   amazon.msk.canary.group.broker-1 in state PreparingRebalance with old   generation 502 (__consumer_offsets-21) (reason: Adding new member   consumer-amazon.msk.canary.group.broker-1-252-e0e8aa18-a06b-4130-8bbb-4fa4f40a3577   with group instance id None; client reason: not provided)   (kafka.coordinator.group.GroupCoordinator)
[2022-08-12 18:33:31,400] INFO   [GroupCoordinator 2]: Dynamic member with unknown member id joins group   amazon.msk.canary.group.broker-2 in Empty state. Created a new member id   consumer-amazon.msk.canary.group.broker-2-253-f7c56246-8492-4df7-b457-5e5e4dd28b66   and request the member to rejoin with this id.   (kafka.coordinator.group.GroupCoordinator)
[2022-08-12 18:33:31,491] INFO   [GroupCoordinator 2]: Preparing to rebalance group   amazon.msk.canary.group.broker-2 in state PreparingRebalance with old   generation 502 (__consumer_offsets-20) (reason: Adding new member   consumer-amazon.msk.canary.group.broker-2-253-f7c56246-8492-4df7-b457-5e5e4dd28b66   with group instance id None; client reason: not provided)   (kafka.coordinator.group.GroupCoordinator)
[2022-08-12 18:33:34,559] INFO   [GroupCoordinator 1]: Stabilized group amazon.msk.canary.group.broker-1   generation 503 (__consumer_offsets-21) with 1 members   (kafka.coordinator.group.GroupCoordinator)
[2022-08-12 18:33:34,567] INFO App info   kafka.admin.client for adminclient-249 unregistered   (org.apache.kafka.common.utils.AppInfoParser)
[2022-08-12 18:33:34,568] INFO Metrics   scheduler closed (org.apache.kafka.common.metrics.Metrics)
[2022-08-12 18:33:34,568] INFO Closing   reporter org.apache.kafka.common.metrics.JmxReporter   (org.apache.kafka.common.metrics.Metrics)
[2022-08-12 18:33:34,568] INFO Metrics   reporters closed (org.apache.kafka.common.metrics.Metrics)
[2022-08-12 18:33:34,569] INFO Metrics   scheduler closed (org.apache.kafka.common.metrics.Metrics)
[2022-08-12 18:33:34,569] INFO Closing   reporter org.apache.kafka.common.metrics.JmxReporter   (org.apache.kafka.common.metrics.Metrics)
[2022-08-12 18:33:34,569] INFO Metrics   reporters closed (org.apache.kafka.common.metrics.Metrics)
[2022-08-12 18:33:34,569] INFO App info   kafka.consumer for consumer-consumer-lag-249 unregistered   (org.apache.kafka.common.utils.AppInfoParser)
[2022-08-12 18:33:34,758] INFO   [GroupCoordinator 1]: Assignment received from leader   consumer-amazon.msk.canary.group.broker-1-252-e0e8aa18-a06b-4130-8bbb-4fa4f40a3577   for group amazon.msk.canary.group.broker-1 for generation 503. The group has   1 members, 0 of which are static. (kafka.coordinator.group.GroupCoordinator)
[2022-08-12 18:33:34,492] INFO   [GroupCoordinator 2]: Stabilized group amazon.msk.canary.group.broker-2   generation 503 (__consumer_offsets-20) with 1 members   (kafka.coordinator.group.GroupCoordinator)
[2022-08-12 18:33:34,499] INFO   [GroupCoordinator 2]: Assignment received from leader   consumer-amazon.msk.canary.group.broker-2-253-f7c56246-8492-4df7-b457-5e5e4dd28b66   for group amazon.msk.canary.group.broker-2 for generation 503. The group has   1 members, 0 of which are static. (kafka.coordinator.group.GroupCoordinator)
[2022-08-12 18:33:36,563] INFO   [GroupCoordinator 1]: Preparing to rebalance group   amazon.msk.canary.group.broker-1 in state PreparingRebalance with old   generation 503 (__consumer_offsets-21) (reason: Removing member   consumer-amazon.msk.canary.group.broker-1-252-e0e8aa18-a06b-4130-8bbb-4fa4f40a3577   on LeaveGroup; client reason: not provided)   (kafka.coordinator.group.GroupCoordinator)
[2022-08-12 18:33:36,564] INFO   [GroupCoordinator 1]: Group amazon.msk.canary.group.broker-1 with generation   504 is now empty (__consumer_offsets-21)   (kafka.coordinator.group.GroupCoordinator)
[2022-08-12 18:33:35,497] INFO   [GroupCoordinator 2]: Preparing to rebalance group   amazon.msk.canary.group.broker-2 in state PreparingRebalance with old   generation 503 (__consumer_offsets-20) (reason: Removing member   consumer-amazon.msk.canary.group.broker-2-253-f7c56246-8492-4df7-b457-5e5e4dd28b66   on LeaveGroup; client reason: not provided)   (kafka.coordinator.group.GroupCoordinator)
[2022-08-12 18:33:35,497] INFO   [GroupCoordinator 2]: Group amazon.msk.canary.group.broker-2 with generation   504 is now empty (__consumer_offsets-20)   (kafka.coordinator.group.GroupCoordinator)
[2022-08-12 18:33:35,498] INFO   [GroupCoordinator 2]: Member   MemberMetadata(memberId=consumer-amazon.msk.canary.group.broker-2-253-f7c56246-8492-4df7-b457-5e5e4dd28b66,   groupInstanceId=None, clientId=consumer-amazon.msk.canary.group.broker-2-253,   clientHost=/INTERNAL_IP, sessionTimeoutMs=10000, rebalanceTimeoutMs=300000,   supportedProtocols=List(range)) has left group   amazon.msk.canary.group.broker-2 through explicit `LeaveGroup`; client   reason: not provided (kafka.coordinator.group.GroupCoordinator)
[2022-08-12 18:33:36,564] INFO   [GroupCoordinator 1]: Member   MemberMetadata(memberId=consumer-amazon.msk.canary.group.broker-1-252-e0e8aa18-a06b-4130-8bbb-4fa4f40a3577,   groupInstanceId=None, clientId=consumer-amazon.msk.canary.group.broker-1-252,   clientHost=/INTERNAL_IP, sessionTimeoutMs=10000, rebalanceTimeoutMs=300000,   supportedProtocols=List(range)) has left group   amazon.msk.canary.group.broker-1 through explicit `LeaveGroup`; client   reason: not provided (kafka.coordinator.group.GroupCoordinator)

Environment:

Additional context The weird part is that we load tested a simple repo with the same code with 30+ lambdas and 4K messages, and it worked as expected, although we manually invoked the lambda.

prad9 commented 1 year ago

Same issue connecting to Confluent cloud hosted cluster from producer. This is intermittent issue but doesn't happen all times

kafkajs: 2.1.0 @kafkajs/confluent-schema-registry: 3.3.0

Configuration:

const sasl: SASLOptions = {
    username: secrets.kafkaClusterApiKey,
    password: secrets.kafkaClusterApiSecret,
    mechanism: "plain",
};
const kafkaConfig: KafkaConfig = {
    clientId: clientId,
    brokers: [kafkaClusterEndpoint],
    ssl: true,
    sasl,
};
const kafka = new Kafka(kafkaConfig);
const producer = kafka.producer({ createPartitioner: Partitioners.DefaultPartitioner });
logger.info("Kafka producer created");

await producer.connect();

Error logs:


{
    "level": "ERROR",
    "timestamp": "2023-03-27T02:34:39.463Z",
    "logger": "kafkajs",
    "message": "[Connection] Connection error: Client network socket disconnected before secure TLS connection was established",
    "broker": "****:9092",
    "clientId": "agents-call-data-collection",
    "stack": "Error: Client network socket disconnected before secure TLS connection was established\n    at connResetException (internal/errors.js:639:14)\n    at TLSSocket.onConnectEnd (_tls_wrap.js:1570:19)\n    at TLSSocket.emit (events.js:412:35)\n    at endReadableNT (internal/streams/readable.js:1333:12)\n    at processTicksAndRejections (internal/process/task_queues.js:82:21)"
}

2023-03-27T02:34:39.463Z    bb588c6d-58e6-44ba-abf5-aef5d106a4dc    ERROR   
{
    "level": "ERROR",
    "timestamp": "2023-03-27T02:34:39.463Z",
    "logger": "kafkajs",
    "message": "[BrokerPool] Failed to connect to seed broker, trying another broker from the list: Connection error: Client network socket disconnected before secure TLS connection was established",
    "retryCount": 0,
    "retryTime": 334
}```
dhvidding-accenture commented 1 year ago

What is the right way to use kafkajs with lambda as I assume you would be making a lot more connections than you normally would with a microservice which can easily manage a long-standing connection and reuse rather than lambda invocations that get stood up and torn down ad hoc. With AWS MSK with IAM, it only supports 20 connections per second for large brokers which if you are chatting across a bunch of topics can easily be maxxed.

Borduhh commented 1 year ago

We just removed the IAM authentication and put everything in a private VPC. That blocks the outside world from accessing the cluster and allows our Lambdas to access the internet via a NAT gateway. We have been using this setup in our production environment now for almost a year as a backbone event source, and it is working beautifully.

The reasoning was that IAM authentication was causing CPU issues with the constant connects/disconnects in the MSK cluster.

dhvidding-accenture commented 1 year ago

Thank you @Borduhh this may be the solution for us as well

cjnoname commented 11 months ago

We just removed the IAM authentication and put everything in a private VPC. That blocks the outside world from accessing the cluster and allows our Lambdas to access the internet via a NAT gateway. We have been using this setup in our production environment now for almost a year as a backbone event source, and it is working beautifully.

The reasoning was that IAM authentication was causing CPU issues with the constant connects/disconnects in the MSK cluster.

Hi @Borduhh ,

We are facing the same issue. Could you show me how did you remove the IAM authentication on MSK?

Thanks

Borduhh commented 11 months ago

We just removed the IAM authentication and put everything in a private VPC. That blocks the outside world from accessing the cluster and allows our Lambdas to access the internet via a NAT gateway. We have been using this setup in our production environment now for almost a year as a backbone event source, and it is working beautifully. The reasoning was that IAM authentication was causing CPU issues with the constant connects/disconnects in the MSK cluster.

Hi @Borduhh ,

We are facing the same issue. Could you show me how did you remove the IAM authentication on MSK?

Thanks

When you create the instance, do not select AWS IAM as an Authentication Type. I believe it has to be in a private Subnet to do so.

Screenshot 2023-10-31 at 9 46 30 AM

alvincrisuy commented 11 months ago

Hi @Borduhh what authentication type do you use if IAM is not used? Also if there is no bootstrap servers, what can we use for the list of brokers?

My current setup is in IAM and already in VPC.

Thank you!

dhvidding-accenture commented 11 months ago

You can use mutual TLS or SASL/SCRAM and secrets manager for authentication as noted in their documentation here: https://docs.aws.amazon.com/msk/latest/developerguide/kafka_apis_iam.html. You can use Kafka ACL's for authorization if you need to as well. It looks like both mutual TLS and SASL/SCRAM with secrets manager are supported by the Lambda triggers as well if that is your setup.

cjnoname commented 11 months ago

Hi,

I resolved this issue by implementing the singleton pattern for the producer and gracefully disconnecting the producer after the Lambda was shut down.

Source code:

import { Type, awsIamAuthenticator } from "@jm18457/kafkajs-msk-iam-authentication-mechanism";
import { Kafka, CompressionTypes } from "kafkajs";
import { v4 as uuid } from "uuid";
import type { Admin, Message, Producer } from "kafkajs";

let kafka: Kafka | undefined;
let kafkaProducer: Producer | undefined;
let kafkaAdmin: Admin | undefined;

// SIGTERM for gracefully shutdown
process.on("SIGTERM", async () => {
  console.info("[runtime] SIGTERM received");

  console.info("[runtime] cleaning up producer..", kafkaProducer);
  // perform actual clean up work here.
  await kafkaProducer?.disconnect();

  console.info("[runtime] exiting");
  process.exit(0);
});

const getKafkaClient = () => {
  if (!kafka) {
    kafka = new Kafka({
      clientId: uuid(),
      ssl: true,
      brokers: process.env.MSK_CLUSTER_BROKERS!.split(","),
      retry: {
        retries: 10,
        initialRetryTime: 100,
        factor: 0.2,
        multiplier: 3,
        maxRetryTime: 300000
      },
      sasl: {
        mechanism: Type,
        authenticationProvider: awsIamAuthenticator({ region: process.env.AWS_REGION! })
      }
    });
  }
  return kafka;
};

export const getKafkaAdmin = () => {
  if (!kafkaAdmin) {
    kafkaAdmin = getKafkaClient().admin();
  }
  return kafkaAdmin;
};

export interface KafkaPayload<T = any> {
  data: T;
  uniqueKey: any;
}

export const sendKafkaMessages = async ({
  data,
  topic,
  partition,
  isRetry
}: {
  data: any[];
  topic: string;
  // set the key if the events need to be run in sequence
  partition?: number;
  isRetry?: boolean;
}) => {
  if (!kafkaProducer) {
    kafkaProducer = getKafkaClient().producer({
      idempotent: true
    });
    await kafkaProducer.connect();
  }
  await kafkaProducer.send({
    messages: data.map((data) => {
      const kafkaPayload: KafkaPayload = {
        data,
        uniqueKey: uuid()
      };
      const message: Message = {
        value: JSON.stringify(kafkaPayload),
        partition,
        headers: isRetry ? { ["isRetry"]: "" } : undefined
      };
      return message;
    }),
    topic: topic,
    compression: CompressionTypes.GZIP
  });
};

Tips:

1) Create a single producer with active connection, store it somewhere and reuse it instead of connecting for each call. 2) Use SIGTERM for closing the half-opened connection when Lambda was shut down.

After this change, the active connection on broker dropped from 3000 (AWS hard limit) to 100, and the error was gone.

Reference: https://github.com/tulios/kafkajs/issues/1423#issuecomment-1193800427

dhvidding-accenture commented 11 months ago

Yes, but if you are using a straight lambda architecture, you cannot keep the singleton alive all the time like this. We actually have something very similar but since there are many lambda invocations it still means a lot of created connections which unfortunately with IAM I guess is too costly an operation. We have not tried it yet, but hoping that AWS MSK Serverless with IAM might help elastically scale as we need connections.

Borduhh commented 11 months ago

Hi @Borduhh what authentication type do you use if IAM is not used? Also if there is no bootstrap servers, what can we use for the list of brokers?

My current setup is in IAM and already in VPC.

Thank you!

We don't have any authentication. Since the cluster is in the private subnet of our VPC, only our internal resources can access it.

cjnoname commented 11 months ago

Yes, but if you are using a straight lambda architecture, you cannot keep the singleton alive all the time like this. We actually have something very similar but since there are many lambda invocations it still means a lot of created connections which unfortunately with IAM I guess is too costly an operation. We have not tried it yet, but hoping that AWS MSK Serverless with IAM might help elastically scale as we need connections.

As I tried, it can keep alive. Because once it's connected, it will automatically reconnect by kafkajs once the current connection was closed, so you don't need to do anything.

I don't know how many Lambdas do you have running in parallel. There are multiple brokers for MSK, each broker can handle 3000 connections, and you can add the broker number to whatever numbers you want. I believe there won't be any bottlenecks even if you have 10000 Lambda running concurrently for MSK if you increased the brokers to proper size.

Would suggest to try my code and see if it works. Cheers buddy.