socketio / socket.io

Realtime application framework (Node.JS server)
https://socket.io
MIT License

Possible Memory Leak Issue #4738

Open · devashish-gopalani-cognoai opened this issue 1 year ago

devashish-gopalani-cognoai commented 1 year ago

Describe the bug

I have been trying to load test my Socket.IO service using artillery (the load-test.yml script is attached). Unfortunately, when I do so, I end up with the following results:

All VUs finished. Total time: 12 minutes, 42 seconds
--------------------------------
Summary report @ 12:39:47(+0530)
--------------------------------
engine.socketio.emit: .......................................................... 137254
engine.socketio.emit_rate: ..................................................... 182/sec
errors.Error: timeout: ......................................................... 110
errors.Error: websocket error: ................................................. 317
errors.timeout: ................................................................ 150
errors.websocket error: ........................................................ 257
vusers.completed: .............................................................. 1774
vusers.created: ................................................................ 2400
vusers.created_by_name.Load Test Scenario: ..................................... 2400
vusers.failed: ................................................................. 626
vusers.session_length:
min: ......................................................................... 140078.2
max: ......................................................................... 200970.3
median: ...................................................................... 148798.3
p95: ......................................................................... 167770.3
p99: ......................................................................... 174617.4
Log file: test-case-7.json

The vusers.failed count has also gone up to 1500 in other tests that I have run. To debug further, I enabled logging on my server and found that there seems to be a memory leak, which caused the service to stop and restart repeatedly. The log I found on the server was:

Jun 09 12:10:37 ip-10-100-2-215 /var/log/socket_service.log[45021]: <--- Last few GCs --->
Jun 09 12:10:37 ip-10-100-2-215 /var/log/socket_service.log[45021]: [45021:0x69022b0]   100805 ms: Scavenge (reduce) 2041.4 (2082.5) -> 2040.7 (2082.7) MB, 4.0 / 0.0 ms  (average mu = 0.308, current mu = 0.292) allocation failure;
Jun 09 12:10:37 ip-10-100-2-215 /var/log/socket_service.log[45021]: [45021:0x69022b0]   101973 ms: Mark-sweep (reduce) 2041.5 (2082.7) -> 2039.8 (2083.0) MB, 1165.3 / 0.0 ms  (average mu = 0.182, current mu = 0.050) allocation failure; scavenge might not succeed
Jun 09 12:10:37 ip-10-100-2-215 /var/log/socket_service.log[45021]: <--- JS stacktrace --->
Jun 09 12:10:37 ip-10-100-2-215 /var/log/socket_service.log[45021]: FATAL ERROR: Ineffective mark-compacts near heap limit Allocation failed - JavaScript heap out of memory
Jun 09 12:10:37 ip-10-100-2-215 /var/log/socket_service.log[45021]:  1: 0xb7a940 node::Abort() [/home/ubuntu/.nvm/versions/node/v18.16.0/bin/node]
Jun 09 12:10:37 ip-10-100-2-215 /var/log/socket_service.log[45021]:  2: 0xa8e823  [/home/ubuntu/.nvm/versions/node/v18.16.0/bin/node]
Jun 09 12:10:37 ip-10-100-2-215 /var/log/socket_service.log[45021]:  3: 0xd5c940 v8::Utils::ReportOOMFailure(v8::internal::Isolate*, char const*, bool) [/home/ubuntu/.nvm/versions/node/v18.16.0/bin/node]
Jun 09 12:10:37 ip-10-100-2-215 /var/log/socket_service.log[45021]:  4: 0xd5cce7 v8::internal::V8::FatalProcessOutOfMemory(v8::internal::Isolate*, char const*, bool) [/home/ubuntu/.nvm/versions/node/v18.16.0/bin/node]
Jun 09 12:10:37 ip-10-100-2-215 /var/log/socket_service.log[45021]:  5: 0xf3a3e5  [/home/ubuntu/.nvm/versions/node/v18.16.0/bin/node]
Jun 09 12:10:37 ip-10-100-2-215 /var/log/socket_service.log[45021]:  6: 0xf3b2e8 v8::internal::Heap::RecomputeLimits(v8::internal::GarbageCollector) [/home/ubuntu/.nvm/versions/node/v18.16.0/bin/node]
Jun 09 12:10:37 ip-10-100-2-215 /var/log/socket_service.log[45021]:  7: 0xf4b7f3  [/home/ubuntu/.nvm/versions/node/v18.16.0/bin/node]
Jun 09 12:10:37 ip-10-100-2-215 /var/log/socket_service.log[45021]:  8: 0xf4c668 v8::internal::Heap::CollectGarbage(v8::internal::AllocationSpace, v8::internal::GarbageCollectionReason, v8::GCCallbackFlags) [/home/ubuntu/.nvm/versions/node/v18.16.0/bin/node]
Jun 09 12:10:37 ip-10-100-2-215 /var/log/socket_service.log[45021]:  9: 0xf26fce v8::internal::HeapAllocator::AllocateRawWithLightRetrySlowPath(int, v8::internal::AllocationType, v8::internal::AllocationOrigin, v8::internal::AllocationAlignment) [/home/ubuntu/.nvm/versions/node/v18.16.0/bin/node]
Jun 09 12:10:37 ip-10-100-2-215 /var/log/socket_service.log[45021]: 10: 0xf28397 v8::internal::HeapAllocator::AllocateRawWithRetryOrFailSlowPath(int, v8::internal::AllocationType, v8::internal::AllocationOrigin, v8::internal::AllocationAlignment) [/home/ubuntu/.nvm/versions/node/v18.16.0/bin/node]
Jun 09 12:10:37 ip-10-100-2-215 /var/log/socket_service.log[45021]: 11: 0xf0956a v8::internal::Factory::NewFillerObject(int, v8::internal::AllocationAlignment, v8::internal::AllocationType, v8::internal::AllocationOrigin) [/home/ubuntu/.nvm/versions/node/v18.16.0/bin/node]
Jun 09 12:10:37 ip-10-100-2-215 /var/log/socket_service.log[45021]: 12: 0x12ce7af v8::internal::Runtime_AllocateInYoungGeneration(int, unsigned long*, v8::internal::Isolate*) [/home/ubuntu/.nvm/versions/node/v18.16.0/bin/node]
Jun 09 12:10:37 ip-10-100-2-215 /var/log/socket_service.log[45021]: 13: 0x16fb6b9  [/home/ubuntu/.nvm/versions/node/v18.16.0/bin/node]
Jun 09 12:10:37 ip-10-100-2-215 systemd[1]: socket-io.service: Main process exited, code=dumped, status=6/ABRT
Jun 09 12:10:37 ip-10-100-2-215 systemd[1]: socket-io.service: Failed with result 'core-dump'.
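
For context, the ~2 GB heap figures in the trace above look consistent with Node's default old-space limit on this kind of machine. Raising the limit does not fix a leak, but it can help confirm whether memory growth eventually plateaus; a minimal sketch, assuming the entry point is index.js:

node --max-old-space-size=4096 index.js   # let V8 use ~4 GB of old space (value in MB)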

I have also taken heap snapshots, and unfortunately none of them point to code that I have written. Would this indicate that there is something wrong with the library? Please correct me if I am wrong. I am attaching screenshots of the comparison between heap snapshots:

(Four screenshots comparing heap snapshots, taken 2023-06-15, attached.)
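
For reference, heap snapshots like these can also be captured from the running process with Node's built-in v8 module, so that successive snapshots can be compared in Chrome DevTools. A minimal sketch (the SIGUSR2 trigger is just one possible convention):

const v8 = require("v8");

// On SIGUSR2, write a .heapsnapshot file into the working directory;
// the returned filename can then be loaded into Chrome DevTools.
process.on("SIGUSR2", () => {
  const file = v8.writeHeapSnapshot();
  console.log(`heap snapshot written to ${file}`);
});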

To Reproduce

Socket.IO server version: 4.6.0

Server

index.js

const { Server } = require("socket.io");
const { createCustomLogger } = require("./utils/logger");
const logger = createCustomLogger();
// let garbageCollectorInterval = null;

import { createClient } from "redis";
import { createAdapter } from "@socket.io/redis-streams-adapter";
import { MAIN_NAMESPACE_PREFIX } from "./utils/constants";
import initializeCobrowseNamespace from "./namespaces/cobrowse";
import initializeLiveChatNamespace from "./namespaces/livechat";
import {
  REDIS_HOST,
  REDIS_PORT,
  CORS_WHITELISTED_DOMAINS,
} from "./utils/config";
const { instrument, RedisStore } = require("@socket.io/admin-ui");

const redisClient = createClient({
  socket: {
    host: REDIS_HOST,
    port: REDIS_PORT,
  },
});

// connect() is asynchronous, so handle the rejection instead of relying on a
// synchronous try/catch (which would never see a connection failure).
redisClient.connect().catch((error) => {
  logger.error(`Could not connect to redis client = ${error}`);
});

/*
If the serveClient flag were to be marked as true then make sure required changes are done in the 
webpack.config.js file.
Documentation - https://socket.io/docs/v4/server-with-bundlers/
*/

const io = new Server(3000, {
  path: "/websocket/",
  maxHttpBufferSize: 1e8, // 100 MB
  adapter: createAdapter(redisClient),
  cors: {
    origin: CORS_WHITELISTED_DOMAINS,
    credentials: true,
  },
  serveClient: false,
});

/*
Setting up the monitoring tool for the socket-io servers
*/
instrument(io, {
  auth: {
    type: "basic",
    username: "cognoai-test",
    password: "$2a$10$rhCtnxXuBRcBF2t3I4kFUOeHN/yLlXq1rl5iDQvSckuw7HVQ7eoZy",
  },
  namespaceName: "/admin-ui",
  serverId: `${require("os").hostname()}#${process.pid}`,
  store: new RedisStore(redisClient),
  mode: "production",
});

initializeCobrowseNamespace(io, logger);
// initializeLiveChatNamespace(io, logger);

/*
Event - "connection"
Event triggered when a socket connection is established
*/

io.on("connection", (socket) => {
  logger.info(
    `${MAIN_NAMESPACE_PREFIX} request URL of the connection = ${socket.handshake.url}`
  );

  socket.on("test_event", (data) => {
    logger.info(`${MAIN_NAMESPACE_PREFIX} received data = ${data}`);
    io.emit("test_event_server", data);
  });

  socket.on("disconnect", (reason) => {
    logger.warn(
      `${MAIN_NAMESPACE_PREFIX} socket got disconnected due to the following reason = ${reason}`
    );
  });
});

/*
Event - "connection_error"
Event will be emitted when a connection is abnormally closed
*/

io.engine.on("connection_error", (errorObject) => {
  logger.error(
    `${MAIN_NAMESPACE_PREFIX} connection_error event with request obj = ${errorObject.req} | error code = ${errorObject.code} | error message = ${errorObject.message} | error context = ${errorObject.context}`
  );
});

cobrowse.js

import { COBROWSE_NAMESPACE_PREFIX } from "../utils/constants";

const attachRoomEventListeners = (cobrowseNamespace, logger) => {
  logger.info(`${COBROWSE_NAMESPACE_PREFIX} attaching room event listeners`);

  cobrowseNamespace.adapter.on("create-room", (room) => {
    logger.info(`${COBROWSE_NAMESPACE_PREFIX} room ${room} was created`);
  });

  cobrowseNamespace.adapter.on("delete-room", (room) => {
    logger.info(`${COBROWSE_NAMESPACE_PREFIX} room ${room} was deleted`);
  });

  cobrowseNamespace.adapter.on("join-room", (room, id) => {
    logger.info(
      `${COBROWSE_NAMESPACE_PREFIX} socket ${id} has joined room ${room}`
    );
  });

  cobrowseNamespace.adapter.on("leave-room", (room, id) => {
    logger.info(
      `${COBROWSE_NAMESPACE_PREFIX} socket ${id} has left room ${room}`
    );
  });
};

const initializeCobrowseNamespace = (io, logger) => {
  logger.info("Initialising co-browse namespace");

  const cobrowseNamespace = io.of("/cobrowse");

  attachRoomEventListeners(cobrowseNamespace, logger);

  cobrowseNamespace.on("connection", (socket) => {
    logger.info(
      `${COBROWSE_NAMESPACE_PREFIX} request URL of the connection = ${socket.handshake.url}`
    );

    /*
    Extracting the room key from the query parameter. Once we get the room key, we add the 
    socket to the room created on the basis of that room key.
    */
    let roomName = socket.handshake.query.roomKey;

    logger.info(
      `${COBROWSE_NAMESPACE_PREFIX} room key extracted = ${roomName}`
    );

    socket.join(roomName);

    socket.on("cobrowse_packet", (data) => {
      logger.info(
        `${COBROWSE_NAMESPACE_PREFIX} received data cobrowse_packet = ${Object.keys(
          data
        )}`
      );
      cobrowseNamespace.to(roomName).emit("cobrowse_packet", data);
    });

    socket.on("cobrowse_sync_utils_packet", (data) => {
      logger.info(
        `${COBROWSE_NAMESPACE_PREFIX} received data cobrowse_sync_utils_packet = ${Object.keys(
          data
        )}`
      );
      cobrowseNamespace.to(roomName).emit("cobrowse_sync_utils_packet", data);
    });

    socket.on("cobrowse_signal_packet", (data) => {
      logger.info(
        `${COBROWSE_NAMESPACE_PREFIX} received data cobrowse_signal_packet = ${Object.keys(
          data
        )}`
      );
      cobrowseNamespace.to(roomName).emit("cobrowse_signal_packet", data);
    });

    socket.on("app_cobrowse_packet", (data) => {
      logger.info(
        `${COBROWSE_NAMESPACE_PREFIX} received data app_cobrowse_packet = ${Object.keys(
          data
        )}`
      );
      cobrowseNamespace.to(roomName).emit("app_cobrowse_packet", data);
    });

    socket.on("disconnect", (reason) => {
      logger.warn(
        `${COBROWSE_NAMESPACE_PREFIX} socket got disconnected due to the following reason = ${reason}`
      );
    });
  });
};

export default initializeCobrowseNamespace;

Socket.IO client version: artillery-engine-socketio-v3 - v1.1.3

Client

Artillery .yaml script

config:
  target: "https://xyz.com"
  socketio:
    path: "/websocket"
    transports: ["websocket"]
    query:
      roomKey: "abcd"
  phases:
    - duration: 600
      arrivalRate: 5
  engines:
    socketio-v3: {}

scenarios:
  - name: Load Test Scenario
    engine: socketio-v3
    flow:
      - loop:
        - namespace: "/cobrowse"
          emit:
            channel: "cobrowse_packet"
            data: "world"
        - think: 1
        count: 70

Infrastructure Specifications: There are 2 servers, each with 4 cores and 8 GB of RAM. Both of them run a Socket.IO instance, and one of them also runs Redis. There is an AWS ALB in front of the two WebSocket servers. I am using only WebSockets, not HTTP polling.

Additional context: I also minify my code using webpack. The remaining package version details can be found in the package.json file. You can find my code here: WebSocketServiceExport.zip.

darrachequesne commented 1 year ago

@devashish-gopalani-cognoai could you try without DEBUG=*? Without the admin UI instrumentation?

devashish-gopalani-cognoai commented 1 year ago

@darrachequesne I did remove the admin UI tool, but the issue still persisted. Later on I found out that my artillery script was the problem: all socket connections being created were joining the same room ("abcd"). I could not find a way to dynamically generate rooms in the artillery script, but I applied some logic on the Socket.IO server side and, voila, the issue got resolved!
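
For illustration, one possible way to do this on the server side is to spread connections that send the same roomKey across a fixed number of buckets (a rough sketch; the bucket count and naming scheme are illustrative and not necessarily the exact logic applied here):

const BUCKETS = 50; // illustrative number of synthetic rooms per room key

cobrowseNamespace.on("connection", (socket) => {
  const baseKey = socket.handshake.query.roomKey;
  // Spread sockets that share the same key across several smaller rooms
  // instead of putting every virtual user into one giant room.
  const roomName = `${baseKey}-${Math.floor(Math.random() * BUCKETS)}`;
  socket.join(roomName);
});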

As of now, when the emit rate reaches around 4500, I start seeing failed vusers. I have noticed that CPU usage spikes up to 60-70% at that time. I still have a few questions though:

  1. Is this kind of performance expected? Is it expected that vusers start failing once the emit rate reaches close to 4500?
  2. Is there any way to generate dynamic room names (the value for roomKey in the above script) in artillery?
  3. When many connections join a single room, is the behaviour I was seeing (the JavaScript heap out of memory error) expected?

carera commented 1 year ago

@devashish-gopalani-cognoai make sure to add the cleanupEmptyChildNamespaces flag if you ever start using dynamic namespaces/rooms, otherwise your adapter will leak.

See https://github.com/socketio/socket.io/pull/4602
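
For reference, cleanupEmptyChildNamespaces is a Server option (it shipped with Socket.IO 4.6.0); a minimal sketch:

const io = new Server(3000, {
  // Remove a dynamically created child namespace once its last socket
  // disconnects, so the adapter does not keep empty entries around.
  cleanupEmptyChildNamespaces: true,
});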

Also, I see your script is using socket.io admin-ui, which uses middlewares to reject socket connections. I just found another leak that you might eventually suffer from as well: https://github.com/socketio/socket.io/issues/4772

devashish-gopalani-cognoai commented 1 year ago

@carera I have commented out the admin-ui part, so a memory leak from that shouldn't happen now. Coming to dynamic namespaces, I saw the issue which you shared. It seems to apply only to dynamic namespaces, and I am not creating namespaces dynamically; they are predefined. However, I am creating rooms dynamically. Would this flag help there too?

carera commented 1 year ago

I am not sure how rooms work in the context of the redis adapter. One way to find out, I guess?