temporalio / sdk-typescript

Temporal TypeScript SDK

[Bug] Memory leak on the Temporal worker side (OperatorSubscriber class objects are taking much of the memory) #1541

Open jainshivsagar opened 1 week ago

jainshivsagar commented 1 week ago

What are you really trying to do?

We are using TypeScript SDK v1.9.0 for our Temporal worker and are load testing it on a local system. When the worker started, heap utilization was below 120 MB; after load testing it grew to roughly 300 MB. Even 5-10 minutes after the load testing ended, the memory utilization did not come down. Please refer to the screenshots below from the Chrome DevTools memory profiler:

Memory Utilization after starting the worker: [screenshot]

Memory Utilization after load testing: [screenshot]

Comparison of the first two heap snapshots: [screenshot]

Environment/Versions

Worker configuration:

  const worker = await Worker.create({
    connection,
    namespace: config.Temporal.Namespace,
    taskQueue: config.Temporal.TaskQueue,
    workflowsPath: require.resolve('./workflows'),
    activities: getActivities(),
    // maxActivitiesPerSecond: 100,
    // maxTaskQueueActivitiesPerSecond: 100,
    // maxConcurrentActivityTaskExecutions: 2000,
    maxConcurrentWorkflowTaskExecutions: 200,
    maxConcurrentWorkflowTaskPolls: 100,
    // maxConcurrentActivityTaskPolls: 1000,
    // maxCachedWorkflows: 3000,
    maxConcurrentLocalActivityExecutions: 200,
    // workflowThreadPoolSize: 100,
    enableNonLocalActivities: false
  });
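
For observability during these runs, here is a minimal sketch of how the worker's runtime metrics could be exposed for scraping during load tests. This assumes the Prometheus exporter options available in recent 1.x releases; the bind address is arbitrary, and exact metric names depend on the SDK/core version.

  import { Runtime } from '@temporalio/worker';

  // Install the Runtime before creating any Worker so its metrics can be
  // scraped (e.g. by Prometheus) while the load test runs.
  Runtime.install({
    telemetryOptions: {
      metrics: {
        prometheus: { bindAddress: '0.0.0.0:9464' }, // arbitrary example address
      },
    },
  });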
mjameswh commented 1 week ago

Even 5-10 minutes after the load testing ended, the memory utilization did not come down.

What happens if you continue feeding workflows to the worker? Does memory continue to go up, or does it stay stable around some value, e.g. ~300MB?

The Worker caches Workflows in an LRU; a Workflow stays in the cache until it gets evicted either 1) to make room for another workflow that’s coming in, or 2) because processing of a Workflow Task failed. Completion of a Workflow doesn’t result in eviction.

That means that, assuming there are no Workflow Task failures, the sticky cache should quickly grow to its maximum size, and then stay at that size for a very long period of time (i.e. until the pod gets shut down).
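
To make those eviction rules concrete, here is a minimal, illustrative LRU sketch. This is not the SDK's actual cache implementation; it only shows the semantics described above: entries leave the cache to make room for newer ones or when evicted explicitly, never simply because the corresponding Workflow completed.

  // Illustrative only; not the SDK's internal cache implementation.
  class LruCache<K, V> {
    private readonly entries = new Map<K, V>();
    constructor(private readonly capacity: number) {}

    get(key: K): V | undefined {
      const value = this.entries.get(key);
      if (value !== undefined) {
        // Re-insert to mark the entry as most recently used.
        this.entries.delete(key);
        this.entries.set(key, value);
      }
      return value;
    }

    set(key: K, value: V): void {
      this.entries.delete(key);
      this.entries.set(key, value);
      if (this.entries.size > this.capacity) {
        // Evict the least recently used entry: Map preserves insertion
        // order, so the first key is the oldest.
        const oldest = this.entries.keys().next().value as K;
        this.entries.delete(oldest);
      }
    }

    // Explicit eviction, e.g. after a Workflow Task failure.
    evict(key: K): void {
      this.entries.delete(key);
    }
  }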

jainshivsagar commented 1 week ago

Hi @mjameswh, I posted the above data after 3-4 rounds of load testing. After each round of testing, I observed that heap memory utilization was growing incrementally. After the 4th round of testing and waiting 5-10 minutes, the heap memory utilization did not come down.

mjameswh commented 1 day ago

waiting 5-10 minutes, the heap memory utilization did not come down.

As I said before, we do not expect a Worker's memory usage to come down once a Workflow has completed. Completed Workflows may still be queried, so caching them may still be beneficial.

What we'd expect is for memory usage to grow until the cache size reaches its maximum capacity (maxCachedWorkflows), after which memory usage should remain relatively stable, as less recently used Workflows will get evicted from cache.
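
If you want to bound that plateau explicitly, you can set maxCachedWorkflows yourself instead of relying on the default. A sketch based on your configuration; the value of 500 is an arbitrary example, to be chosen based on your Workflows' size and the memory available to the worker:

  const worker = await Worker.create({
    connection,
    namespace: config.Temporal.Namespace,
    taskQueue: config.Temporal.TaskQueue,
    workflowsPath: require.resolve('./workflows'),
    activities: getActivities(),
    maxConcurrentWorkflowTaskExecutions: 200,
    maxConcurrentWorkflowTaskPolls: 100,
    maxConcurrentLocalActivityExecutions: 200,
    enableNonLocalActivities: false,
    // Cap the sticky cache; heap usage should plateau once this many
    // Workflow runs are cached, with older runs evicted as new ones arrive.
    maxCachedWorkflows: 500, // arbitrary example value
  });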

After each round of testing, I observed that heap memory utilization was growing incrementally.

Your screenshot indicates 43,890 instances of OperatorSubscriber. That's certainly a lot, yet it could still be legitimate, depending on the number of cached Workflows; there are multiple OperatorSubscriber instances per cached Workflow and per pending Activity.
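
If you want to check whether that count keeps growing past the point where the cache is full, one option is to capture comparable heap snapshots programmatically between load-test rounds. A minimal sketch using Node's built-in v8 module (the helper name and file naming are just examples):

  import { writeHeapSnapshot } from 'v8';

  // Writes a .heapsnapshot file that can be loaded into Chrome DevTools
  // and compared against snapshots taken after earlier rounds.
  function dumpHeap(label: string): string {
    const file = writeHeapSnapshot(`heap-${label}-${Date.now()}.heapsnapshot`);
    console.log(`Heap snapshot written to ${file}`);
    return file;
  }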