temporalio / sdk-typescript

Temporal TypeScript SDK

[Bug] Memory leak on the Temporal worker side (OperatorSubscriber class objects are taking much of the memory) #1541

Open jainshivsagar opened 1 week ago

jainshivsagar commented 1 week ago

What are you really trying to do?

We are using TypeScript SDK v1.9.0 for our Temporal worker and are load testing it on a local system. When the worker started, heap utilization was below 120 MB; after load testing it grew to roughly 300 MB. Even 5-10 minutes after the load testing ended, the memory utilization did not come down. Please refer to the screenshots below from the Chrome DevTools memory profiler:

Memory Utilization after starting the worker: [screenshot]

Memory Utilization after load testing: [screenshot]

Comparison of the first two heap snapshots: [screenshot]

Environment/Versions

Worker configuration:

  const worker = await Worker.create({
    connection,
    namespace: config.Temporal.Namespace,
    taskQueue: config.Temporal.TaskQueue,
    workflowsPath: require.resolve('./workflows'),
    activities: getActivities(),
    // maxActivitiesPerSecond: 100,
    // maxTaskQueueActivitiesPerSecond: 100,
    // maxConcurrentActivityTaskExecutions: 2000,
    maxConcurrentWorkflowTaskExecutions: 200,
    maxConcurrentWorkflowTaskPolls: 100,
    // maxConcurrentActivityTaskPolls: 1000,
    // maxCachedWorkflows: 3000,
    maxConcurrentLocalActivityExecutions: 200,
    // workflowThreadPoolSize: 100,
    enableNonLocalActivities: false
  });
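
For observability during these runs, here is a minimal sketch of how the worker's runtime metrics could be exposed for scraping during load tests. This assumes the Prometheus exporter options available in recent 1.x releases; the bind address is arbitrary, and exact metric names depend on the SDK/core version.

  import { Runtime } from '@temporalio/worker';

  // Install the Runtime before creating any Worker so its metrics can be
  // scraped (e.g. by Prometheus) while the load test runs.
  Runtime.install({
    telemetryOptions: {
      metrics: {
        prometheus: { bindAddress: '0.0.0.0:9464' }, // arbitrary example address
      },
    },
  });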
mjameswh commented 1 week ago

Even 5-10 minutes after the load testing ended, the memory utilization did not come down.

What happens if you continue feeding workflows to the worker? Does memory continue to go up, or does it stay stable around some value, e.g. ~300MB?

The Worker caches Workflows in an LRU; a Workflow stays in the cache until it gets evicted either 1) to make room for another workflow that’s coming in, or 2) because processing of a Workflow Task failed. Completion of a Workflow doesn’t result in eviction.

That means that, assuming there are no Workflow Task failures, the sticky cache should quickly grow to its maximum size, and then stay at that size for a very long period of time (i.e. until the pod gets shut down).
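
To make those eviction rules concrete, here is a minimal, illustrative LRU sketch. This is not the SDK's actual cache implementation; it only shows the semantics described above: entries leave the cache to make room for newer ones or when evicted explicitly, never simply because the corresponding Workflow completed.

  // Illustrative only; not the SDK's internal cache implementation.
  class LruCache<K, V> {
    private readonly entries = new Map<K, V>();
    constructor(private readonly capacity: number) {}

    get(key: K): V | undefined {
      const value = this.entries.get(key);
      if (value !== undefined) {
        // Re-insert to mark the entry as most recently used.
        this.entries.delete(key);
        this.entries.set(key, value);
      }
      return value;
    }

    set(key: K, value: V): void {
      this.entries.delete(key);
      this.entries.set(key, value);
      if (this.entries.size > this.capacity) {
        // Evict the least recently used entry: Map preserves insertion
        // order, so the first key is the oldest.
        const oldest = this.entries.keys().next().value as K;
        this.entries.delete(oldest);
      }
    }

    // Explicit eviction, e.g. after a Workflow Task failure.
    evict(key: K): void {
      this.entries.delete(key);
    }
  }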

jainshivsagar commented 1 week ago

Hi @mjameswh, I posted the above data after 3-4 rounds of load testing. After each round of testing, I observed that heap memory utilization was growing incrementally. After the 4th round of testing and waiting 5-10 minutes, the heap memory utilization did not come down.

mjameswh commented 1 day ago

waiting 5-10 minutes, the heap memory utilization did not come down.

As I said before, we do not expect a Worker's memory usage to come down once a Workflow has completed. Completed Workflows may still be queried, so caching them may still be beneficial.

What we'd expect is for memory usage to grow until the cache size reaches its maximum capacity (maxCachedWorkflows), after which memory usage should remain relatively stable, as less recently used Workflows will get evicted from cache.
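
If you want to bound that plateau explicitly, you can set maxCachedWorkflows yourself instead of relying on the default. A sketch based on your configuration; the value of 500 is an arbitrary example, to be chosen based on your Workflows' size and the memory available to the worker:

  const worker = await Worker.create({
    connection,
    namespace: config.Temporal.Namespace,
    taskQueue: config.Temporal.TaskQueue,
    workflowsPath: require.resolve('./workflows'),
    activities: getActivities(),
    maxConcurrentWorkflowTaskExecutions: 200,
    maxConcurrentWorkflowTaskPolls: 100,
    maxConcurrentLocalActivityExecutions: 200,
    enableNonLocalActivities: false,
    // Cap the sticky cache; heap usage should plateau once this many
    // Workflow runs are cached, with older runs evicted as new ones arrive.
    maxCachedWorkflows: 500, // arbitrary example value
  });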

After each round of testing, I observed that heap memory utilization was growing incrementally.

Your screenshot indicates 43,890 instances of OperatorSubscriber. That's certainly a lot, yet it could still be legitimate, depending on the number of cached Workflows; there are multiple OperatorSubscriber instances per cached Workflow and per pending Activity.
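
If you want to check whether that count keeps growing past the point where the cache is full, one option is to capture comparable heap snapshots programmatically between load-test rounds. A minimal sketch using Node's built-in v8 module (the helper name and file naming are just examples):

  import { writeHeapSnapshot } from 'v8';

  // Writes a .heapsnapshot file that can be loaded into Chrome DevTools
  // and compared against snapshots taken after earlier rounds.
  function dumpHeap(label: string): string {
    const file = writeHeapSnapshot(`heap-${label}-${Date.now()}.heapsnapshot`);
    console.log(`Heap snapshot written to ${file}`);
    return file;
  }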