Closed PauliusPeciura closed 1 year ago
We are facing the same issue. When the number of steps in a job increases, it leads to OOM, killing the manager JVM. Is there a plan to fix this?
@PauliusPeciura Thank you for reporting this issue and for opening a PR! I would like to be able to reproduce the issue first in order to validate a fix, if any. From your usage of `MessageChannelPartitionHandler`, I understand that this is related to a remote partitioning setup. However, you did not share your job/step configuration. Is a job with a single partitioned step configured with a high number of worker steps enough to reproduce the issue? Do you think the same problem would happen locally with a `TaskExecutorPartitionHandler` (this would be easier to test in comparison to a remote partitioning setup)? I would be grateful if you could share more details on your configuration or provide a minimal example.
@ssanghavi-appdirect Yes. If we can reproduce the issue in a reliable manner, we will plan a fix for one of the upcoming releases.
@benas I am able to reproduce this with `TaskExecutorPartitionHandler` as well. However, the fix provided by @PauliusPeciura is very specific to DB polling and won't fix what I reproduced with `TaskExecutorPartitionHandler`.
Basically, this issue can occur in any code path that holds references to `StepExecution` objects returned by `JobExplorer.getStepExecution`. Similar code exists in `RemoteStepExecutionAggregator.aggregate()` and `MessageChannelPartitionHandler.pollReplies`.
Scenario to reproduce: create a job with more than 900 remote partitions and wait for it to complete. Observe that the manager JVM fails with OOM if -Xmx is set; otherwise, memory consumption keeps increasing.
The issue can be reproduced with both `MessageChannelPartitionHandler` and `TaskExecutorPartitionHandler`. We are able to reproduce it both with DB polling and with a request-reply channel when using `MessageChannelPartitionHandler`.
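A minimal configuration along these lines — all class, bean, and step names here are illustrative, not taken from the attached project — is roughly what such a reproducer looks like with `TaskExecutorPartitionHandler` on Spring Batch 4.x (bean wiring for the worker step and task executor omitted):

```java
// Hypothetical reproducer sketch: a single partitioned step fanning out
// to ~900 worker partitions in one JVM, which is the scenario described above.
@Configuration
@EnableBatchProcessing
public class PartitionJobConfig {

    @Bean
    public Step managerStep(StepBuilderFactory steps, Step workerStep) {
        TaskExecutorPartitionHandler handler = new TaskExecutorPartitionHandler();
        handler.setTaskExecutor(new SimpleAsyncTaskExecutor());
        handler.setStep(workerStep);
        handler.setGridSize(900); // high partition count triggers the memory growth

        return steps.get("managerStep")
                .partitioner("workerStep", gridSize -> {
                    // one empty ExecutionContext per partition
                    Map<String, ExecutionContext> partitions = new HashMap<>();
                    for (int i = 0; i < gridSize; i++) {
                        partitions.put("partition" + i, new ExecutionContext());
                    }
                    return partitions;
                })
                .partitionHandler(handler)
                .build();
    }
}
```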
What is the most convenient way to share code that reproduces the issue?
Attaching a Spring Boot project that can reproduce the issue with `TaskExecutorPartitionHandler`. It requires Maven and Java 11 to run.
Steps to execute the program:
1. Extract the attached project
2. `cd` into the `spring-batch-remoting` directory that is created by step #1
3. `mvn clean install`
4. `java -Xmx250m -jar target/spring-batch-remoting-0.0.1-SNAPSHOT.jar`
Thank you all for your feedback here! This is a valid performance issue. There is definitely no need to load the entire object graph of step executions when polling the status of workers.
Ideally, polling for running workers could be done with a single query, and once they are all done, we should grab shallow copies of step executions with the minimum required to do the aggregation.
I will plan the fix for the upcoming 5.0.1 / 4.3.8.
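A sketch of that idea — a single query to poll for running workers, then loading step executions only once for aggregation. This assumes the standard `BATCH_STEP_EXECUTION` schema and Spring's `JdbcTemplate`; the exact query and the `"<stepName>:partition<N>"` naming pattern are illustrative, not the actual fix:

```java
// Illustrative count-based polling: ask the database how many partition
// executions are still running instead of loading each StepExecution
// object graph on every poll. Table/column names follow the standard
// Spring Batch meta-data schema.
Long running = jdbcTemplate.queryForObject(
        "SELECT COUNT(*) FROM BATCH_STEP_EXECUTION"
      + " WHERE JOB_EXECUTION_ID = ?"
      + "   AND STEP_NAME LIKE ?"
      + "   AND END_TIME IS NULL",
        Long.class, jobExecutionId, stepName + ":partition%");

if (running != null && running == 0) {
    // Only now load the step executions, once, for result aggregation.
    Collection<StepExecution> results =
            jobExplorer.getJobExecution(jobExecutionId).getStepExecutions();
}
```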
@fmbenhassine I'm afraid the issue is still present. I've checked the commit you made, but since it's still working with entities, the associations are still there.
Here's a snapshot from a heap dump I've taken:
And here's the relevant stacktrace where the objects are coming from:
```
Scheduler1_Worker-1
at java.lang.Thread.sleep(J)V (Native Method)
at org.springframework.batch.poller.DirectPoller$DirectPollingFuture.get(JLjava/util/concurrent/TimeUnit;)Ljava/lang/Object; (DirectPoller.java:109)
at org.springframework.batch.poller.DirectPoller$DirectPollingFuture.get()Ljava/lang/Object; (DirectPoller.java:80)
at org.springframework.batch.integration.partition.MessageChannelPartitionHandler.pollReplies(Lorg/springframework/batch/core/StepExecution;Ljava/util/Set;)Ljava/util/Collection; (MessageChannelPartitionHandler.java:288)
at org.springframework.batch.integration.partition.MessageChannelPartitionHandler.handle(Lorg/springframework/batch/core/partition/StepExecutionSplitter;Lorg/springframework/batch/core/StepExecution;)Ljava/util/Collection; (MessageChannelPartitionHandler.java:251)
at org.springframework.batch.core.partition.support.PartitionStep.doExecute(Lorg/springframework/batch/core/StepExecution;)V (PartitionStep.java:106)
at org.springframework.batch.core.step.AbstractStep.execute(Lorg/springframework/batch/core/StepExecution;)V (AbstractStep.java:208)
at org.springframework.batch.core.job.SimpleStepHandler.handleStep(Lorg/springframework/batch/core/Step;Lorg/springframework/batch/core/JobExecution;)Lorg/springframework/batch/core/StepExecution; (SimpleStepHandler.java:152)
at org.springframework.batch.core.job.AbstractJob.handleStep(Lorg/springframework/batch/core/Step;Lorg/springframework/batch/core/JobExecution;)Lorg/springframework/batch/core/StepExecution; (AbstractJob.java:413)
at org.springframework.batch.core.job.SimpleJob.doExecute(Lorg/springframework/batch/core/JobExecution;)V (SimpleJob.java:136)
at org.springframework.batch.core.job.AbstractJob.execute(Lorg/springframework/batch/core/JobExecution;)V (AbstractJob.java:320)
at org.springframework.batch.core.launch.support.SimpleJobLauncher$1.run()V (SimpleJobLauncher.java:149)
at org.springframework.core.task.SyncTaskExecutor.execute(Ljava/lang/Runnable;)V (SyncTaskExecutor.java:50)
at org.springframework.batch.core.launch.support.SimpleJobLauncher.run(Lorg/springframework/batch/core/Job;Lorg/springframework/batch/core/JobParameters;)Lorg/springframework/batch/core/JobExecution; (SimpleJobLauncher.java:140)
...
at org.springframework.scheduling.quartz.QuartzJobBean.execute(Lorg/quartz/JobExecutionContext;)V (QuartzJobBean.java:75)
at org.quartz.core.JobRunShell.run()V (JobRunShell.java:202)
at org.quartz.simpl.SimpleThreadPool$WorkerThread.run()V (SimpleThreadPool.java:573)
```
Note: this specific job can run for hours and processes a lot of data (millions of records). When the number of partitions exceeds 500 (not a hard threshold), the manager slowly accumulates more and more memory. As a mitigation, I've reduced the number of partitions to ~36 and now it doesn't fail. It's probably still consuming more and more memory, but it finishes before it starts to run OOM.
@galovics Thank you for reporting this.
> I'm afraid the issue is still present. I've checked the commit you made but since it's still working with entities, the associations are still there.
We will always work with entities according to the domain model. What we can do is reduce the number of entities loaded in memory to the minimum required. Before https://github.com/spring-projects/spring-batch/commit/93800c6bae679fdafdced45f1c22b2c150e33ed6, the code was loading job executions in a loop for every partitioned step execution, which is obviously not necessary.
In your screenshot, I see you have several `JobExecution` objects with different IDs. Are you running several job instances in the same JVM and sharing the `MessageChannelPartitionHandler` between them?
To correctly address any performance issue, we need to analyse the performance for a single job execution first. So I am expecting to see a single job execution in memory with a partitioned step. Once we ensure that a single partitioned execution is optimized, we can discuss if the packaging/deployment pattern is suitable to run several job executions in the same JVM or not.
Please open a separate issue and provide a minimal example to be sure we are addressing your specific issue and we will dig deeper. Thank you upfront.
@fmbenhassine
> In your screenshot, I see you have several `JobExecution` objects with different IDs. Are you running several job instances in the same JVM and sharing the `MessageChannelPartitionHandler` between them?
That's strange to me too. I re-read the Spring Batch docs on job instances to use the same terminology and understanding, and I can confirm there's a single job instance being run. In fact, it's the textbook example from the Spring Batch docs: a remote-partitioned end-of-day job (close of business (COB), as we refer to it) running once each day.
I can even show you the code, because the project is open source. Here's the whole manager configuration: https://github.com/apache/fineract/blob/dbfedf5cfdffbddfd400f51498c02a88c0551bd1/fineract-provider/src/main/java/org/apache/fineract/cob/loan/LoanCOBManagerConfiguration.java And here's the worker configuration: https://github.com/apache/fineract/blob/dbfedf5cfdffbddfd400f51498c02a88c0551bd1/fineract-provider/src/main/java/org/apache/fineract/cob/loan/LoanCOBWorkerConfiguration.java
Thank you for your feedback.
> I can confirm there's a single job instance being run
In that case, there should really be a single `JobExecution` object in memory. By design, Spring Batch does not allow concurrent job executions of the same job instance. Therefore, if a single job instance is launched within a JVM, there should be a single job execution for that instance running at a time (and consequently, a single `JobExecution` object in memory). That is the setup we need to analyse the performance issue.
As mentioned previously, since this issue has been closed and assigned to a release, please open a separate one with all these details and I will take a look. Thank you upfront.
We have the same problem. I modified PR #3791 so it can be merged to the main branch.
@galovics @pstetsuk If you find the time, it would be interesting to hear whether #4599 improves the situation for you.
@hpoettker Our problem is that we have thousands of steps, and all of them are loaded into memory every time the step result is checked. It leads to OutOfMemory. Your fix doesn't change this behavior and can't resolve the problem. The fix from @galovics doesn't load all the steps but gets the count of incomplete steps from the database. It works much faster and consumes much less memory.
Bug description
We found that memory consumption is fairly high on one of the service nodes that uses Spring Batch. Even though both data nodes did a similar amount of work, the memory consumption across nodes was not even: 15 GB vs 1.5 GB (see the memory usage screenshot).
We have some jobs that run for seconds while others might run for hours, so we set the polling interval (`MessageChannelPartitionHandler#setPollInterval`) to 1 second rather than the default of 10 seconds. In a long-running job scenario, we ended up creating 837 step executions.
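For reference, that poll interval is set on the handler itself; a minimal fragment (surrounding bean wiring omitted):

```java
// Poll every second instead of the 10-second default.
MessageChannelPartitionHandler partitionHandler = new MessageChannelPartitionHandler();
partitionHandler.setPollInterval(1000L);
```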
What I found is that `MessageChannelPartitionHandler#pollReplies` gets a full `StepExecution` representation for each step, which contains a `JobExecution`, which in turn contains a `StepExecution` for each partition. However, they are retrieved at different times and stages. This means we end up with a quadratic number of `StepExecution` objects, e.g. 837 × 837 = 700,569 `StepExecution`s (see screenshot below).
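The quadratic growth is easy to see with a bit of arithmetic (a plain illustration, not Spring Batch code):

```java
public class QuadraticStepExecutions {

    // Each of the n partition StepExecutions is fetched with its own
    // JobExecution, and each of those JobExecutions again holds n
    // StepExecutions, so roughly n * n objects stay reachable.
    static long retained(long partitions) {
        return partitions * partitions;
    }

    public static void main(String[] args) {
        System.out.println(retained(837)); // 700569, matching the heap dump
    }
}
```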
Environment
Initially reproduced on Spring Batch 4.1.4.
Expected behavior
My proposal would be to:
Memory usage graph comparison between two service nodes doing roughly equal amounts of work:
My apologies for the messy screenshot, but it does explain the number of StepExecution objects: