Closed PauliusPeciura closed 1 year ago
We are facing the same issue. When the number of steps in a job increases, it leads to OOM, killing the manager JVM. Is there a plan to fix this?
@PauliusPeciura Thank you for reporting this issue and for opening a PR! I would like to be able to reproduce the issue first in order to validate a fix, if any. From your usage of `MessageChannelPartitionHandler`, I understand that this is related to a remote partitioning setup. However, you did not share your job/step configuration. Is a job with a single partitioned step configured with a high number of worker steps enough to reproduce the issue? Do you think the same problem would happen locally with a `TaskExecutorPartitionHandler` (this would be easier to test in comparison to a remote partitioning setup)? I would be grateful if you could share more details on your configuration or provide a minimal example.
@ssanghavi-appdirect Yes. If we can reproduce the issue in a reliable manner, we will plan a fix for one of the upcoming releases.
@benas I am able to reproduce this with `TaskExecutorPartitionHandler` as well. However, the fix provided by @PauliusPeciura is very specific to DB polling and won't fix what I reproduced with `TaskExecutorPartitionHandler`.
Basically, this issue can occur in any code path that holds references to `StepExecution` objects returned by `JobExplorer.getStepExecution`. Similar code exists in `RemoteStepExecutionAggregator.aggregate()` and `MessageChannelPartitionHandler.pollReplies`.
Scenario to reproduce: create a job with more than 900 remote partitions and wait for it to complete. Observe that the manager JVM fails with OOM if -Xmx is set; otherwise, memory consumption keeps increasing.
The issue can be reproduced with both `MessageChannelPartitionHandler` and `TaskExecutorPartitionHandler`. We are able to reproduce it both with DB polling and with a request-reply channel when using `MessageChannelPartitionHandler`.
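A minimal configuration along these lines — all class, bean, and step names here are illustrative, not taken from the attached project — is roughly what such a reproducer looks like with `TaskExecutorPartitionHandler` on Spring Batch 4.x (bean wiring for the worker step and task executor omitted):

```java
// Hypothetical reproducer sketch: a single partitioned step fanning out
// to ~900 worker partitions in one JVM, which is the scenario described above.
@Configuration
@EnableBatchProcessing
public class PartitionJobConfig {

    @Bean
    public Step managerStep(StepBuilderFactory steps, Step workerStep) {
        TaskExecutorPartitionHandler handler = new TaskExecutorPartitionHandler();
        handler.setTaskExecutor(new SimpleAsyncTaskExecutor());
        handler.setStep(workerStep);
        handler.setGridSize(900); // high partition count triggers the memory growth

        return steps.get("managerStep")
                .partitioner("workerStep", gridSize -> {
                    // one empty ExecutionContext per partition
                    Map<String, ExecutionContext> partitions = new HashMap<>();
                    for (int i = 0; i < gridSize; i++) {
                        partitions.put("partition" + i, new ExecutionContext());
                    }
                    return partitions;
                })
                .partitionHandler(handler)
                .build();
    }
}
```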
What is the most convenient way to share code that reproduces the issue?
Attaching a Spring Boot project that can reproduce the issue with `TaskExecutorPartitionHandler`. It requires Maven and Java 11 to run.
Steps to execute the program:
1. Extract the attached project
2. `cd` into the `spring-batch-remoting` directory that is created by step #1
3. `mvn clean install`
4. `java -Xmx250m -jar target/spring-batch-remoting-0.0.1-SNAPSHOT.jar`
Thank you all for your feedback here! This is a valid performance issue. There is definitely no need to load the entire object graph of step executions when polling the status of workers.
Ideally, polling for running workers could be done with a single query, and once they are all done, we should grab shallow copies of step executions with the minimum required to do the aggregation.
I will plan the fix for the upcoming 5.0.1 / 4.3.8.
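A sketch of that idea — a single query to poll for running workers, then loading step executions only once for aggregation. This assumes the standard `BATCH_STEP_EXECUTION` schema and Spring's `JdbcTemplate`; the exact query and the `"<stepName>:partition<N>"` naming pattern are illustrative, not the actual fix:

```java
// Illustrative count-based polling: ask the database how many partition
// executions are still running instead of loading each StepExecution
// object graph on every poll. Table/column names follow the standard
// Spring Batch meta-data schema.
Long running = jdbcTemplate.queryForObject(
        "SELECT COUNT(*) FROM BATCH_STEP_EXECUTION"
      + " WHERE JOB_EXECUTION_ID = ?"
      + "   AND STEP_NAME LIKE ?"
      + "   AND END_TIME IS NULL",
        Long.class, jobExecutionId, stepName + ":partition%");

if (running != null && running == 0) {
    // Only now load the step executions, once, for result aggregation.
    Collection<StepExecution> results =
            jobExplorer.getJobExecution(jobExecutionId).getStepExecutions();
}
```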
@fmbenhassine I'm afraid the issue is still present. I've checked the commit you made, but since it's still working with entities, the associations are still there.
Here's a snapshot from a heap dump I've taken:
And here's the relevant stacktrace where the objects are coming from:
```
Scheduler1_Worker-1
at java.lang.Thread.sleep(J)V (Native Method)
at org.springframework.batch.poller.DirectPoller$DirectPollingFuture.get(JLjava/util/concurrent/TimeUnit;)Ljava/lang/Object; (DirectPoller.java:109)
at org.springframework.batch.poller.DirectPoller$DirectPollingFuture.get()Ljava/lang/Object; (DirectPoller.java:80)
at org.springframework.batch.integration.partition.MessageChannelPartitionHandler.pollReplies(Lorg/springframework/batch/core/StepExecution;Ljava/util/Set;)Ljava/util/Collection; (MessageChannelPartitionHandler.java:288)
at org.springframework.batch.integration.partition.MessageChannelPartitionHandler.handle(Lorg/springframework/batch/core/partition/StepExecutionSplitter;Lorg/springframework/batch/core/StepExecution;)Ljava/util/Collection; (MessageChannelPartitionHandler.java:251)
at org.springframework.batch.core.partition.support.PartitionStep.doExecute(Lorg/springframework/batch/core/StepExecution;)V (PartitionStep.java:106)
at org.springframework.batch.core.step.AbstractStep.execute(Lorg/springframework/batch/core/StepExecution;)V (AbstractStep.java:208)
at org.springframework.batch.core.job.SimpleStepHandler.handleStep(Lorg/springframework/batch/core/Step;Lorg/springframework/batch/core/JobExecution;)Lorg/springframework/batch/core/StepExecution; (SimpleStepHandler.java:152)
at org.springframework.batch.core.job.AbstractJob.handleStep(Lorg/springframework/batch/core/Step;Lorg/springframework/batch/core/JobExecution;)Lorg/springframework/batch/core/StepExecution; (AbstractJob.java:413)
at org.springframework.batch.core.job.SimpleJob.doExecute(Lorg/springframework/batch/core/JobExecution;)V (SimpleJob.java:136)
at org.springframework.batch.core.job.AbstractJob.execute(Lorg/springframework/batch/core/JobExecution;)V (AbstractJob.java:320)
at org.springframework.batch.core.launch.support.SimpleJobLauncher$1.run()V (SimpleJobLauncher.java:149)
at org.springframework.core.task.SyncTaskExecutor.execute(Ljava/lang/Runnable;)V (SyncTaskExecutor.java:50)
at org.springframework.batch.core.launch.support.SimpleJobLauncher.run(Lorg/springframework/batch/core/Job;Lorg/springframework/batch/core/JobParameters;)Lorg/springframework/batch/core/JobExecution; (SimpleJobLauncher.java:140)
...
at org.springframework.scheduling.quartz.QuartzJobBean.execute(Lorg/quartz/JobExecutionContext;)V (QuartzJobBean.java:75)
at org.quartz.core.JobRunShell.run()V (JobRunShell.java:202)
at org.quartz.simpl.SimpleThreadPool$WorkerThread.run()V (SimpleThreadPool.java:573)
```
Note: this specific job can run for hours and processes a lot of data (millions of records). When the number of partitions exceeds 500 (not a hard threshold), the manager slowly accumulates more and more memory. As a mitigation, I've reduced the number of partitions to ~36 and now it doesn't fail. It's probably still consuming more and more memory, but it finishes before it starts to run OOM.
@galovics Thank you for reporting this.
> I'm afraid the issue is still present. I've checked the commit you made but since it's still working with entities, the associations are still there.
We will always work with entities according to the domain model. What we can do is reduce the number of entities loaded in memory to the minimum required. Before https://github.com/spring-projects/spring-batch/commit/93800c6bae679fdafdced45f1c22b2c150e33ed6, the code was loading job executions in a loop for every partitioned step execution, which is obviously not necessary.
In your screenshot, I see you have several `JobExecution` objects with different IDs. Are you running several job instances in the same JVM and sharing the `MessageChannelPartitionHandler` between them?
To correctly address any performance issue, we need to analyse the performance for a single job execution first. So I am expecting to see a single job execution in memory with a partitioned step. Once we ensure that a single partitioned execution is optimized, we can discuss if the packaging/deployment pattern is suitable to run several job executions in the same JVM or not.
Please open a separate issue and provide a minimal example to be sure we are addressing your specific issue and we will dig deeper. Thank you upfront.
@fmbenhassine
> In your screenshot, I see you have several `JobExecution` objects with different IDs. Are you running several job instances in the same JVM and sharing the `MessageChannelPartitionHandler` between them?
That's strange to me too. I re-read the Spring Batch docs on job instances to use the same terminology and understanding, and I can confirm there's a single job instance being run. In fact, it's the textbook example from the Spring Batch docs: a remote-partitioned end-of-day job (close of business (COB), as we refer to it) running once each day.
I can even show you the code, because the project is open source. Here's the whole manager configuration: https://github.com/apache/fineract/blob/dbfedf5cfdffbddfd400f51498c02a88c0551bd1/fineract-provider/src/main/java/org/apache/fineract/cob/loan/LoanCOBManagerConfiguration.java And here's the worker configuration: https://github.com/apache/fineract/blob/dbfedf5cfdffbddfd400f51498c02a88c0551bd1/fineract-provider/src/main/java/org/apache/fineract/cob/loan/LoanCOBWorkerConfiguration.java
Thank you for your feedback.
> I can confirm there's a single job instance being run
In that case, there should really be a single `JobExecution` object in memory. By design, Spring Batch does not allow concurrent job executions of the same job instance. Therefore, if a single job instance is launched within a JVM, there should be a single job execution for that instance running at a time (and consequently, a single `JobExecution` object in memory). That is the setup we need to analyse the performance issue.
As mentioned previously, since this issue has been closed and assigned to a release, please open a separate one with all these details and I will take a look. Thank you upfront.
We have the same problem. I modified PR #3791 so it can be merged to the main branch.
@galovics @pstetsuk If you find the time, it would be interesting to hear whether #4599 improves the situation for you.
@hpoettker Our problem is that we have thousands of steps, and all of them are loaded into memory every time the step result is checked. It leads to OutOfMemory. Your fix doesn't change this behavior and can't resolve the problem. The fix from @galovics doesn't load all the steps but gets the count of incomplete steps from the database. It works much faster and consumes much less memory.
Bug description
We found that memory consumption is fairly high on one of the service nodes that uses Spring Batch. Even though both data nodes did a similar amount of work, the memory consumption across nodes was not even: 15 GB vs 1.5 GB (see the memory usage screenshot).
We have some jobs that run for seconds while others might run for hours, so we set the polling interval (`MessageChannelPartitionHandler#setPollInterval`) to 1 second rather than the default of 10 seconds. In a long-running job scenario, we ended up creating 837 step executions.
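For reference, that poll interval is set on the handler itself; a minimal fragment (surrounding bean wiring omitted):

```java
// Poll every second instead of the 10-second default.
MessageChannelPartitionHandler partitionHandler = new MessageChannelPartitionHandler();
partitionHandler.setPollInterval(1000L);
```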
What I found is that `MessageChannelPartitionHandler#pollReplies` gets a full `StepExecution` representation for each step, which contains a `JobExecution`, which in turn contains a `StepExecution` for each partition. However, they are retrieved at different times and stages. This means we end up with a quadratic number of `StepExecution` objects, e.g. 837 × 837 = 700,569 `StepExecution`s (see screenshot below).
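The quadratic growth is easy to see with a bit of arithmetic (a plain illustration, not Spring Batch code):

```java
public class QuadraticStepExecutions {

    // Each of the n partition StepExecutions is fetched with its own
    // JobExecution, and each of those JobExecutions again holds n
    // StepExecutions, so roughly n * n objects stay reachable.
    static long retained(long partitions) {
        return partitions * partitions;
    }

    public static void main(String[] args) {
        System.out.println(retained(837)); // 700569, matching the heap dump
    }
}
```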
Environment
Initially reproduced on Spring Batch 4.1.4.
Expected behavior
My proposal would be to:
Memory usage graph comparison between two service nodes doing roughly equal amounts of work:
My apologies for the messy screenshot, but it does explain the number of StepExecution objects: