spring-cloud / spring-cloud-dataflow

A microservices-based Streaming and Batch data processing in Cloud Foundry and Kubernetes
https://dataflow.spring.io
Apache License 2.0
1.12k stars 583 forks source link

Getting deadlock error while executing a task #5934

Open akshaylad26 opened 2 months ago

akshaylad26 commented 2 months ago

Description: We are getting intermittent deadlock issue while executing tasks. We observed that this usually happens when our execution cleanup job is running in parallel. Below is the stacktrace. We are unable to reproduce this issue in lower environments, but it is frequently happening in client environments. Can you please check? If any details required, please let me know, I will extract it.

Release versions: We are using 2.11.4 release version.

org.springframework.dao.CannotAcquireLockException: PreparedStatementCallback; SQL [INSERT INTO BOOT3_BATCH_JOB_EXECUTION(JOB_EXECUTION_ID, JOB_INSTANCE_ID, START_TIME, END_TIME, STATUS, EXIT_CODE, EXIT_MESSAGE, VERSION, CREATE_TIME, LAST_UPDATED) VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?) ]; Deadlock found when trying to get lock; try restarting transaction at org.springframework.jdbc.support.SQLExceptionSubclassTranslator.doTranslate(SQLExceptionSubclassTranslator.java:78) at org.springframework.jdbc.support.AbstractFallbackSQLExceptionTranslator.translate(AbstractFallbackSQLExceptionTranslator.java:73) at org.springframework.jdbc.core.JdbcTemplate.translateException(JdbcTemplate.java:1572) at org.springframework.jdbc.core.JdbcTemplate.execute(JdbcTemplate.java:667) at org.springframework.jdbc.core.JdbcTemplate.update(JdbcTemplate.java:960) at org.springframework.jdbc.core.JdbcTemplate.update(JdbcTemplate.java:1015) at org.springframework.jdbc.core.JdbcTemplate.update(JdbcTemplate.java:1020) at org.springframework.batch.core.repository.dao.JdbcJobExecutionDao.saveJobExecution(JdbcJobExecutionDao.java:243) at org.springframework.batch.core.repository.support.SimpleJobRepository.createJobExecution(SimpleJobRepository.java:178) at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:77) at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.base/java.lang.reflect.Method.invoke(Method.java:568) at org.springframework.aop.support.AopUtils.invokeJoinpointUsingReflection(AopUtils.java:343) at org.springframework.aop.framework.ReflectiveMethodInvocation.invokeJoinpoint(ReflectiveMethodInvocation.java:196) at org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:163) at org.springframework.transaction.interceptor.TransactionInterceptor$1.proceedWithInvocation(TransactionInterceptor.java:123) at org.springframework.transaction.interceptor.TransactionAspectSupport.invokeWithinTransaction(TransactionAspectSupport.java:391) at org.springframework.transaction.interceptor.TransactionInterceptor.invoke(TransactionInterceptor.java:119) at org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:184) at org.springframework.batch.core.repository.support.AbstractJobRepositoryFactoryBean.lambda$getObject$0(AbstractJobRepositoryFactoryBean.java:204) at org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:184) at org.springframework.aop.framework.JdkDynamicAopProxy.invoke(JdkDynamicAopProxy.java:244) at jdk.proxy2/jdk.proxy2.$Proxy94.createJobExecution(Unknown Source) at org.springframework.batch.core.launch.support.SimpleJobLauncher.run(SimpleJobLauncher.java:145) at org.springframework.batch.core.launch.support.TaskExecutorJobLauncher.run(TaskExecutorJobLauncher.java:59) at org.springframework.cloud.task.batch.handler.TaskJobLauncherApplicationRunner.execute(TaskJobLauncherApplicationRunner.java:147) at org.springframework.boot.autoconfigure.batch.JobLauncherApplicationRunner.executeLocalJobs(JobLauncherApplicationRunner.java:195) at org.springframework.boot.autoconfigure.batch.JobLauncherApplicationRunner.launchJobFromProperties(JobLauncherApplicationRunner.java:175) at org.springframework.cloud.task.batch.handler.TaskJobLauncherApplicationRunner.run(TaskJobLauncherApplicationRunner.java:112) at org.springframework.boot.autoconfigure.batch.JobLauncherApplicationRunner.run(JobLauncherApplicationRunner.java:165) at org.springframework.boot.SpringApplication.lambda$callRunner$4(SpringApplication.java:771) at org.springframework.util.function.ThrowingConsumer$1.acceptWithException(ThrowingConsumer.java:83) at org.springframework.util.function.ThrowingConsumer.accept(ThrowingConsumer.java:60) at org.springframework.util.function.ThrowingConsumer$1.accept(ThrowingConsumer.java:88) at org.springframework.boot.SpringApplication.callRunner(SpringApplication.java:783) at org.springframework.boot.SpringApplication.callRunner(SpringApplication.java:771) at org.springframework.boot.SpringApplication.lambda$callRunners$3(SpringApplication.java:759) at java.base/java.util.stream.ForEachOps$ForEachOp$OfRef.accept(ForEachOps.java:183) at java.base/java.util.stream.SortedOps$SizedRefSortingSink.end(SortedOps.java:357) at java.base/java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:510) at java.base/java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:499) at java.base/java.util.stream.ForEachOps$ForEachOp.evaluateSequential(ForEachOps.java:150) at java.base/java.util.stream.ForEachOps$ForEachOp$OfRef.evaluateSequential(ForEachOps.java:173) at java.base/java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234) at java.base/java.util.stream.ReferencePipeline.forEach(ReferencePipeline.java:596) at org.springframework.boot.SpringApplication.callRunners(SpringApplication.java:759) at org.springframework.boot.SpringApplication.run(SpringApplication.java:332) at org.springframework.boot.SpringApplication.run(SpringApplication.java:1327) at org.springframework.boot.SpringApplication.run(SpringApplication.java:1316) at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:77) at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.base/java.lang.reflect.Method.invoke(Method.java:568) at org.springframework.boot.loader.MainMethodRunner.run(MainMethodRunner.java:49) at org.springframework.boot.loader.Launcher.launch(Launcher.java:95) at org.springframework.boot.loader.Launcher.launch(Launcher.java:58) at org.springframework.boot.loader.JarLauncher.main(JarLauncher.java:65) Caused by: com.mysql.cj.jdbc.exceptions.MySQLTransactionRollbackException: Deadlock found when trying to get lock; try restarting transaction at com.mysql.cj.jdbc.exceptions.SQLError.createSQLException(SQLError.java:124) at com.mysql.cj.jdbc.exceptions.SQLExceptionsMapping.translateException(SQLExceptionsMapping.java:122) at com.mysql.cj.jdbc.ClientPreparedStatement.executeInternal(ClientPreparedStatement.java:916) at com.mysql.cj.jdbc.ClientPreparedStatement.executeUpdateInternal(ClientPreparedStatement.java:1061) at com.mysql.cj.jdbc.ClientPreparedStatement.executeUpdateInternal(ClientPreparedStatement.java:1009) at com.mysql.cj.jdbc.ClientPreparedStatement.executeLargeUpdate(ClientPreparedStatement.java:1320) at com.mysql.cj.jdbc.ClientPreparedStatement.executeUpdate(ClientPreparedStatement.java:994) at com.zaxxer.hikari.pool.ProxyPreparedStatement.executeUpdate(ProxyPreparedStatement.java:61) at com.zaxxer.hikari.pool.HikariProxyPreparedStatement.executeUpdate(HikariProxyPreparedStatement.java) at org.springframework.jdbc.core.JdbcTemplate.lambda$update$2(JdbcTemplate.java:965) at org.springframework.jdbc.core.JdbcTemplate.execute(JdbcTemplate.java:648) ... 55 common frames omitted

corneil commented 2 months ago

@akshaylad26 Is this happening in a Batch Task application that is launched by Spring Cloud Data Flow or using a Schedule?

corneil commented 2 months ago

@akshaylad26 Can you provide more information on your cleanup job?

cppwfs commented 2 months ago

@akshaylad26 Please provide the version of MySql and the platform you are launching the applications (Kubernetes, local, Cloud Foundry). Thanks!

akshaylad26 commented 2 months ago

We are launching using Kubernetes & MySQL version is 8.0.33 Both cleanup job & batch task application are scheduled operations. Cleanup job executes every 6 hours whereas task is scheduled every hour. In Cleanup job we get executions using /tasks/executions endpoint in a paginated way sorted by task execution id in ascending order. Then we use /task/executions/{list of executionIds}?action=REMOVE_DATA,CLEANUP&schemaTarget=boot3 endpoint for cleaning up older executions. We follow this way because normal cleanup API provided by SCDF doesn't clean executions without any start/end date.

Let me know if this helps.

cppwfs commented 2 months ago

What transaction isolation level are you using for your batch application(s)?

akshaylad26 commented 2 months ago

We use default isolation level. We use hikari datasource & dont override it.

cppwfs commented 2 months ago

You may wish to consider setting the transaction isolation level of your apps to something like ISOLATION_REPEATABLE_READ as discussed in Spring Batch's Javadoc.

akshaylad26 commented 2 months ago

sure will give it a try. Thanks!

akshaylad26 commented 2 months ago

Creating new transaction with name [org.springframework.cloud.dataflow.server.service.LauncherInitializationService.initialize]: PROPAGATION_REQUIRED,ISOLATION_DEFAULT

I see this line in SCDF server DEBUG logs as well. Does this also need to be updated to ISOLATION_REPEATABLE_READ?

cppwfs commented 2 months ago

Thank you for your update! For now SCDF needs to keep its current transaction isolation level.

akshaylad26 commented 2 months ago

I just checked in our Database, its default isolation level itself is REPEATABLE_READ.

image

cppwfs commented 2 months ago

Understood. Not sure that necessarily affects the Batch apps in that it sets its own default.

akshaylad26 commented 2 months ago

Hi @cppwfs ,

We tried adding transaction isolation level = ISOLATION_REPEATABLE_READ in our batch application. But today again got the same deadlock issue.

cppwfs commented 2 months ago

Can you share a small sample application that reproduces the issue? Thanks

akshaylad26 commented 1 month ago

Hi @cppwfs ,

We were not able to reproduce issue. However, attaching innodb status file from mysql which shows details about deadlock. deadlock.txt

Hope it helps to investigate further.

akshaylad26 commented 2 weeks ago

DELETE FROM BOOT3_BATCH_JOB_INSTANCE WHERE NOT EXISTS ( SELECT JOB_INSTANCE_ID FROM BOOT3_BATCH_JOB_EXECUTION WHERE BOOT3_BATCH_JOB_INSTANCE.JOB_INSTANCE_ID = JOB_INSTANCE_ID)

I believe this query holds the lock of entire table rows which causes the problem.