kagkarlsson / db-scheduler

Persistent cluster-friendly scheduler for Java
Apache License 2.0

Pessimistic locking cluster mode #410

Closed adamalexandru4 closed 6 months ago

adamalexandru4 commented 1 year ago

Expected Behavior

The cluster should not hit a pessimistic lock exception in Oracle DB. I have one microservice deployed in K8S with 3 replicas. Did I mess up the transactional handling in some way, or do I need to configure something more?

Current Behavior

Exception ORA-00060: deadlock detected while waiting for resource.

Steps to Reproduce the bug

  1. Configuration app.properties

    db-scheduler.enabled=true
    db-scheduler.heartbeat-interval=30s
    db-scheduler.polling-interval=10s
    db-scheduler.table-name=scheduled_tasks
    db-scheduler.immediate-execution-enabled=false
    db-scheduler.delay-startup-until-context-ready=true
    db-scheduler.threads=10
    db-scheduler.polling-strategy=fetch
    db-scheduler.polling-strategy-lower-limit-fraction-of-threads=0.5
    db-scheduler.polling-strategy-upper-limit-fraction-of-threads=3.0
    db-scheduler.shutdown-max-wait=45s

  2. Running the following task

    @Bean
    public OneTimeTask<DeleteProjectTask.Input> deleteProjectOneTimeTask() {
        return Tasks.oneTime(DELETE_PROJECT_TASK)
                .onFailure(
                        new MaxRetriesAlertFailureHandler<>(
                                10,
                                new FailureHandler.ExponentialBackoffFailureHandler<>(ofSeconds(10), 2)
                        )
                )
                .execute((taskInstance, executionContext) -> {
                    Span newSpan = tracer.nextSpan().name("deleteProject");
                    try (Tracer.SpanInScope ws = tracer.withSpan(newSpan.start())) {
                        deleteProjectTask.execute(taskInstance, executionContext);
                    } finally {
                        newSpan.end();
                    }
                });
    }
  3. Schedule task

    scheduler.schedule(
            DELETE_PROJECT_TASK.instance(
                    aggregateId,
                    new DeleteProjectTask.Input(aggregateId)
            ),
            now()
    );
  4. Task

    @Service
    @Transactional
    @RequiredArgsConstructor
    @Slf4j
    public class DeleteProjectTask implements VoidExecutionHandler<DeleteProjectTask.Input> {

        public static final String TASK_NAME = "delete-project";

        public record Input(String projectId) {}

        public static final TaskWithDataDescriptor<DeleteProjectTask.Input> DELETE_PROJECT_TASK =
                new TaskWithDataDescriptor<>(TASK_NAME, DeleteProjectTask.Input.class);

        final EventService eventService;

        @Override
        public void execute(TaskInstance<Input> taskInstance, ExecutionContext executionContext) {
            String projectId = taskInstance.getData().projectId();
            // .....
        }
    }

Context

Logs

c.g.kagkarlsson.scheduler.Scheduler : Unhandled exception during execution of task with name 'delete-project'. Treating as failure. .......... Caused by: java.sql.SQLException: ORA-00060: deadlock detected while waiting for resource

adamalexandru4 commented 1 year ago

I've also tried the Spring Boot example, using Postgres. I created a basic users table with optimistic locking, and it also failed with the same configuration. I really think that acquiring a lock should be followed by a commit; otherwise you can't achieve a distributed lock and a correct distributed scheduler.

kagkarlsson commented 1 year ago

Do I understand you correctly that you have reproduced this on Postgres as well? If it is only Oracle, and sporadic, it might be something similar to what has been observed for MSSQL (#348)

Btw, not sure why your Task-class is annotated with @Transactional?

adamalexandru4 commented 1 year ago

Yes, on Postgres the error is still present. I don't blame Oracle for this; the only cause I can think of may be the transaction autoCommit property?

I'm using @Transactional because I want all task code to run under the same transaction, managed by the Spring transaction manager rather than handled manually.

kagkarlsson commented 1 year ago

> I really think that acquiring a lock should be followed by a commit; otherwise you can't achieve a distributed lock and a correct distributed scheduler.

Acquiring a lock really means setting picked=true. There is always a commit after that, before the ExecutionHandler is called. I am not sure what is happening in your case. Could some other part of your code be causing the deadlock? Parallel processing locking the same records?
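The pick-then-commit sequence described here can be sketched as a guarded update. This is an illustrative simplification, not db-scheduler's actual statement; the column names are assumptions based on the default `scheduled_tasks` schema:

```sql
-- Each replica tries to "pick" the due execution; the guard on
-- picked/version means at most one replica's UPDATE matches a row.
UPDATE scheduled_tasks
   SET picked = TRUE, picked_by = ?, version = version + 1
 WHERE task_name = ? AND task_instance = ?
   AND picked = FALSE AND version = ?;
-- A COMMIT here, before the ExecutionHandler runs, makes the pick
-- visible to the other replicas.
```

If that commit never happens, other replicas can still see the row as unpicked, which matches the contention the reporter describes.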

adamalexandru4 commented 1 year ago

I understand, but are you sure that this is how it works when you use the Spring transaction manager? I didn't configure anything about autoCommit, but it is enabled by default, right?

kagkarlsson commented 1 year ago

As long as there is a Spring-managed transaction, autocommit settings should not matter. I have not used Spring-managed transactions much myself, but I have tested them and verified that they work (with the boot-starter)

I think you need to debug the matter in more depth. For example, enable full SQL logging and find the relevant missing statements.
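For Postgres, one low-friction way to get full SQL logging is on the database side rather than in the application. These are standard `postgresql.conf` settings, not db-scheduler options:

```properties
# postgresql.conf: log every statement, prefixed with timestamp, pid,
# and transaction id (%x), so you can check whether the pick UPDATE
# is followed by a COMMIT on the same backend
log_statement = 'all'
log_line_prefix = '%m [%p] %x '
```

Grouping the log lines by pid/transaction id should show whether the `picked=true` update is committed before the task body starts.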

kagkarlsson commented 1 year ago

(also not sure if there are nuances when setting @Transactional at the class level)

adamalexandru4 commented 1 year ago

After some digging I've found that the issue comes from JdbcRunner: commitIfNecessary won't commit because the Connection is a HikariProxyConnection wrapping a PgConnection (in my case) which has autoCommit set to true.
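The behavior being described can be illustrated with a JDK-only sketch. The method below is a hypothetical simplification of a commit guard like JdbcRunner's commitIfNecessary (not the real implementation), and the dynamic proxy stands in for a pooled connection such as HikariProxyConnection, so no database is needed:

```java
import java.lang.reflect.Proxy;
import java.sql.Connection;
import java.sql.SQLException;

public class CommitIfNecessaryDemo {

    // Hypothetical simplification of a commit guard: commit only when
    // the connection is NOT in autoCommit mode.
    static boolean commitIfNecessary(Connection c) {
        try {
            if (!c.getAutoCommit()) {
                c.commit();
                return true;   // explicit commit performed
            }
            return false;      // skipped: the driver is assumed to commit each statement
        } catch (SQLException e) {
            throw new IllegalStateException(e);
        }
    }

    // JDK-only stub Connection (stands in for a pooled/proxied connection)
    // that only answers getAutoCommit; all other calls are no-ops.
    static Connection stub(boolean autoCommit) {
        return (Connection) Proxy.newProxyInstance(
                Connection.class.getClassLoader(),
                new Class<?>[] {Connection.class},
                (proxy, method, args) ->
                        method.getName().equals("getAutoCommit") ? autoCommit : null);
    }

    public static void main(String[] args) {
        // autoCommit=true (the reporter's observation): the guard skips the commit.
        System.out.println(commitIfNecessary(stub(true)));
        // autoCommit=false (a managed transaction): the guard commits explicitly.
        System.out.println(commitIfNecessary(stub(false)));
    }
}
```

The point of the sketch: if the pool hands out connections with autoCommit=true while the code expects to commit explicitly, the guard's "skip" branch is taken and no commit statement ever appears in the SQL log.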

kagkarlsson commented 1 year ago

Sorry, I don't see the issue. You might try to create an integration test in, for example, db-scheduler-boot-starter that reproduces the issue.

adamalexandru4 commented 10 months ago

Did you try starting more instances/replicas? Testing in a single-instance environment worked as expected, but when you scale up, things go wrong.

kagkarlsson commented 5 months ago

Did you ever get to the bottom of this?