spring-projects / spring-batch

Spring Batch is a framework for writing batch applications using Java and Spring
http://projects.spring.io/spring-batch/
Apache License 2.0

Wrong implementation of noRetry(...) and noSkip(...) in FaultTolerantStepBuilder #1199

Open spring-projects-issues opened 9 years ago

spring-projects-issues commented 9 years ago

Albert Strasser opened BATCH-2403 and commented

In org.springframework.batch.core.step.builder.FaultTolerantStepBuilder.java

This implementation:

public FaultTolerantStepBuilder<I, O> noRetry(Class<? extends Throwable> type) {
    retryableExceptionClasses.put(type, false);
    return this;
}

should be changed to:

public FaultTolerantStepBuilder<I, O> noRetry(Class<? extends Throwable> type) {
    nonRetryableExceptionClasses.add(type);
    return this;
}

A similar problem concerns the noSkip() method in the same class.
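
To make the reported mismatch concrete, here is a minimal standalone sketch. It does not use Spring Batch: `BuilderModel`, `noRetryAsReported`, `noRetryAsProposed`, and `isFatalForHandler` are hypothetical names introduced for illustration. It assumes, as this report describes, that the exception handler consults only the `nonRetryableExceptionClasses` collection.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Simplified model of the two collections discussed above. This is NOT the
// Spring Batch source; the class and method names are hypothetical.
class BuilderModel {

    // Consulted by the retry policy: exception type -> retryable or not.
    final Map<Class<? extends Throwable>, Boolean> retryableExceptionClasses = new HashMap<>();

    // Consulted by the exception handler (assumption based on this report).
    final Set<Class<? extends Throwable>> nonRetryableExceptionClasses = new HashSet<>();

    // Behaviour reported as the current implementation of noRetry(...).
    BuilderModel noRetryAsReported(Class<? extends Throwable> type) {
        retryableExceptionClasses.put(type, false);
        return this;
    }

    // Behaviour proposed in this issue.
    BuilderModel noRetryAsProposed(Class<? extends Throwable> type) {
        nonRetryableExceptionClasses.add(type);
        return this;
    }

    // True if the handler would treat the throwable as fatal (rethrow it).
    boolean isFatalForHandler(Throwable t) {
        for (Class<? extends Throwable> fatal : nonRetryableExceptionClasses) {
            if (fatal.isInstance(t)) {
                return true;
            }
        }
        return false;
    }
}
```

In this model, an exception registered through `noRetryAsReported` never reaches the set the handler checks, so the handler still treats it as non-fatal.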


Affects: 3.0.3, 3.0.4

3 votes, 8 watchers

spring-projects-issues commented 9 years ago

Michael Minella commented

What makes you think this? Is there a use case that is failing for you?

spring-projects-issues commented 9 years ago

Albert Strasser commented

Thanks for responding so fast. Unfortunately yes. I am working in jdbc batch mode with chunks in fault tolerant mode (cause I need it to produce read skips). If the commit of the chunk transaction produces an UncategorizedSQLException, it becomes "swallowed" as it is not one of the hardcoded 11 fatal exceptions and the job goes to COMPLETE without having written anything (very dangerous mode for production!). To be more precise, the exceptionHandler of the RepeatTemplate swallows it.

Then I tried to explicitly turn off retry for the UncategorizedSQLException by using noRetry. No effect. Then I manipulated the nonRetryableExceptionClasses field in the FaultTolerantStepBuilder by using reflection to add the UncategorizedSQLException (ugly workaround), and now it works: the exception during chunk commit makes the job fail.

It may of course be that I didn't understand the concept of the two exception collections (nonRetryableExceptionClasses, retryableExceptionClasses) correctly. But then I really wouldn't know how to tell the step that I don't want retries for UncategorizedSQLExceptions that happen during commit of the chunk transaction.

PS: Registering an exceptionHandler on the StepBuilder did not work either. For some reason the RepeatTemplate is not initialized with my exceptionHandler, but keeps the default handler.

PPS: Another workaround was to annotate the writer method with @Transactional(propagation = REQUIRES_NEW), because then the exception happens within the writer method and makes the job fail. But that's not possible for me because I have to react to read skips within the same transaction as the written records. It seems strange to me, though, that an exception is handled differently depending on whether it is thrown in the writer method or during the chunk transaction commit. I tried to figure that out but the code was a bit too hard for me to follow.

spring-projects-issues commented 9 years ago

Michael Minella commented

Can you share your configuration?

spring-projects-issues commented 9 years ago

Daniel Guggi commented

As Albert already stated,

I also don't see an easy way to configure/modify the "nonSkippableExceptionClasses" and/or "nonRetryableExceptionClasses" properties on FaultTolerantStepBuilder instances. There is afaik no (obvious) way to add elements to those collections. They are only filled via the (private) "addSpecialExceptions" method (maybe we're not supposed to add elements to those collections? Or am I missing something here?)

Upon build() the FaultTolerantStepBuilder's createRetryOperations() method is invoked:

RepeatOperations stepOperations = getStepOperations();
if (stepOperations instanceof RepeatTemplate) {
    SimpleRetryExceptionHandler exceptionHandler = new SimpleRetryExceptionHandler(retryPolicyWrapper,
              getExceptionHandler(), nonRetryableExceptionClasses);
    ((RepeatTemplate) stepOperations).setExceptionHandler(exceptionHandler);
}

The SimpleRetryExceptionHandler instance is initialized with the (hardcoded) non-retryable exception classes, so for our exception its handleException() method ends up in this branch:

//....
else {
    logger.debug("Handled non-fatal exception", throwable);
}
//...

This happens even though the appropriate exception was set on the builder using the "noRetry" method (probably noRetry() should "just" add the given exceptions to the "nonRetryableExceptionClasses" collection, and the same for noSkip()?). This is the step configuration we used (without success):

stepBuilders.get(PROCESSING_STEP_NAME)
        .<ObjectA, ObjectB>chunk(jobProperties().getChunkSize())
        .reader(someReader())
        .processor(someProcessor())
        .writer(someWriter())
        .faultTolerant()
        .skip(Skip1Exception.class)
        .skip(Skip2Exception.class)
        .noRetry(DataAccessException.class)
        .skipLimit(Integer.MAX_VALUE)
        .listener(someListener())
        .listener(anotherListener())
        .build();

I think we could set a custom/explicit RepeatOperations implementation (one that is not an instanceof RepeatTemplate):

.stepOperations(somethingThatDoesNotExtendFromRepeatTemplate())

in order to ensure that the following (default) code is not executed (FaultTolerantStepBuilder.createRetryOperations()):

RepeatOperations stepOperations = getStepOperations();
if (stepOperations instanceof RepeatTemplate) {
    SimpleRetryExceptionHandler exceptionHandler = new SimpleRetryExceptionHandler(retryPolicyWrapper,
              getExceptionHandler(), nonRetryableExceptionClasses);
    ((RepeatTemplate) stepOperations).setExceptionHandler(exceptionHandler);
}

Is this expected behaviour?

Thanks!

spring-projects-issues commented 9 years ago

Albert Strasser commented

Thanks Daniel. You are right about the stepOperations. I experimented with it, but this is not an easy way, because by not extending the RepeatTemplate we lose everything that RepeatTemplate has and does. It is not trivial to manually set everything that the step builders set on the stepOperations during build(). I decided against continuing with this because it seemed very dangerous to remodel Spring Batch code in a customized class. This will probably lead to big issues when upgrading to newer batch versions. There should be another way.

spring-projects-issues commented 9 years ago

Erwin Vervaet commented

I am working in jdbc batch mode with chunks in fault tolerant mode (cause I need it to produce read skips). If the commit of the chunk transaction produces an UncategorizedSQLException, it becomes "swallowed" as it is not one of the hardcoded 11 fatal exceptions and the job goes to COMPLETE without having written anything (very dangerous mode for production!).

We're facing very similar issues on version 2.2.7. In our case the following exception occurs during the transaction commit:

javax.persistence.PersistenceException: org.hibernate.exception.LockAcquisitionException: could not execute statement
    at org.hibernate.ejb.AbstractEntityManagerImpl.convert(AbstractEntityManagerImpl.java:1387) ~[hibernate-entitymanager-4.2.12.Final.jar:4.2.12.Final]
    at org.hibernate.ejb.AbstractEntityManagerImpl.convert(AbstractEntityManagerImpl.java:1310) ~[hibernate-entitymanager-4.2.12.Final.jar:4.2.12.Final]
    at org.hibernate.ejb.AbstractEntityManagerImpl.convert(AbstractEntityManagerImpl.java:1316) ~[hibernate-entitymanager-4.2.12.Final.jar:4.2.12.Final]
    at org.hibernate.ejb.AbstractEntityManagerImpl$CallbackExceptionMapperImpl.mapManagedFlushFailure(AbstractEntityManagerImpl.java:1510) ~[hibernate-entitymanager-4.2.12.Final.jar:4.2.12.Final]
    at org.hibernate.engine.transaction.synchronization.internal.SynchronizationCallbackCoordinatorNonTrackingImpl.beforeCompletion(SynchronizationCallbackCoordinatorNonTrackingImpl.java:110) ~[hibernate-core-4.2.12.Final.jar:4.2.12.Final]
    at org.hibernate.engine.transaction.synchronization.internal.RegisteredSynchronization.beforeCompletion(RegisteredSynchronization.java:53) ~[hibernate-core-4.2.12.Final.jar:4.2.12.Final]
    at bitronix.tm.BitronixTransaction.fireBeforeCompletionEvent(BitronixTransaction.java:478) ~[btm-2.1.2.jar:2.1.2]
    at bitronix.tm.BitronixTransaction.commit(BitronixTransaction.java:193) ~[btm-2.1.2.jar:2.1.2]
    at bitronix.tm.BitronixTransactionManager.commit(BitronixTransactionManager.java:120) ~[btm-2.1.2.jar:2.1.2]
    at org.springframework.transaction.jta.JtaTransactionManager.doCommit(JtaTransactionManager.java:1012) ~[spring-tx-3.2.9.RELEASE.jar:3.2.9.RELEASE]
    at org.springframework.transaction.support.AbstractPlatformTransactionManager.processCommit(AbstractPlatformTransactionManager.java:755) ~[spring-tx-3.2.9.RELEASE.jar:3.2.9.RELEASE]
    at org.springframework.transaction.support.AbstractPlatformTransactionManager.commit(AbstractPlatformTransactionManager.java:724) ~[spring-tx-3.2.9.RELEASE.jar:3.2.9.RELEASE]
    at org.springframework.transaction.support.TransactionTemplate.execute(TransactionTemplate.java:148) ~[spring-tx-3.2.9.RELEASE.jar:3.2.9.RELEASE]
    at org.springframework.batch.core.step.tasklet.TaskletStep$2.doInChunkContext(TaskletStep.java:267) ~[spring-batch-core-2.2.7.RELEASE.jar:na]
    at org.springframework.batch.core.scope.context.StepContextRepeatCallback.doInIteration(StepContextRepeatCallback.java:77) ~[spring-batch-core-2.2.7.RELEASE.jar:na]
    at org.springframework.batch.repeat.support.RepeatTemplate.getNextResult(RepeatTemplate.java:368) [spring-batch-infrastructure-2.2.7.RELEASE.jar:na]
...
Caused by: org.hibernate.exception.LockAcquisitionException: could not execute statement
    at org.hibernate.dialect.MySQLDialect$1.convert(MySQLDialect.java:412) ~[hibernate-core-4.2.12.Final.jar:4.2.12.Final]
...
Caused by: com.mysql.jdbc.exceptions.jdbc4.MySQLTransactionRollbackException: Deadlock found when trying to get lock; try restarting transaction
    at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) ~[na:1.8.0_05]
    at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62) ~[na:1.8.0_05]

The SimpleRetryExceptionHandler incorrectly categorizes this exception as non-fatal since it is not one of the hard-coded nonRetryableExceptionClasses the FaultTolerantStepBuilder sets up:

This is also evidenced by the logging:

2015-08-06 13:53:03.756 DEBUG o.s.b.repeat.support.RepeatTemplate 413 documentsIntegration documents_cuba_RLBTST9.zip - Starting repeat context.
2015-08-06 13:53:03.756 DEBUG o.s.b.repeat.support.RepeatTemplate 413 documentsIntegration documents_cuba_RLBTST9.zip - Repeat operation about to start at count=1
2015-08-06 13:53:03.756 DEBUG o.s.b.c.s.c.StepContextRepeatCallback 413 documentsIntegration documents_cuba_RLBTST9.zip - Preparing chunk execution for StepContext: org.springframework.batch.core.scope.context.StepContext@4411a81
2015-08-06 13:53:03.756 DEBUG o.s.b.c.s.c.StepContextRepeatCallback 413 documentsIntegration documents_cuba_RLBTST9.zip - Chunk execution starting: queue size=0
2015-08-06 13:53:03.758 DEBUG o.s.b.repeat.support.RepeatTemplate 413 documentsIntegration documents_cuba_RLBTST9.zip - Starting repeat context.
2015-08-06 13:53:03.758 DEBUG o.s.b.repeat.support.RepeatTemplate 413 documentsIntegration documents_cuba_RLBTST9.zip - Repeat operation about to start at count=1
2015-08-06 13:53:03.773 DEBUG o.s.batch.core.scope.StepScope 413 documentsIntegration documents_cuba_RLBTST9.zip - Creating object in scope=step, name=scopedTarget.doc.documentMarshaller
2015-08-06 13:53:03.807 DEBUG o.s.batch.core.scope.StepScope 413 documentsIntegration documents_cuba_RLBTST9.zip - Registered destruction callback in scope=step, name=scopedTarget.doc.documentMarshaller
2015-08-06 13:53:03.811 DEBUG o.s.b.repeat.support.RepeatTemplate 413 documentsIntegration documents_cuba_RLBTST9.zip - Repeat is complete according to policy and result value.
2015-08-06 13:53:04.321 DEBUG o.s.b.c.s.i.FaultTolerantChunkProcessor 413 documentsIntegration documents_cuba_RLBTST9.zip - Attempting to write: [items=[net.awl.ecs.edoc.ap.xsd.batches.document.DocumentType@66d0a725], skips=[]]
2015-08-06 13:53:04.321 DEBUG o.s.b.c.s.item.ChunkOrientedTasklet 413 documentsIntegration documents_cuba_RLBTST9.zip - Inputs not busy, ended: false
2015-08-06 13:53:04.321 DEBUG o.s.b.core.step.tasklet.TaskletStep 413 documentsIntegration documents_cuba_RLBTST9.zip - Applying contribution: [StepContribution: read=1, written=1, filtered=0, readSkips=0, writeSkips=0, processSkips=0, exitStatus=EXECUTING]
2015-08-06 13:53:04.322 DEBUG o.s.b.core.step.tasklet.TaskletStep 413 documentsIntegration documents_cuba_RLBTST9.zip - Saving step execution before commit: StepExecution: id=532, version=1, name=doc.doIntegration, status=STARTED, exitStatus=EXECUTING, readCount=1, filterCount=0, writeCount=1 readSkipCount=0, writeSkipCount=0, processSkipCount=0, commitCount=1, rollbackCount=0, exitDescription=
2015-08-06 13:53:04.432 INFO  o.s.b.core.step.tasklet.TaskletStep 413 documentsIntegration documents_cuba_RLBTST9.zip - Commit failed while step execution data was already updated. Reverting to old version.
2015-08-06 13:53:04.432 DEBUG o.s.b.repeat.support.RepeatTemplate 413 documentsIntegration documents_cuba_RLBTST9.zip - Handling exception: javax.persistence.PersistenceException, caused by: javax.persistence.PersistenceException: org.hibernate.exception.LockAcquisitionException: could not execute statement
2015-08-06 13:53:04.440 DEBUG o.s.b.c.s.i.SimpleRetryExceptionHandler 413 documentsIntegration documents_cuba_RLBTST9.zip - Handled non-fatal exception
javax.persistence.PersistenceException: org.hibernate.exception.LockAcquisitionException: could not execute statement
    at org.hibernate.ejb.AbstractEntityManagerImpl.convert(AbstractEntityManagerImpl.java:1387) ~[hibernate-entitymanager-4.2.12.Final.jar:4.2.12.Final]
...
2015-08-06 13:53:04.441 DEBUG o.s.b.repeat.support.RepeatTemplate 413 documentsIntegration documents_cuba_RLBTST9.zip - Repeat operation about to start at count=2
2015-08-06 13:53:04.441 DEBUG o.s.b.c.s.c.StepContextRepeatCallback 413 documentsIntegration documents_cuba_RLBTST9.zip - Preparing chunk execution for StepContext: org.springframework.batch.core.scope.context.StepContext@4411a81
2015-08-06 13:53:04.441 DEBUG o.s.b.c.s.c.StepContextRepeatCallback 413 documentsIntegration documents_cuba_RLBTST9.zip - Chunk execution starting: queue size=0
2015-08-06 13:53:04.503 DEBUG o.s.b.repeat.support.RepeatTemplate 413 documentsIntegration documents_cuba_RLBTST9.zip - Starting repeat context.
2015-08-06 13:53:04.504 DEBUG o.s.b.repeat.support.RepeatTemplate 413 documentsIntegration documents_cuba_RLBTST9.zip - Repeat operation about to start at count=1
2015-08-06 13:53:04.505 DEBUG o.s.b.repeat.support.RepeatTemplate 413 documentsIntegration documents_cuba_RLBTST9.zip - Repeat is complete according to policy and result value.

Consequently, the batch processing does not fail and the batch just continues processing, ultimately finishing in what appears to be a success for what was actually a failure! Very dangerous indeed!
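
The behaviour visible in this log can be modelled in a few lines. The following is a minimal standalone sketch, not Spring Batch code: `SwallowingLoopModel`, `fatalClasses`, and `run` are hypothetical names, and the commit failure is simulated with an `IllegalStateException`. It assumes, as the log suggests, that an exception the handler classifies as non-fatal is only logged and the loop then moves on to the next chunk.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical model of a chunk loop whose exception handler swallows every
// exception that is not in a fixed set of fatal classes.
class SwallowingLoopModel {

    // Exception types the handler treats as fatal (empty by default here).
    final Set<Class<? extends Throwable>> fatalClasses = new HashSet<>();

    // Chunks whose simulated commit succeeded.
    final List<String> committed = new ArrayList<>();

    // Processes the chunks in order. The chunk equal to failingChunk has its
    // commit fail. Returns the final status: "COMPLETED" or "FAILED".
    String run(List<String> chunks, String failingChunk) {
        for (String chunk : chunks) {
            try {
                if (chunk.equals(failingChunk)) {
                    throw new IllegalStateException("Simulated commit failure for " + chunk);
                }
                committed.add(chunk);
            } catch (RuntimeException e) {
                if (isFatal(e)) {
                    return "FAILED";
                }
                // Non-fatal according to the handler: the exception is
                // swallowed and the loop continues with the next chunk.
            }
        }
        return "COMPLETED";
    }

    private boolean isFatal(Throwable t) {
        for (Class<? extends Throwable> fatal : fatalClasses) {
            if (fatal.isInstance(t)) {
                return true;
            }
        }
        return false;
    }
}
```

With an empty `fatalClasses` set the model reports "COMPLETED" although one chunk was never committed; adding the exception type to the set makes the same run end in "FAILED".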

spring-projects-issues commented 9 years ago

Erwin Vervaet commented

Albert Strasser,

I am working in jdbc batch mode with chunks in fault tolerant mode (cause I need it to produce read skips). If the commit of the chunk transaction produces an UncategorizedSQLException, it becomes "swallowed" as it is not one of the hardcoded 11 fatal exceptions and the job goes to COMPLETE without having written anything (very dangerous mode for production!)

I think there are deeper bugs here than just the FaultTolerantStepBuilder issues. I've created a new JIRA ticket: BATCH-2415

spring-projects-issues commented 6 years ago

Mahmoud Ben Hassine commented

Skip/Retry features are applicable to exceptions thrown from the processor or writer. This should be clearly mentioned in the documentation. If I understand correctly, the issue here happens when the commit of the chunk’s transaction fails due to a UncategorizedSQLException (which also might be due to a timeout or some deferred db checks or a lock acquisition exception like shown in the previous comment, etc) but not due to an exception thrown from a processor or writer.

When the commit of the chunk’s transaction fails, there is no retry logic applied (which explains why using noRetry(UncategorizedSQLException.class) has no effect) and the step is expected to fail. But currently it does not fail as expected. Here is a failing test (to be added to ChunkOrientedStepIntegrationTests):

@Test
public void faultTolerantStepShouldFailWhenCommitFails() throws Exception {
    // Given
    StepBuilder stepBuilder = new StepBuilder("step");
    stepBuilder.repository(jobRepository);
    stepBuilder.transactionManager(transactionManager);
    FaultTolerantStepBuilder<String, String> faultTolerantStepBuilder = new FaultTolerantStepBuilder<>(stepBuilder);
    faultTolerantStepBuilder.reader(getReader(new String[] { "a", "b", "c" }));
    faultTolerantStepBuilder.writer(new ItemWriter<String>() {
        @Override
        public void write(List<? extends String> data) throws Exception {
            // Register a synchronization that throws just before the chunk transaction commits
            TransactionSynchronizationManager
                    .registerSynchronization(new TransactionSynchronizationAdapter() {
                        @Override
                        public void beforeCommit(boolean readOnly) {
                            throw new RuntimeException("Simulate commit failure");
                        }
                    });
        }
    });
    step = faultTolerantStepBuilder.build();

    JobParameters jobParameters = new JobParameters(Collections.singletonMap("run.id",
            new JobParameter(getClass().getName() + ".2")));
    JobExecution jobExecution = jobRepository.createJobExecution(job.getName(), jobParameters);
    StepExecution stepExecution = new StepExecution(step.getName(), jobExecution);

    jobRepository.add(stepExecution);

    // When
    step.execute(stepExecution);

    // Then
    assertEquals(BatchStatus.FAILED, stepExecution.getStatus());
}

Note there is no retry/skip configuration in this test and the step is still completing (but should fail). This is because by default, the FaultTolerantStepBuilder considers only nonRetryableExceptionClasses as fatal exceptions.

Even if we change the implementation of noRetry to add the exception to the nonRetryableExceptionClasses (which will make the test pass), I am not supposed (at least in my opinion) to call noRetry(MyFatalException.class) to make the step fail, because as said previously, there is no retry logic applied in the case of commit failure (to deactivate with a noRetry method call). Does this make sense?

spring-projects-issues commented 6 years ago

Erwin Vervaet commented

Mahmoud Ben Hassine,

If I understand correctly, the issue here happens when the commit of the chunk’s transaction fails due to a UncategorizedSQLException (which also might be due to a timeout or some deferred db checks or a lock acquisition exception like shown in the previous comment, etc) but not due to an exception thrown from a processor or writer.

Correct.

Even if we change the implementation of noRetry to add the exception to the nonRetryableExceptionClasses (which will make the test pass), I am not supposed (at least in my opinion) to call noRetry(MyFatalException.class) to make the step fail, because as said previously, there is no retry logic applied in the case of commit failure (to deactivate with a noRetry method call). Does this make sense?

That kind of makes sense to me. However, this of course implies a bit of a problem in Spring Batch: as shown with the UncategorizedSQLException example, it is quite common in a Spring/JPA based app to have exceptions that are only generated when the transaction commits, i.e. after the processor has finished processing. It's clear that they are currently not correctly handled by Spring Batch. Not doing the noRetry fix might of course lead to more elaborate rework to properly handle this case. See also BATCH-2415 which is closely related to this.

spring-projects-issues commented 6 years ago

Mahmoud Ben Hassine commented

However, this of course implies a bit of a problem in Spring Batch: as shown with the UncategorizedSQLException example, it is quite common in a Spring/JPA based app to have exceptions that are only generated when the transaction commits, i.e. after the processor has finished processing. It's clear that they are currently not correctly handled by Spring Batch.

I do confirm (with the failing test). It should be noted that the issue happens only with fault tolerant steps; simple steps fail as expected when the commit fails.

Not doing the noRetry fix might of course lead to more elaborate rework to properly handle this case.

Indeed, even if it would technically fix the issue, I think it is not the correct way to do it. Requiring the user to call noRetry on something that is not retryable by design makes no sense to me. We definitely prefer to do the more elaborate work and fix the issue properly.

Currently, all stack traces provided contain the debug message "Handled non-fatal exception". There should be another way to tell Spring Batch which exception is fatal or not, something like:

stepBuilders.get("step")
        .<ObjectA, ObjectB>chunk(10)
        .reader(someReader())
        .processor(someProcessor())
        .writer(someWriter())
        .faultTolerant()
        .fatal(MyFatalException.class)
        .build();

In a previous version, there was a public method called setFatalExceptionClasses allowing the user to add an exception to be considered as fatal. Fatal semantics were defined as "should cause immediate failure". These semantics were split into non skippable / non retryable for a very good reason, which is to allow finer-grained control over exceptions (What if an exception should be non skippable but retryable? See BATCH-1333).

Probably we can consider keeping the current split of semantics but also add a configurable collection of fatal exceptions. Fatal semantics should be clearly defined and documented. A fatal exception could be defined as:

  1. Non skippable
  2. Non retryable
  3. Any other exception defined by the user as fatal (which might happen at commit time).

Points 1 and 2 are for backward compatibility. Point 3 is what is missing today, I think. This is just a suggestion and we will discuss it internally, but please let us know what you think about it or suggest other ideas.
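
One possible reading of the three points above, sketched as standalone Java. Nothing here is an existing Spring Batch API: `FatalExceptionClassifier` and its `noSkip`, `noRetry`, `fatal`, and `isFatal` methods are hypothetical. The sketch also assumes the three points are combined as a union (an exception is fatal if it matches any of the three sets), which the suggestion does not spell out.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical classifier for the suggested "fatal" semantics. Assumption:
// an exception is fatal if it matches ANY of the three sets below.
class FatalExceptionClassifier {

    private final Set<Class<? extends Throwable>> nonSkippable = new HashSet<>();
    private final Set<Class<? extends Throwable>> nonRetryable = new HashSet<>();
    private final Set<Class<? extends Throwable>> userFatal = new HashSet<>();

    // Point 1: non skippable exceptions.
    FatalExceptionClassifier noSkip(Class<? extends Throwable> type) {
        nonSkippable.add(type);
        return this;
    }

    // Point 2: non retryable exceptions.
    FatalExceptionClassifier noRetry(Class<? extends Throwable> type) {
        nonRetryable.add(type);
        return this;
    }

    // Point 3: exceptions the user declares fatal, e.g. commit-time failures.
    FatalExceptionClassifier fatal(Class<? extends Throwable> type) {
        userFatal.add(type);
        return this;
    }

    // True if the throwable, or one of its superclasses, was registered
    // in any of the three sets.
    boolean isFatal(Throwable t) {
        return matches(nonSkippable, t) || matches(nonRetryable, t) || matches(userFatal, t);
    }

    private static boolean matches(Set<Class<? extends Throwable>> types, Throwable t) {
        for (Class<? extends Throwable> type : types) {
            if (type.isInstance(t)) {
                return true;
            }
        }
        return false;
    }
}
```

A handler built on such a classifier could then fail the step for a commit-time exception registered through `fatal(...)`, without the user having to call `noRetry` on something that is never retried.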

See also BATCH-2415 which is closely related to this.

Yes, thank you very much for the elaborate analysis!! The whole issue is when the commit [B] fails.

spring-projects-issues commented 6 years ago

Erwin Vervaet commented

Having a set of "fatal exception" classes can of course do the trick, but it might be a bit drastic. I would argue that the exceptions that only occur at transaction commit time are often transient, so a retry might be more appropriate. Also, one could say the work done at transaction commit time (i.e. insert flushes and the like) is just delayed work which is "logically" part of the chunk processing. As such, it would make sense for the skip/retry rules to apply here as well.
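
The idea of treating the commit as part of the retried unit of work can be sketched as follows. This is a minimal standalone illustration, not how Spring Batch is implemented: `CommitAwareRetry` and `execute` are hypothetical names, and a real implementation would also have to roll back and restart the transaction between attempts, which is omitted here.

```java
import java.util.function.Supplier;

// Hypothetical sketch: the commit runs inside the retried block, so a
// transient failure at commit time is retried like a failure in the work.
class CommitAwareRetry {

    // Runs work followed by commit. If either throws an exception of the
    // given transient type, both are attempted again, up to maxAttempts
    // times. Any other exception is rethrown immediately.
    static <T> T execute(Supplier<T> work, Runnable commit,
            Class<? extends RuntimeException> transientType, int maxAttempts) {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        RuntimeException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                T result = work.get();
                commit.run();
                return result;
            } catch (RuntimeException e) {
                if (!transientType.isInstance(e)) {
                    throw e;
                }
                // Transient failure: remember it and try again.
                last = e;
            }
        }
        // Attempts exhausted: surface the last transient failure.
        throw last;
    }
}
```

Under these assumptions a deadlock reported at commit time would be retried, and the failure would still surface once the attempts are exhausted instead of being swallowed.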