On retry with rollback, we only want to process the most recent chunk with item-size = 1, even with custom checkpoint algorithm

Originally opened as bug 6457 by ScottKurz

--------------Original Comment History---------------------------- Comment from = ScottKurz on 2014-10-07 11:09:27 +0000

At the end of Sec. 11.9 in "Rollback Procedure"

...

<->ItemWriter.open // thread A, pass last committed checkpoint info
<->ItemReader.open // thread A, pass last committed checkpoint info ... it clearly says to call open with the last committed checkpoint (strikes me as odd that it would say to call the writer open first.

The RI doesn't do this.

Since the TCK is weak in this area let's also look over carefully the RI code with the text in "8.2.1.4.4" in mind:

When a retryable exception occurs, the default behavior is for the batch runtime to rollback the current chunk and re-process it with an item-count of 1 and a checkpoint policy of item. If the optional ChunkListener is configured on the step, the onError method is called before rollback.

Comment from = ScottKurz on 2014-10-07 14:33:37 +0000

Sorry... the problem as I'd originally written it up is not a problem. I got confused unwrapping a pile of exceptions.. my mistake.

The RI does indeed call open with the last checkpoint on retry-with-rollback.

One thing we're doing wrong though: we should only be processing the last chunk of data, in one at a time in one-item chunks. Instead we go and process the rest of the step in the one-item chunk mode.

It also looks like, from code inspection, that we fail to implement this behavior in the case of a custom checkpoint algorithm. I read 8.2.1.4.4 to imply that we should use the single-item chunk even if there had been a custom checkpoint algorithm. (This implies that we count the previous "working chunk size".. I don't think anyone would suggest using another method to keep track of when we've reprocessed the previous chunk.)

So I renamed the bug accordingly to capture this problem.

Comment from = cf126330 on 2014-10-07 21:38:32 +0000

I just checked what we did in JBeret chunk processing: during a retry, it doesn't use or modify itemCount value, since the retry is by single item. After the retry, it's back to normal phase and the original itemCount or checkpoint policy resumes.

Comment from = ScottKurz on 2014-10-23 16:26:00 +0000

So, getting back to this thread on retry-with-rollback.

A tricky point is how do we know when we're done processing the "original chunk"? I.e. when should we stop processing the data in single-item chunks (one-at-a-time), and resume using the normal chunk boundary.

The custom CheckpointAlgorithm complicates things. How do we know when the original chunk would have been checkpointed by the CheckpointAlgorithm?

Here's how I'd naturally propose to solve this problem, (given that the RI has so far sidestepped this):

If you are using item-count to checkpoint, then, on retry-with-rollback, we read-process-write (item-count)# of items one-at-a-time before reverting to normal (item-count)#-sized chunks.

If you are using CheckpointAlgorithm, we use the number of items read up until the point the retryable exception occurred. I.e. we read-process-write #read one-at-a-time before reverting to CheckpointAlgorithm-defined chunks.

How does that sound? What are the other implementations doing?

Cheng, I don't think your response went into that level of detail, or if you meant it to I missed it. (Thanks).

If this sounds good I could take a stab at writing this up more formally.

Comment from = cf126330 on 2014-10-23 17:08:13 +0000

org.jberet.runtime.runner.ChunkRunner has a couple of internal fields to track progress:

readPosition: Current position in the reader input data

checkpointPosition: the last checkpoint position in the reader intput data

failurePoint: Where the failure occurred that caused the current retry. The retry should stop after the item at failurePoint has been retried.

These 3 fields are synchronized with each other at various points. During a retry, readPosition rewinds to the last checkpointPosition and increments from there until it reaches the last failurePoint. After that point, all items are back to normal treatment. So whether we use item-count or custom checkpointAlgorithm, we only retry failed items. I think that's the basic retry flow in JBeret.

In your proposal, if we read-process-write (item-count)# of items one-at-a-time, that may result in many single-item checkpointing, especially if it fails early during a chunk (e.g., failed at 3rd item when item-count is 100).

Comment from = ScottKurz on 2014-10-30 20:57:12 +0000

Thanks for your explanation Cheng.

Agree your implementation is better. I just committed something similar in:

https://github.com/WASdev/standards.jsr352.jbatch/commit/1679181c55a5267a15feade1bba35fffd4d401c4 (this fixed a few other bugs too).

Doing so made me give up on the thought that we should amend the spec any here. Let's call it a day and let this stand as an RI bug.

There are still a couple small ambiguities... e.g. what happens if one of my items gets filtered (by the processor).. do I read another or immediately commit the tran (the MR update clarifies no writeItems is called if all items have been filtered)? I'd entertained the spec update idea thinking a change was relatively unlikely to break any app behavior.. but I don't want to get into a whole set of things..

So a retry-with-rollback exception on read and process only retries up until the last-read item 1-by-1 before reverting to normal chunk processing, though a retry-with-rollback exception on write retries the whole chunk.

Comment from = ScottKurz on 2015-01-05 19:31:41 +0000

Consider if something like Roberto's suggestion would be helpful:

"After the faulty chunk is retried, the implementation should/must return to the configured checkpoint policy. "

See: https://java.net/projects/jbatch/lists/public/archive/2014-12/message/13

Assigning to SPEC.

Note: The link above is bad, but we had it archived. Here it is:

Note from Roberto Cortez 21 Dec 2014

Hi everyone, I've been working and testing the behaviours of Batch implementations because ofÂ https://github.com/javaee-samples/javaee7-samples/issues/281Â (related with skip and retry exceptions). We're observing different behaviours with Glassfish (JBatch) and Wildfly (JBeret). I believe that we need a few Spec clarifications in order to fix the issues. The Spec states:Â Â 8.2.1.4.4 Default Retry Behavior - Rollback When a retryable exception occurs, the default behavior is for the batch runtime to rollback the currentchunk and re-process it with an item-count of 1 and a checkpoint policy of item.Â
It's not very clear what should happen with subsequent chunks. Current GF 4.1 just process the rest of the step in a 1 by 1 element in case of a retry. Jberet just process the current chunk 1 by 1. I did find an issue about this in JBatch:Â https://java.net/bugzilla/show_bug.cgi?id=6457. JBatch will follow ; the behaviour of JBeret, but probably the Spec text could use an improvement / clarification on this. Something like: When a retryable exception occurs, the default behavior is for the batch runtime to rollback the current chunk and re-process it with an item-count of 1 and a checkpoint policy of item. After the faulty chunk is retried, the implementation should/must return to the configured checkpoint policy.Â About Skip:Â

Skipping on Reader and Processor it's not a problem since they handle one item at a time, but Writer handle a list of items. Should all the items be skipped? Or somehow try to detect the faulty element and skip only that one? Example: Read Item 1, 2, 3Process Item 1, 2, 3Write Item 1, 2, 3 (faulty element) You already handed items 1, 2 and 3 to the Writer. To know which element is failing you need to retry with an item count of one. This works fine with Retryable Exceptions, but what if you only have Skippable Exceptions configured? I couldn't find in the Spec what should be the behaviour about this with only Skippable Exceptions and Writers. I guess that everything in the Writer should be skipped and if you want the proper behaviour, just configure the same Exception for Retry and Skip. Maybe we should include a clarification about this as well in the SPEC? Sorry for the long email. Cheers,Roberto

jakartaee / batch