spring-projects / spring-batch

Spring Batch is a framework for writing batch applications using Java and Spring
http://projects.spring.io/spring-batch/
Apache License 2.0

Flush reply queue before starting remote job step. [BATCH-2652] #951

Open spring-projects-issues opened 6 years ago

spring-projects-issues commented 6 years ago

Wim Veldhuis opened BATCH-2652 and commented

Currently the ChunkMessageChannelItemWriter does not clear the reply queue before it starts executing. As a result, when pending results from a previous failed run are still on the queue, the new run fails immediately.

The ChunkMessageChannelItemWriter assumes it is the only instance executing on its queues, yet it does not clear pending requests or replies left over from a previous failed or aborted run.

The job should either fail before the step sends out its first chunk, OR it should clear the queues before starting work.

The reason for asking is that we ran into a bug in the Spring implementation that always left replies on the queue. As a result, after the first (failed) run, subsequent runs could no longer be started; the only way out was to flush the queue manually. In our case we check that a second job is not started while the first is still running, so pending messages on the reply queue are always obsolete and can safely be cleared. Similarly, the request queue could also be cleared.

The error message we got showed there were mixed-up replies, but it did not make clear that those were from a previous failed run.


No further details from BATCH-2652

spring-projects-issues commented 6 years ago

Michael Minella commented

With remote chunking we cannot just blindly clear the queues since that would result in data loss. This is really a business decision that would require human intervention unless I'm missing something.

spring-projects-issues commented 6 years ago

Wim Veldhuis commented

I understand that it is a business decision, but the current implementation does not provide a way to specify that decision.

I would propose a flag that allows for at least two scenarios:

  1. If messages are detected on the reply queue (maybe also the request queue?) the job should fail immediately.

  2. If messages are detected on the reply queue (maybe also the request queue?) the job should remove these before starting.
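Neither behavior exists in Spring Batch today, so purely as a sketch of the idea (the `PendingReplyPolicy` enum and `handlePendingReplies` method are hypothetical names, and a plain `BlockingQueue` stands in for the reply channel), the two scenarios could look like:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

public class PendingReplyPolicyDemo {

    // Hypothetical flag: what to do when stale messages are found before the step starts.
    public enum PendingReplyPolicy { FAIL, DRAIN }

    /**
     * Runs before the first chunk is sent. Under FAIL, any pending reply
     * aborts the job immediately; under DRAIN, pending replies are removed
     * and returned (e.g. for logging) so the run starts with empty queues.
     */
    public static List<String> handlePendingReplies(BlockingQueue<String> replyQueue,
                                                    PendingReplyPolicy policy) {
        List<String> drained = new ArrayList<>();
        if (replyQueue.isEmpty()) {
            return drained;
        }
        switch (policy) {
            case FAIL:
                throw new IllegalStateException(
                        replyQueue.size() + " stale replies pending; refusing to start");
            case DRAIN:
                replyQueue.drainTo(drained);
                break;
        }
        return drained;
    }

    public static void main(String[] args) {
        BlockingQueue<String> replyQueue = new LinkedBlockingQueue<>();
        replyQueue.add("reply-from-failed-run");
        List<String> removed = handlePendingReplies(replyQueue, PendingReplyPolicy.DRAIN);
        System.out.println("drained=" + removed.size() + " remaining=" + replyQueue.size());
    }
}
```

In a real integration the check would sit in a `StepExecutionListener` (or inside the writer's `open` phase) and consume from the actual reply channel rather than a local queue.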

The current situation is that one message is read (and probably lost) and the job then fails. By that point the job has already done all the setup work for the reader etc., which can mean additional work during cleanup. Ideally, the check for pending messages, and the handling of them, takes place before the reader is initialized.

Note: In our local implementation we empty the reply queue. After removing messages we wait a short interval to see if new ones appear, and fail if they do; that can happen when a slave is still processing while the master has been restarted and triggers the job again before the slaves have processed all pending messages. This could be controlled with another property if generally useful.
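The drain-then-wait behavior described in the note can be sketched in plain Java (the `drainAndVerifyQuiet` helper is a hypothetical name, and a `BlockingQueue` again stands in for the reply queue):

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;

public class ReplyQueueDrainer {

    /**
     * Drains all stale replies, then waits a short settling interval.
     * If a new reply arrives within that window, a worker is presumably
     * still processing chunks from the previous run, so refuse to start.
     *
     * @return the number of stale replies removed
     */
    public static int drainAndVerifyQuiet(BlockingQueue<String> replyQueue,
                                          long settleMillis) throws InterruptedException {
        int drained = 0;
        while (replyQueue.poll() != null) {
            drained++;
        }
        // Timed poll: blocks up to settleMillis waiting for a late arrival.
        String lateArrival = replyQueue.poll(settleMillis, TimeUnit.MILLISECONDS);
        if (lateArrival != null) {
            throw new IllegalStateException(
                    "Reply arrived during settle window; a worker may still be active");
        }
        return drained;
    }

    public static void main(String[] args) throws InterruptedException {
        BlockingQueue<String> replyQueue = new LinkedBlockingQueue<>();
        replyQueue.add("stale-1");
        replyQueue.add("stale-2");
        System.out.println("drained " + drainAndVerifyQuiet(replyQueue, 100) + " stale replies");
    }
}
```

The settle interval trades startup latency for safety: a longer window catches slower workers but delays every restart, which is why the note suggests making it configurable.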