Support for mixing per-record itemprocessors with list-based itemprocessors [BATCH-2307]

spring-projects / spring-batch

Spring Batch is a framework for writing batch applications using Java and Spring

Apache License 2.0


Chris Shumaker opened BATCH-2307 and commented

It seems easier to write ItemProcessors that handle one item at a time, but sometimes, for performance and scalability reasons, it is preferable to handle a list. Currently, a user must choose between maintainability and scalability, and this can become a nightmare if the decision changes later: jobs with per-item processing must be converted from start to end (reader, processors, writers) to handle the new paradigm. One specific example is when another framework works more efficiently on a chunk/batch/array than on a per-record basis. This is the case with most rules engines.

An almost complete solution might be a pair of processors: an aggregating processor that collects items into a list, and a splitting processor that breaks the list back into individual items. This has issues, such as "what happens to the unprocessed aggregates when the reader returns null?". Perhaps this would be possible if the processor had some insight into what the reader returns.

Another alternative would be a configurable change to any given processor that determines how the core read, process, write pipeline works. For example, a configurable value that says chunked="true" on an itemprocessor would designate to spring batch that it should aggregate items prior to calling process for any processor marked as such. That might eliminate synchronization issues between the list-based processor and the non-list-based reader.

Affects: 3.0.1

Mahmoud Ben Hassine commented

The "Item" concept is abstract. Nothing prevents you from having a logical item that is an aggregate of multiple physical items. This is the case, for example, with flat files where the target domain object spans multiple physical lines. The ItemReader<T> and ItemProcessor<I,O> are generic, so depending on how you define your "item" and how the reader provides it, the pipeline can operate on one item at a time or on a list/set of items (encapsulated in a logical aggregate item).
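As a rough illustration of the "logical aggregate item" idea, here is a minimal sketch in which the item type is a `List` of records. The `ItemProcessor` interface below is a local stand-in that mirrors the shape of the Spring Batch interface so the snippet compiles on its own; `PerRecordAdapter` and `ListProcessorDemo` are hypothetical names, not framework classes.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Local stand-in mirroring Spring Batch's ItemProcessor<I, O> (assumption: declared here
// only so the sketch is self-contained; in a real job you would use the framework type).
interface ItemProcessor<I, O> {
    O process(I item) throws Exception;
}

// Hypothetical adapter: lets an existing per-record processor run on an aggregate item.
class PerRecordAdapter<I, O> implements ItemProcessor<List<I>, List<O>> {
    private final ItemProcessor<I, O> delegate;

    PerRecordAdapter(ItemProcessor<I, O> delegate) {
        this.delegate = delegate;
    }

    @Override
    public List<O> process(List<I> items) throws Exception {
        List<O> results = new ArrayList<>();
        for (I item : items) {
            O result = delegate.process(item);
            if (result != null) { // a null result means "filter this record out"
                results.add(result);
            }
        }
        return results;
    }
}

class ListProcessorDemo {
    static List<Integer> run() throws Exception {
        // Per-record processor: trims a string and filters out blank ones.
        ItemProcessor<String, String> cleaner = s -> {
            String trimmed = s.trim();
            return trimmed.isEmpty() ? null : trimmed;
        };
        // List-based processor: works on the whole aggregate in one call.
        ItemProcessor<List<String>, List<Integer>> lengths = list -> {
            List<Integer> out = new ArrayList<>();
            for (String s : list) {
                out.add(s.length());
            }
            return out;
        };
        ItemProcessor<List<String>, List<String>> adapted = new PerRecordAdapter<>(cleaner);
        List<String> aggregate = Arrays.asList(" a ", "bb ", "   ");
        // Both processors now share the same item type and can be chained.
        return lengths.process(adapted.process(aggregate));
    }
}
```

With the adapter in place, per-record and list-based processors operate on the same item type, so the choice between them no longer forces a rewrite of the reader and writer.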

> a configurable value that says chunked="true" on an itemprocessor would designate to spring batch that it should aggregate items prior to calling process for any processor marked as such

The aggregation should happen on the reading side (as you said: "prior to calling process") so that aggregated items are sent to the processor. We provide an example of this, the AggregateItemReader, in the Spring Batch samples.
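To make the reading-side aggregation concrete, here is a minimal sketch of a reader that wraps a per-record reader and returns fixed-size groups. It is not the code of the AggregateItemReader sample; grouping by a fixed size, the `AggregatingReader` name, and the local `ItemReader` stand-in are all assumptions made for illustration. Because the wrapper sees the delegate's `null` directly, it can flush the last partial group, which addresses the "unprocessed aggregates" question raised above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

// Local stand-in mirroring Spring Batch's ItemReader<T>; read() returns null at end of input.
interface ItemReader<T> {
    T read() throws Exception;
}

// Hypothetical wrapper: turns a per-record reader into a reader of fixed-size groups.
class AggregatingReader<T> implements ItemReader<List<T>> {
    private final ItemReader<T> delegate;
    private final int groupSize;
    private boolean exhausted = false;

    AggregatingReader(ItemReader<T> delegate, int groupSize) {
        this.delegate = delegate;
        this.groupSize = groupSize;
    }

    @Override
    public List<T> read() throws Exception {
        if (exhausted) {
            return null; // the delegate already signalled end of input
        }
        List<T> group = new ArrayList<>(groupSize);
        while (group.size() < groupSize) {
            T item = delegate.read();
            if (item == null) {
                exhausted = true; // remember this so the delegate is not read again
                break;
            }
            group.add(item);
        }
        // A partial final group is still returned; only an empty one ends the stream.
        return group.isEmpty() ? null : group;
    }
}

class AggregatingReaderDemo {
    static List<List<Integer>> groups() throws Exception {
        Iterator<Integer> source = Arrays.asList(1, 2, 3, 4, 5).iterator();
        ItemReader<Integer> perRecord = () -> source.hasNext() ? source.next() : null;
        ItemReader<List<Integer>> grouped = new AggregatingReader<>(perRecord, 2);

        List<List<Integer>> all = new ArrayList<>();
        List<Integer> group;
        while ((group = grouped.read()) != null) {
            all.add(group);
        }
        return all; // three groups: [1, 2], [3, 4], [5]
    }
}
```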

So for me, the requested feature "Support for mixing per-record itemprocessors with list-based itemprocessors" is already possible with the current chunk processing model and the generic interfaces that Spring Batch provides. It is just a matter of how to design an item. @Chris Shumaker Do you agree?

As a side note since you talked about performance, I would like to emphasize two important points:

- Readers are intended to stream items one at a time so that the whole data set is not loaded into memory. This has proved to be the most efficient way of reading data in terms of performance and memory consumption. Many powerful Unix tools (sed, awk, etc.) and big data tools (Spark, Flink, Storm, etc.) use this model. Spring Batch is no different.
- Writers are designed to write data in bulk mode, which is more efficient and performs better (think of [JDBC batch updates](https://docs.oracle.com/javase/7/docs/api/java/sql/Statement.html#executeBatch()), Elastic's Bulk indexing, Mongo's Bulk inserts, etc.).
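The two points above can be sketched as a simplified chunk loop: items are read one at a time, buffered, and handed to the writer in bulk. This is only an illustration under stated assumptions. The `ItemReader` and `ItemWriter` interfaces are local stand-ins shaped like the 3.x interfaces, `ChunkLoop` is a hypothetical name, and a real chunk-oriented step also handles processing, transactions, skip and retry, none of which appear here.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

// Local stand-ins mirroring the reader/writer contracts (assumption: not the framework types).
interface ItemReader<T> {
    T read() throws Exception; // null signals end of input
}

interface ItemWriter<T> {
    void write(List<? extends T> items) throws Exception; // receives a whole chunk
}

// Hypothetical, heavily simplified version of a chunk-oriented loop.
class ChunkLoop {
    static <T> void run(ItemReader<T> reader, ItemWriter<T> writer, int chunkSize) throws Exception {
        List<T> chunk = new ArrayList<>(chunkSize);
        T item;
        while ((item = reader.read()) != null) { // stream one item at a time
            chunk.add(item);
            if (chunk.size() == chunkSize) {
                writer.write(chunk);             // one bulk write per full chunk
                chunk = new ArrayList<>(chunkSize);
            }
        }
        if (!chunk.isEmpty()) {
            writer.write(chunk);                 // flush the final partial chunk
        }
    }
}

class ChunkLoopDemo {
    static List<Integer> batchSizes() throws Exception {
        Iterator<String> source = Arrays.asList("a", "b", "c", "d", "e").iterator();
        ItemReader<String> reader = () -> source.hasNext() ? source.next() : null;

        // Records the size of each bulk write instead of touching a real data store.
        List<Integer> sizes = new ArrayList<>();
        ItemWriter<String> writer = items -> sizes.add(items.size());

        ChunkLoop.run(reader, writer, 2);
        return sizes; // five items with chunk size 2 give writes of 2, 2 and 1 items
    }
}
```

At most one chunk is held in memory at a time, while the writer still sees enough items per call to use a bulk API.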

Support for mixing per-record itemprocessors with list-based itemprocessors [BATCH-2307] #1296