spring-projects / spring-batch

Spring Batch is a framework for writing batch applications using Java and Spring
http://projects.spring.io/spring-batch/
Apache License 2.0

Spring Batch and Reactive Streams [BATCH-2596] #1008

Open spring-projects-issues opened 7 years ago

spring-projects-issues commented 7 years ago

Christian Trutz opened BATCH-2596 and commented

Hello Spring Batch team,

are there any plans to incorporate Reactive Streams (https://projectreactor.io/) in Spring Batch? I am thinking about concepts like backpressure (between ItemReaders and ItemWriters within a Spring Batch step). I talked today with Mark Paluch at JAX and he suggested opening a discussion issue.

Christian


1 vote, 4 watchers

spring-projects-issues commented 7 years ago

Michael Minella commented

We have had some conversations about it. No concrete plans yet. We'll keep you posted.

waqaskamran commented 4 years ago

@spring-issuemaster any update on this thread?

philipbel commented 3 years ago

Any updates? It's been 3+ years since the original request.

mminella commented 3 years ago

@philipbel We're really interested in working with someone with a concrete use case for this. If you have one you can share, let us know!

philipbel commented 3 years ago

> @philipbel We're really interested in working with someone with a concrete use case for this. If you have one you can share, let us know!

@mminella, I am working on a server where, upon user signup, a job is triggered to fetch some resources from third-party REST APIs and populate a MongoDB collection for the user. A second use case is to periodically (once every few hours) run a batch job, again fetching some external resources, doing a computation on them, and storing the result in MongoDB and in a third-party REST API. A third use case is running machine learning jobs (tied to AWS) once a day, which also access external resources and store their results in MongoDB.

For all jobs, I would like to give the user an overview of the job's status, preferably with a progress indicator. In terms of data it's not much, < 10 MB of JSON per job per user. Each job has a few steps.

The application uses WebFlux + WebClient and MongoDB Reactive.

mminella commented 3 years ago

@philipbel So what are you looking to get out of a Reactive Streams integration with Spring Batch?

omar-napoleon commented 3 years ago

Hello. Is there a way to integrate RSocket (Reactor) into an ItemProcessor? I have the following problem: I have to call a server over TCP, and when sending the data I need the response in order to process the data and pass it on to the ItemWriter, but waiting for the response delays all my processing. I need the communication with the server to be asynchronous without queuing up the next iterations of the ItemProcessor. I have already converted the ItemProcessor to be asynchronous, but even so, when I run out of threads because of the external server's latency, my processing is delayed. I must meet a minimum TPS when sending to the server.
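For reference, the asynchronous setup described above is typically wired with AsyncItemProcessor and AsyncItemWriter from spring-batch-integration; the sketch below shows roughly how, with the MyInput/MyOutput types, pool sizes, and the remote call all being placeholders. Note that throughput stays bounded by the executor's pool size as long as the delegate blocks, which is exactly the limitation described above.

```java
import org.springframework.batch.integration.async.AsyncItemProcessor;
import org.springframework.batch.integration.async.AsyncItemWriter;
import org.springframework.batch.item.ItemProcessor;
import org.springframework.batch.item.ItemWriter;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.core.task.TaskExecutor;
import org.springframework.scheduling.concurrent.ThreadPoolTaskExecutor;

@Configuration
public class AsyncStepConfig {

    // Placeholder delegate performing the blocking remote (e.g. TCP/RSocket) call per item.
    @Bean
    public ItemProcessor<MyInput, MyOutput> remoteCallProcessor() {
        return input -> callRemoteServerAndMap(input);
    }

    // Pool sized for the remote call latency, not for CPU-bound work.
    @Bean
    public TaskExecutor processingExecutor() {
        ThreadPoolTaskExecutor executor = new ThreadPoolTaskExecutor();
        executor.setCorePoolSize(16);
        executor.setMaxPoolSize(64);
        executor.setQueueCapacity(100);
        executor.initialize();
        return executor;
    }

    // Runs the delegate on the executor and hands a Future to the writer,
    // so the step does not wait for each item before reading the next one.
    @Bean
    public AsyncItemProcessor<MyInput, MyOutput> asyncItemProcessor() {
        AsyncItemProcessor<MyInput, MyOutput> asyncProcessor = new AsyncItemProcessor<>();
        asyncProcessor.setDelegate(remoteCallProcessor());
        asyncProcessor.setTaskExecutor(processingExecutor());
        return asyncProcessor;
    }

    // Unwraps the futures (waiting at chunk boundaries) and delegates to the real writer,
    // which is assumed to be defined elsewhere.
    @Bean
    public AsyncItemWriter<MyOutput> asyncItemWriter(ItemWriter<MyOutput> delegateWriter) {
        AsyncItemWriter<MyOutput> asyncWriter = new AsyncItemWriter<>();
        asyncWriter.setDelegate(delegateWriter);
        return asyncWriter;
    }

    private MyOutput callRemoteServerAndMap(MyInput input) {
        return new MyOutput(); // placeholder for the actual remote interaction
    }

    static class MyInput { }
    static class MyOutput { }
}
```

Because the async processor emits futures, the step is then declared with Future<MyOutput> as its output type, e.g. .<MyInput, Future<MyOutput>>chunk(...).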

fmbenhassine commented 3 years ago

Spring Batch and Reactive Streams

First of all, Reactive Streams are designed for (potentially infinite) data streams, while Spring Batch is a batch processing framework designed for finite, fixed datasets. In my opinion, this is already a fundamental mismatch that can result in a non-natural integration between these two tools.

Now even if we try to introduce "reactive" in some way in Spring Batch, we need to be careful about several design choices. The following excerpt from the Spring Framework FAQ section is key:

For handlers to be fully non-blocking, you need to use reactive libraries throughout the processing chain,
all the way to the persistence layer.

Spring Framework even recommends keeping the blocking model if the stack is only partially reactive:

By all means, keep using Spring MVC if you are developing web apps that don't benefit from a non-blocking
programming model, or that use blocking JPA or JDBC APIs for persistence (typically in combination with
thread-bound transactions).

Based on these statements, for a web application to be fully reactive, the entire stack should be reactive, from the controller all the way down to the persistence layer. This is no different for batch applications, except that we obviously don't have controllers here, but the end-to-end job execution should be reactive. It does not make sense to have a "reactive non-blocking step" that interacts with a blocking job repository.

So to really benefit from this reactive story, the entire framework should be reactive, from batch artefacts (reader, processor, writer, listeners, etc.) to infrastructure beans (job repository, transaction manager, etc.). And to achieve that, a huge effort would be needed.

Moreover, the current chunk-oriented processing model is actually incompatible with the reactive paradigm. The reason is that ChunkOrientedTasklet waits for the chunkProcessor (processor + writer) to process the whole chunk before reading the next chunk:

// simplified from ChunkOrientedTasklet#execute
Chunk<I> inputs = chunkProvider.provide(contribution);
chunkProcessor.process(contribution, inputs);

So this implementation would have to be adapted as well. And all of these changes would be required before we even talk about the current concurrency model of Spring Batch (which is incompatible with the reactive paradigm) and the optimistic locking strategy used at the job repository level.
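To make the contrast with the blocking loop above concrete, here is a minimal, self-contained Reactor sketch (not Spring Batch code, and not a proposed API) of what a chunk-like read/process/write pipeline looks like in reactive terms, with the reader, processor, and writer all simulated:

```java
import java.time.Duration;
import java.util.List;
import reactor.core.publisher.Flux;
import reactor.core.publisher.Mono;

public class ReactiveChunkSketch {

    public static void main(String[] args) {
        Flux.range(1, 100)                              // simulated reader (finite source)
            .map(i -> "item-" + i)                      // simulated non-blocking processor
            .buffer(10)                                 // group items into "chunks" of 10
            .concatMap(ReactiveChunkSketch::writeChunk) // simulated writer; demand flows back
                                                        // to the source (backpressure)
            .blockLast();                               // a finite job still has to wait for the end
    }

    private static Mono<Void> writeChunk(List<String> chunk) {
        System.out.println("writing chunk of size " + chunk.size());
        return Mono.delay(Duration.ofMillis(10)).then(); // simulated asynchronous write
    }
}
```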

In my opinion, "reactive support" is not a feature we can simply add to Spring Batch; it actually requires a complete rewrite of 80% of the framework (if not more). For all these reasons, I believe the cost/benefit of such an integration is too high to be considered, but I'm open to being convinced otherwise.

jjmargon commented 3 years ago

Very interesting last comment by benas. It's true that Spring Batch has the "chunk" (i.e., finite data) concept deeply embedded in the framework. Also, in batch applications or jobs you don't usually have the concurrency concerns you have in APIs with an external interface (e.g., a REST API). Recall that reactive applications won't be faster; they are there to optimize computing resources for IO-bound work and to support more concurrent users with the same resources than a thread-per-request model.

However... batch applications in the enterprise world are really important. There is a lot of focus nowadays on web APIs and the like, but many applications have a batch counterpart that covers a lot of business needs. Just a lot... So it's not rare to have an enterprise use case with hundreds or thousands of concurrent job (or even task) executions: different job definitions, but executing concurrently. Also, by their nature, these job executions do a lot of IO. IMHO, it's very important that Spring could have a solution for reactive batch. For example, in Spring Cloud Data Flow, imagine a server with just a couple of threads managing all job executions in the event-loop paradigm. It would be awesome in terms of computing optimization, and in the cloud world it would mean huge cost savings.

In any case, I think you are right: it's virtually impossible to adapt the current Spring Batch (and Spring Cloud Data Flow) projects to support a reactive model. To give this reactive support to batch applications, I think the only solution is to create a new project for this goal.

omerk706 commented 2 years ago

Are there any updates on this subject? I do see the point in @benas's comment, and I understand there would be a lot of rewriting to do. But wasn't a similar thing done for WebFlux?

I see a use case in my workplace where a huge amount of data is read from a Hadoop cluster record by record, data is collected from external services for each record, and the result is then processed and submitted to a message queue.

The first part of this job accumulates all of the Hadoop records into a file on the batch job's local file system; the next section then reads them back and collects them into subsets of data to be processed by the next tasklet.

I see how different this is from reading one record, then .zipWith(fetchDataFromExternal...), and in the end submitting it to a message queue. However, apart from the huge file download, the rest feels like reactive is the natural solution. If reactive had existed in its current form when Spring Batch was designed, wouldn't it have been the natural choice for that kind of flow?

toniocus commented 1 year ago

Let me add a small use case I'm facing nowadays in an environment that is moving to cloud services. I'm pretty new to reactive programming (just a few weeks in), but a very experienced Java developer, so I might make some big mistakes about things I'm not aware of; sorry in advance if that's the case.

We have a few batch jobs, some using Spring Batch, some not, that basically take care of getting some information needed for the business. The company is moving to an SF app plus cloud services that handle some services outside SF.

As a consequence, some 'BusinessRules' are moving from being replicated in a lot of places to something more like SOA, where we need to call REST APIs (in the cloud and in SF) to apply those rules instead of having them replicated in our code.

I guess at some point (though I'm really not sure) these batch jobs might be replaced with a better architectural solution, but for now, and probably for a couple of years, our batch applications need to replace DB calls or replicated code logic with calls to our REST APIs.

So I started investigating how I can use the new reactive WebClient in batch applications, and I'm facing a lot of weird things (probably normal for a new reactive programmer), when I discovered this thread and thought maybe I'm not the only one facing this kind of problem.

Hope this adds something to the thread. By the way, if anyone can point me to articles that discuss a similar problem, I'll be very grateful.

Thanks in advance,

Sam-Kruglov commented 1 year ago

@toniocus hey, why don't you just convert everything to synchronous programming? You could use something like Feign to do your calls, or if you have to use WebClient you can always just call .block(). Or you could write everything in a reactive way for future-proofing and put all that code in a separate module, but in Spring Batch you just call block() on it at the very end.

toniocus commented 1 year ago

@Sam-Kruglov, thanks for the answer.

Well, using blocking calls or Reactor's block() (on a single thread) has the problem that the REST calls take on the order of hundreds of milliseconds, while the previous 'local implementation' takes tens of milliseconds or less, which is a significant impact.

So the basic ideas are: use blocking calls on multiple threads (10 or so), or use Reactor without blocking (of course, at some point you have to block, but ideally not on every call; that's what I'm working on while figuring out how to move to a fully reactive approach).

Anyhow, this is out of scope for this thread; I was just wondering whether it would be worth adding some simplification for this case. More a question than a request.

Thanks again.

omerk706 commented 1 year ago

@toniocus I'm no expert, but I've gained some mileage with reactive, specifically Reactor, and I don't see a case where a call to block() is favorable over using the built-in backpressure support. Reactive design "assumes" you won't block, and you should avoid blocking at all costs (the 'don't break the chain' principle is broken when you block).

Both the reactive server side and the (Spring) WebClient no longer use the thread-per-request model. Reactive has an event-loop threading model that abstracts away the thread-related code, but you need to understand how it works and how blocking will affect your runtime. Reactive implementations don't spawn a new thread when saturated; they just hang, waiting for the blocking call to end. If an event loop is scheduling tasks on a blocked thread that is waiting for IO to complete, that application will eventually halt. This is because Reactor (for example) uses the same event loop for the entire application, making it extremely efficient thread-wise, but at the same time very vulnerable to performance problems if block() is used on the event loop. You could, of course, offload such work to another scheduler (for example with .publishOn(Schedulers.boundedElastic())), but that comes at the cost of context switches, avoiding which is one of the main advantages of reactive.
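For reference, the pattern the Reactor documentation recommends for isolating an unavoidable blocking call is to wrap it and subscribe it on a scheduler meant for blocking work; a minimal sketch, where blockingFetch() stands in for the real call:

```java
import reactor.core.publisher.Mono;
import reactor.core.scheduler.Schedulers;

public class BlockingCallSketch {

    public static void main(String[] args) {
        Mono<String> result = Mono
                // Defer the blocking call so it only runs on subscription.
                .fromCallable(BlockingCallSketch::blockingFetch)
                // Run it on the scheduler intended for blocking work,
                // keeping the event-loop threads free.
                .subscribeOn(Schedulers.boundedElastic());

        System.out.println(result.block()); // blocking here only for the demo
    }

    private static String blockingFetch() throws InterruptedException {
        Thread.sleep(100); // simulated blocking I/O
        return "payload";
    }
}
```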

Bottom line is, migrating the code to reactive is not a solution, unless the whole application is migrated, and planned accordingly.

I suggest trying to gain more understanding, especially around the subscription lifecycle, which is where backpressure comes into play.

toniocus commented 1 year ago

@omerk706, wow, thanks for all your tips; as far as I can recall I'm dealing with almost all of them :-). But let's not clutter this thread, which is a Spring Batch one, with me just learning how Reactor works; every now and then I'm surprised by how much I had misunderstood.

Thanks again for your interest in helping. When I get to something that I like, I'll start a new thread, probably on the Reactor list, so I can receive a list of all my mistakes :-).

Thanks everybody for the interest shown to my post.

Sam-Kruglov commented 1 year ago

@toniocus yeah, until this issue is resolved you won't get many benefits from moving to reactive.

So you can either migrate your code to reactive or stay synchronous. If you migrate now, you have to block every reactive stream (probably with .subscribeOn(elastic())), which will make your code essentially behave the same but with event-loop overhead, so it'll be a little slower (probably insignificantly). On the other hand, you get future-proofing: when this issue is resolved you can remove all the block() calls, configure Spring Batch, and you're done.

riloki commented 1 year ago

I have found that partial reactive can still be useful in certain situations, even within a synchronous batch framework like Spring Batch. As an example of such a job, suppose each input item requires a call to two or more vendor REST APIs, and each of those results requires a secondary call to another REST API to fill in key details. These are then aggregated to formulate the output.

In the synchronous case, the time to process each item would be the sum of the round trips of all the first-stage and second-stage calls, because the calls are laid "end-to-end". Using WebClient and flatMap, the reactive WebClient can overlap all the calls for that item without my code needing to explicitly allocate threads or pools. Although the code blocks at the end of processing each item, it still saves a lot of time and programming complexity compared to both the fully synchronous and the thread-explosion approaches.
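As an illustration of that pattern, an ItemProcessor along these lines overlaps the per-item calls and blocks only once per item; the base URL, endpoints, item types, and the aggregate(...) helper are all hypothetical:

```java
import org.springframework.batch.item.ItemProcessor;
import org.springframework.web.reactive.function.client.WebClient;
import reactor.core.publisher.Mono;

// Sketch: fan out the per-item REST calls reactively and block once per item.
public class EnrichingItemProcessor implements ItemProcessor<String, String> {

    private final WebClient webClient = WebClient.create("https://vendor.example.com");

    @Override
    public String process(String item) {
        // First-stage call to vendor A, whose result requires a secondary call for details.
        Mono<String> detailsFromA = webClient.get()
                .uri("/api/a/{id}", item)
                .retrieve()
                .bodyToMono(String.class)
                .flatMap(key -> webClient.get()
                        .uri("/api/a/{key}/details", key)
                        .retrieve()
                        .bodyToMono(String.class));

        // First-stage call to vendor B, issued concurrently with the chain above.
        Mono<String> resultFromB = webClient.get()
                .uri("/api/b/{id}", item)
                .retrieve()
                .bodyToMono(String.class);

        // The calls overlap; we block only once, when the item's aggregated result is needed.
        return Mono.zip(detailsFromA, resultFromB, this::aggregate).block();
    }

    private String aggregate(String a, String b) {
        return a + "|" + b; // placeholder aggregation
    }
}
```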

So, I would caution against sweeping judgments one way or the other. With regard to Spring Batch, I do believe a new project may be in order (Spring Reactive Batch?) because it could actually simplify things as well as improve performance and possibly lower resource usage. That said, we can still gain some reactive benefits for batch jobs with certain characteristics using the current version of Spring Batch.

msoares80 commented 1 year ago

Hi,

I have a use case that I think would be very useful to support in Spring Batch in a reactive way. If that requires a complete rewrite of 80% of the framework, I think it would be better to create a new library that is designed for reactive programming.

My use case would involve creating a "map-reduce" functionality, and the parallelism of the reactive mode would be very important for this.

My problem is that I need to process a huge dataset in a Postgres database, and we have developed a first version in Spring Batch that tries to process the entire dataset at once. However, this processing seems to stall because it is too slow. Our use case does not involve infinite data; when the batch starts, the data is finite but very large.

Our idea would be to divide the dataset into sub-datasets and launch a process for each one, all of them running in parallel, of course. If it were possible to declare a pool of agent servers and distribute the processes among them, that would be perfect. When one of these processes finishes, using the reactive approach, we could trigger a new map process if necessary or move on to the reduce operation.
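For what it's worth, the "divide the dataset into sub-datasets and distribute them over a pool of workers" part is close to what Spring Batch's existing (local or remote) partitioning model already provides; a minimal key-range Partitioner might look roughly like this, with the id column and ranges as placeholders:

```java
import java.util.HashMap;
import java.util.Map;
import org.springframework.batch.core.partition.support.Partitioner;
import org.springframework.batch.item.ExecutionContext;

// Splits a numeric id range into gridSize sub-ranges; each sub-range becomes
// one worker step execution (local threads or remote worker instances).
public class IdRangePartitioner implements Partitioner {

    private final long minId; // e.g. SELECT MIN(id) FROM the_table
    private final long maxId; // e.g. SELECT MAX(id) FROM the_table

    public IdRangePartitioner(long minId, long maxId) {
        this.minId = minId;
        this.maxId = maxId;
    }

    @Override
    public Map<String, ExecutionContext> partition(int gridSize) {
        long targetSize = (maxId - minId) / gridSize + 1;
        Map<String, ExecutionContext> partitions = new HashMap<>();
        long start = minId;
        for (int i = 0; i < gridSize && start <= maxId; i++) {
            ExecutionContext context = new ExecutionContext();
            context.putLong("minId", start);
            context.putLong("maxId", Math.min(start + targetSize - 1, maxId));
            partitions.put("partition" + i, context);
            start += targetSize;
        }
        return partitions;
    }
}
```

The worker step's reader then picks up its assigned range via step-scoped properties such as #{stepExecutionContext['minId']}, and with remote partitioning (spring-batch-integration) the sub-ranges are dispatched to a pool of worker instances.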

Thank you.

hantsy commented 1 year ago

R2DBC's Batch is a good match for processing a chunk of data, but it is not supported in Spring Data R2DBC.

Another issue is handling the R2DBC transaction around a chunk.
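For context, "R2DBC Batch" refers to the Batch abstraction of the R2DBC SPI, which groups several statements into a single database round trip; a minimal sketch using the SPI directly (the connection URL, table, and statements are placeholders, and H2 is only an example driver):

```java
import io.r2dbc.spi.Connection;
import io.r2dbc.spi.ConnectionFactories;
import io.r2dbc.spi.ConnectionFactory;
import io.r2dbc.spi.Result;
import reactor.core.publisher.Flux;

public class R2dbcBatchSketch {

    public static void main(String[] args) {
        // Placeholder URL; use the options of your actual driver and database.
        ConnectionFactory connectionFactory = ConnectionFactories.get("r2dbc:h2:mem:///batchdemo");

        Flux.usingWhen(
                connectionFactory.create(),
                // Send a "chunk" of statements to the database as one batch.
                connection -> Flux.from(connection.createBatch()
                            .add("CREATE TABLE IF NOT EXISTS person(name VARCHAR(50))")
                            .add("INSERT INTO person(name) VALUES('alice')")
                            .add("INSERT INTO person(name) VALUES('bob')")
                            .execute())
                        .flatMap(Result::getRowsUpdated),
                Connection::close)
            .blockLast(); // blocking only for the demo
    }
}
```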