spring-projects / spring-modulith

Modular applications with Spring Boot
https://spring.io/projects/spring-modulith
Apache License 2.0

Overhaul event publication lifecycle #796

Open odrotbohm opened 2 months ago

odrotbohm commented 2 months ago

The persistent representation of an event publication currently effectively captures two states. The default state records that a transactional event listener still has to be invoked eventually. The publication also stays in that state while the listener processes the event. Once the listener succeeds, the event publication is marked as completed. If the listener fails, the event publication remains in its original state.
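For illustration, a minimal sketch of such a transactional event listener, with a hypothetical `OrderCompleted` event and `InventoryUpdater` listener. Spring Modulith's `@ApplicationModuleListener` is a composed annotation over the three Spring annotations used below; the registry records an incomplete publication per listener and marks it completed once the listener returns without throwing.

```java
import org.springframework.scheduling.annotation.Async;
import org.springframework.stereotype.Component;
import org.springframework.transaction.annotation.Propagation;
import org.springframework.transaction.annotation.Transactional;
import org.springframework.transaction.event.TransactionalEventListener;

// Hypothetical event and listener, only to illustrate the lifecycle described above.
record OrderCompleted(String orderId) {}

@Component
class InventoryUpdater {

    // The registry persists an incomplete publication entry for this listener when the
    // transaction publishing OrderCompleted commits. If this method returns normally,
    // the entry is marked as completed; if it throws, the entry stays incomplete.
    @Async
    @Transactional(propagation = Propagation.REQUIRES_NEW)
    @TransactionalEventListener
    public void on(OrderCompleted event) {
        // ... update inventory, notify other modules, etc.
    }
}
```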

This basic lifecycle is easy to work with, but has a couple of downsides. First and foremost, we cannot differentiate between publications that are about to be processed, ones that are currently being processed, and ones that have failed. Especially the latter is problematic, as a primary use case supported by the registry is to recover from erroneous situations by resubmitting failed event publications. Developers usually resort to rather fuzzy approaches, such as considering events that have not been completed within a given time frame to have failed.

To improve on this, we’d like to move to a more sophisticated event publication lifecycle that makes it easier to detect failed publications. One possible way to achieve this would be to introduce a dedicated status field or, consistent with the current approach of setting a completion date, a failure date field that would be set in case an event listener fails. That step, however, might fail as well, for the same reason that caused the event listener to fail in the first place. That’s why it might also make sense to introduce a duration configuration property after which incomplete event publications are considered failed. The feature bears a bit of risk, as we will have to think about the upgrade process for existing Spring Modulith applications, whose databases might still contain entries for incomplete event publications.
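As a purely hypothetical sketch of how those two ideas could combine (neither the field names nor the property exist in Spring Modulith today): a publication without a completion date would count as failed either because a failure date was recorded, or because it stayed incomplete longer than a configured duration.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.Optional;

// Hypothetical helper illustrating the proposed rule. All names are made up for
// illustration; they are not Spring Modulith API.
class PublicationStatusRule {

    private final Duration considerFailedAfter; // e.g. supplied via a configuration property

    PublicationStatusRule(Duration considerFailedAfter) {
        this.considerFailedAfter = considerFailedAfter;
    }

    boolean isConsideredFailed(Instant publicationDate,
                               Optional<Instant> completionDate,
                               Optional<Instant> failureDate,
                               Instant now) {
        if (completionDate.isPresent()) {
            return false; // completed successfully
        }
        if (failureDate.isPresent()) {
            return true; // listener failure was recorded explicitly
        }
        // Fallback for the case where recording the failure itself failed:
        // treat publications that have been incomplete for too long as failed, too.
        return Duration.between(publicationDate, now).compareTo(considerFailedAfter) > 0;
    }
}
```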

Ideas / Action Items

Related tickets

breun commented 1 month ago

Does not being able to tell that an event is being processed also mean that multi-instance apps are currently not an option?

I’m not a database expert, but I believe at least PostgreSQL supports row-level locking, which would allow concurrent processing of events by multiple instances without resorting to some leader-election scheme.
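For reference, the row-level locking approach is usually implemented with `SELECT … FOR UPDATE SKIP LOCKED`, which lets several instances claim different rows without coordinating a leader. A minimal sketch only; Spring Modulith does not currently process events this way, and the claiming logic here is illustrative, not part of its API.

```java
import java.util.List;
import java.util.UUID;
import org.springframework.jdbc.core.JdbcTemplate;
import org.springframework.transaction.annotation.Transactional;

// Sketch: claims a batch of unprocessed rows so that concurrent instances skip rows
// already locked by someone else. Must be registered as a Spring bean for
// @Transactional to apply.
class EventClaimer {

    private final JdbcTemplate jdbc;

    EventClaimer(JdbcTemplate jdbc) {
        this.jdbc = jdbc;
    }

    @Transactional
    public List<UUID> claimBatch(int batchSize) {
        return jdbc.queryForList("""
                SELECT id FROM event_publication
                WHERE completion_date IS NULL
                ORDER BY publication_date
                LIMIT ?
                FOR UPDATE SKIP LOCKED
                """, UUID.class, batchSize);
    }
}
```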

annnoo commented 1 month ago

@breun
We are currently using spring-modulith and the event publication registry in one of our projects. Multi-instance apps are possible, but the misconception we had was treating the event publication log as a "message queue". It is just a publication log: the table keeps track of which events have been sent and allows you to retry them on startup, but it can't distinguish between events that are currently being processed and "stuck" events.

The whole processing (send event -> handler -> mark as finished) is not done by "submitting" the event to the table and having a "worker" pick it up. The processing always happens on the instance the event was published on in the first place.

The publication log is just there to keep track of which events have been processed. The only information you currently have is whether there is a completion_date on the event and when it got published.

We've built our own retry mechanism around the log, in which we retry events that are at least n minutes old. But because it is a publication log, we have the issue that events sometimes get processed multiple times when an event takes a long time to process (either because one of the steps takes a long time, or because a lot of events get sent and can't be processed since all threads in our thread pool are already busy).
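For context, a retry along those lines can be built on the `IncompleteEventPublications` API that recent Spring Modulith versions expose; the scheduling and the five-minute threshold below are just an example and exhibit exactly the double-processing problem described above, since an old-but-still-running publication cannot be told apart from a stuck one.

```java
import java.time.Duration;
import org.springframework.modulith.events.IncompleteEventPublications;
import org.springframework.scheduling.annotation.Scheduled;
import org.springframework.stereotype.Component;

// Sketch of an "older than n minutes" retry. Requires @EnableScheduling somewhere in
// the application. A slow listener may be re-triggered and its event processed twice.
@Component
class StalePublicationRetry {

    private final IncompleteEventPublications incompletePublications;

    StalePublicationRetry(IncompleteEventPublications incompletePublications) {
        this.incompletePublications = incompletePublications;
    }

    @Scheduled(fixedDelayString = "PT1M")
    void resubmitStalePublications() {
        incompletePublications.resubmitIncompletePublicationsOlderThan(Duration.ofMinutes(5));
    }
}
```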

And that's where this misconception comes into play. If you build a retry mechanism without keeping in mind that the table is not used for processing at all, you may run into these issues.

My thoughts around this topic

We would really like a way to distinguish between events that are currently being processed and events that have failed, but every implementation comes with edge cases that spring-modulith may or may not want to support.

If you have a dedicated status field (e.g. SUBMITTED, PROCESSING, FINISHED, FAILED), you can easily find out which events to retry based on the FAILED status and can skip the PROCESSING ones, unless you have events that are stuck because the instance went down while processing them. To identify those in a multi-instance setup, you would have to keep track of which instances are currently active.
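A hypothetical sketch of that status-based selection (none of this is existing Spring Modulith API); the PROCESSING branch is exactly where the instance tracking mentioned above comes in.

```java
import java.util.Set;

// Hypothetical status model, for illustration only.
enum PublicationStatus {
    SUBMITTED, PROCESSING, FINISHED, FAILED
}

record PublicationView(PublicationStatus status, String owningInstanceId) {

    // Retry FAILED publications; skip PROCESSING ones unless the owning instance is
    // known to be gone, which requires keeping track of which instances are active.
    boolean shouldRetry(Set<String> liveInstances) {
        return status == PublicationStatus.FAILED
                || (status == PublicationStatus.PROCESSING && !liveInstances.contains(owningInstanceId));
    }
}
```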

If you handle it via a failedDate column, you have to identify the ones currently being processed via a time offset (as described in the issue description). But here you have to be careful with longer-running tasks, because, as I mentioned, it can take a few minutes until an event is even picked up (because all threads are already busy).

In that case it could make sense to also have a column recording when the event got picked up and the handler was triggered.
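A hedged sketch of how such a "picked up at" timestamp could feed into stuck-detection; the column names and the record are purely hypothetical.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.Optional;

// Hypothetical: with a pickedUpAt timestamp, "stuck" can be defined relative to when
// processing actually started rather than when the event was published, which is
// friendlier to events that wait in a busy thread pool.
record PublicationTimestamps(Instant publicationDate,
                             Optional<Instant> pickedUpAt,
                             Optional<Instant> completionDate) {

    boolean looksStuck(Duration processingTimeout, Instant now) {
        return completionDate.isEmpty()
                && pickedUpAt.isPresent()
                && Duration.between(pickedUpAt.get(), now).compareTo(processingTimeout) > 0;
    }
}
```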

Conclusion

Thinking more about it, a big problem with the event_publication table is the misconception I mentioned. For example, I expected that event_publication could be seen as a "light" version of event externalization, but it definitely does not work that way (and probably shouldn't be used that way). From my gut feeling (and from talking with colleagues about it), I am not the only one who has stepped into that trap.

Maybe I am drifting into a slightly different topic that this issue isn't about, but I think the docs should make clearer that the current event_publication mechanism should not be seen as an externalized event processing mechanism, and they should point out its limitations.

And regarding what users are expecting (and what @breun mentioned): could it make sense to have some kind of event externalization for PostgreSQL or other databases? Or at least some functionality that moves more of the message processing into the database? I know this is a lot of work (and definitely not part of this issue; I just wanted to mention it here), but I have a feeling that this is what developers want and "see" in event_publication, which it is not.

Edit: Removed

(unless you use event externalization, meaning that events are sent and handled via Kafka, SQS, SNS, etc.; I haven't touched this, to be honest)

I just took a look at the docs, and event externalization means that you publish events to other systems so that other applications can consume them, not that you consume events from those systems.
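For reference, that publish-only direction also shows in how externalization is declared: the event type is annotated with the target it should be forwarded to, and there is no consuming counterpart. A rough sketch, with a made-up target name, assuming the `@Externalized` annotation from the spring-modulith events API module.

```java
import org.springframework.modulith.events.Externalized;

// Sketch: when the publishing transaction completes successfully, the event is
// forwarded to the configured target (e.g. a Kafka topic). Spring Modulith only
// publishes to that target; it does not consume from it.
@Externalized("customer.customer-registered")
record CustomerRegistered(String customerId) {}
```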

aahlenst commented 2 days ago

It is unclear to me what Modulith's responsibility should be: only ensuring the delivery of events, or also dealing with problems.

To ensure the delivery of events, failedDate seems questionable. Either the event has been processed or it hasn't. Handling failures is the responsibility of the application code. The only failures the application cannot handle are those caused by the event delivery mechanism itself (the event can no longer be deserialized, …). But I have a hard time imagining a scenario in which failedDate alone would be useful: either I need more diagnostics (see the next paragraph), or I have to choose between dropping the event and retrying forever.

For handling problems, failedDate alone is inadequate. To deal effectively with failed deliveries, I would have to distinguish whether a failure is likely transient (think OptimisticLockingFailureException, some network problem) or permanent (the object the event concerns has been deleted, …). Furthermore, I would want to keep track of the number of failures and stop retrying after a certain number of attempts. This means we're quickly getting into the territory of Spring Retry and friends.
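To make that concrete, a minimal sketch of the kind of policy an application would end up writing itself; the types and thresholds are hypothetical, not Spring Modulith or Spring Retry API.

```java
import java.net.ConnectException;
import org.springframework.dao.OptimisticLockingFailureException;

// Hypothetical failure-handling policy: retry transient failures a bounded number of
// times, give up immediately on permanent ones. This is exactly the logic that quickly
// overlaps with Spring Retry and friends.
class DeliveryFailurePolicy {

    private static final int MAX_ATTEMPTS = 5;

    boolean shouldRetry(Throwable failure, int previousAttempts) {
        if (previousAttempts >= MAX_ATTEMPTS) {
            return false;
        }
        return isTransient(failure);
    }

    private boolean isTransient(Throwable failure) {
        return failure instanceof OptimisticLockingFailureException
                || failure instanceof ConnectException;
    }
}
```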

Therefore, I think Modulith should focus on the delivery side of things, for example by tracking the status more precisely (event queued, event processing, …), by providing better docs, and perhaps by offering callbacks for dealing with event delivery problems, such as event classes that have been removed or event listeners that no longer exist.