serverlesstechnology / postgres-es

A Postgres implementation of a CQRS event store.

Strange behaviour with Optimistic lock error #9

Closed · berkes closed 1 month ago

berkes commented 1 year ago

While investigating how the optimistic locking is enforced, I found that apparently it isn't enforced.

I'm not sure if that's by design, an oversight or a todo.

Looking at the test https://github.com/serverlesstechnology/postgres-es/blob/main/src/event_repository.rs#L384C13-L384C33 we see that this is tested for. However, that test gives a false positive: it fails on a duplicate index entry (aggregate_id=x, sequence=2), not on the fact that 2 comes after 3. Reading through the implementation, I see nothing that checks that a new sequence must be higher than what is already in the database.
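To make that concrete, here is a sketch of what such a unique index does and does not reject. The table and column names are assumed for illustration, not taken from the actual postgres-es schema:

```rust
// Sketch only: assumes a table `events` with a unique index on
// (aggregate_id, sequence) and an already-connected sqlx PgPool.
use sqlx::PgPool;

async fn demo(pool: &PgPool) -> Result<(), sqlx::Error> {
    let insert = "INSERT INTO events (aggregate_id, sequence) VALUES ($1, $2)";

    // Existing history: sequences 1, 2, 3.
    for seq in [1i64, 2, 3] {
        sqlx::query(insert).bind("x").bind(seq).execute(pool).await?;
    }

    // Rejected: (x, 2) already exists, so the unique index fires.
    assert!(sqlx::query(insert).bind("x").bind(2i64).execute(pool).await.is_err());

    // Accepted: nothing checks that 100 is max + 1 ...
    sqlx::query(insert).bind("x").bind(100i64).execute(pool).await?;
    // ... and nothing checks that 4 comes after 100 either.
    sqlx::query(insert).bind("x").bind(4i64).execute(pool).await?;

    Ok(())
}
```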

Is the optimistic locking meant merely to prevent a sequence number (in combination with an aggregate_id) from occurring twice? Or should it protect against sequence numbers being lower than what has already been received?

If wanted, I can whip up a patch that introduces such a protection, but the actual implementation needs some thought (to avoid race conditions between reading the current sequence and writing the new one). Please let me know if this is desired.

davegarred commented 1 year ago

Hi @berkes, thanks for the message. The optimistic lock is meant only to ensure that the same aggId-sequence pair does not get duplicated in the DB. The CQRS framework should guarantee that the attempt is made with the next available sequence number.
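As a rough illustration of that contract (this is only the shape of the guarantee, not the actual cqrs-es code): the framework derives the next sequence from the events it just loaded, rather than letting the caller pick one.

```rust
// Illustrative sketch: sequences are derived from what was loaded, so a
// well-behaved caller cannot submit an arbitrary one. Not the real
// cqrs-es implementation.
fn next_sequences(current_sequence: usize, new_event_count: usize) -> Vec<usize> {
    (1..=new_event_count).map(|i| current_sequence + i).collect()
}

fn main() {
    // Loaded an aggregate at sequence 3, committing 2 new events -> 4, 5.
    assert_eq!(next_sequences(3, 2), vec![4, 5]);
}
```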

I may be missing the potential issue, though (not an uncommon occurrence). What is the failure mode that you're worried about here? And what are your thoughts on the fix?

berkes commented 1 year ago

I came across this while writing PersistedEventRepository implementations for SQLite and in-memory storage.

I haven't encountered any issues in practice because of this; it's purely theoretical.

One case where the issue might be problematic in practice is when there are holes in the sequences. I guess that's simply solved by "ensuring there are no holes in the sequences", though 😉.

When we have events 1, 2, 99 and 100, we don't guarantee that the next event must be 100+1, only that its sequence did not yet occur; so after 100, we could have events 3, 4, etc.

In other CQRS and ES systems that I've worked on, the optimistic locking ensured that "the next event ID as provided by the requester must match the next event ID as the storage thinks it should be". E.g. by opening a transaction, incrementing a counter, checking the provided event ID against that incremented counter, and, if it matches, inserting the event and committing the transaction.
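A minimal sketch of that pattern, assuming sqlx and a transaction-scoped advisory lock to close the race between the read and the write; the table, column, and error choices here are illustrative, not postgres-es internals:

```rust
// Sketch of the "check the expected next sequence inside a transaction"
// pattern described above. Table and column names are assumed, not the
// real postgres-es schema; the error on mismatch is a placeholder.
use sqlx::{PgPool, Row};

async fn append_event(
    pool: &PgPool,
    aggregate_id: &str,
    new_sequence: i64,
    payload: &str,
) -> Result<(), sqlx::Error> {
    let mut tx = pool.begin().await?;

    // Serialize writers for this aggregate until commit/rollback, so the
    // read below and the insert cannot interleave with another writer.
    sqlx::query("SELECT pg_advisory_xact_lock(hashtext($1))")
        .bind(aggregate_id)
        .execute(&mut *tx)
        .await?;

    // What the storage thinks the current sequence is.
    let current: i64 = sqlx::query(
        "SELECT COALESCE(MAX(sequence), 0) AS current FROM events WHERE aggregate_id = $1",
    )
    .bind(aggregate_id)
    .fetch_one(&mut *tx)
    .await?
    .get("current");

    // The provided sequence must be exactly the next one; anything else is
    // an optimistic-lock failure (stale, duplicate, or out-of-order write).
    if new_sequence != current + 1 {
        tx.rollback().await?;
        return Err(sqlx::Error::RowNotFound); // placeholder error type
    }

    sqlx::query("INSERT INTO events (aggregate_id, sequence, payload) VALUES ($1, $2, $3)")
        .bind(aggregate_id)
        .bind(new_sequence)
        .bind(payload)
        .execute(&mut *tx)
        .await?;

    tx.commit().await
}
```

The advisory lock is just one way to serialize writers per aggregate; SELECT ... FOR UPDATE over the aggregate's existing rows, or a dedicated per-aggregate counter row, would serve the same purpose.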

So my title "strange behaviour" is actually wrong: I should have called it something like "this is not how I expected it to work" 😄

I'm happy to close this thread if this is "by design". Which I can understand. It was just not what I was expecting.

davegarred commented 1 year ago

Ah, I see. This should never happen, but if it does it becomes a problem. Of course, in production I've found that "should never" many times means "usually will", due to a bug or some other reason (e.g. an engineer finds that a quick way to fix a state problem in production is to inject a certain event, but fumbles getting the sequence correct).

It seems like we might need a couple of things here.

TBH I haven't seen this transaction pattern before; I'd love your thoughts on the best way to do this.