ncthbrt / nact-persistence-postgres

Postgres Storage Provider for Nact
Apache License 2.0

`sequence_nr` didn't increment; prevents saving new events #7

Closed · ellis closed this issue 6 years ago

ellis commented 6 years ago

Hmm, I find this one puzzling. The problem is that sequence_nr apparently didn't increment, so it tries to save every new event with sequence_nr = 177, but that sequence number already exists in the database as the last one for this persistence key. Here's the exception:

{ error: duplicate key value violates unique constraint "data_event_journal_uq"
    at Connection.parseE (/Users/ellisw/src/project1031/node_modules/pg/lib/connection.js:553:11)
    at Connection.parseMessage (/Users/ellisw/src/project1031/node_modules/pg/lib/connection.js:378:19)
    at Socket.<anonymous> (/Users/ellisw/src/project1031/node_modules/pg/lib/connection.js:119:22)
    at emitOne (events.js:115:13)
    at Socket.emit (events.js:210:7)
    at addChunk (_stream_readable.js:266:12)
    at readableAddChunk (_stream_readable.js:253:11)
    at Socket.Readable.push (_stream_readable.js:211:10)
    at TCP.onread (net.js:585:20)
  name: 'error',
  length: 299,
  severity: 'ERROR',
  code: '23505',
  detail: 'Key (persistence_key, sequence_nr)=(skus A36OZVWF9SSJ7P, 177) already exists.',
  hint: undefined,
  position: undefined,
  internalPosition: undefined,
  internalQuery: undefined,
  where: undefined,
  schema: 'public',
  table: 'data_event_journal',
  column: undefined,
  dataType: undefined,
  constraint: 'data_event_journal_uq',
  file: 'nbtinsert.c',
  line: '434',
  routine: '_bt_check_unique' }

I would try to troubleshoot this, but I'm missing something pretty basic, so maybe you can point me in the right direction. I only find a few occurrences of sequence_nr in the nact-persistence-postgres code, but I don't see how any of them would cause it to increment.

The problem started occurring when some operation crashed. Do you have any idea how this could happen?

I was able to get past the error by deleting the last event.

ncthbrt commented 6 years ago

Direct answer to where incrementing happens: https://github.com/ncthbrt/nact/blob/master/lib/persistence/persistent-actor.js#L105
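
For readers following along, the idea behind that line is roughly the following; this is an illustrative sketch only, not the actual nact source, and the names are made up:

```js
// Illustrative sketch, not the real nact implementation: the persistent actor
// keeps an in-memory counter that is set from each event applied during
// rehydration and bumped once for every newly persisted event.
class SequenceTracker {
  constructor() {
    this.sequenceNumber = 0;
  }

  // called for each journal event replayed during rehydration
  applyReplayed(event) {
    this.sequenceNumber = event.sequenceNumber;
  }

  // called when the actor persists a new event; the returned value
  // becomes the event's sequence_nr in the journal
  nextSequenceNumber() {
    this.sequenceNumber += 1;
    return this.sequenceNumber;
  }
}
```

If replay stops early for any reason, the counter ends up behind the journal, and the next persist collides with an existing (persistence_key, sequence_nr) pair, which is exactly the unique-constraint error above.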

@ellis: My first question: did you have events with sequence numbers 1-177 in your db?

My second question: when rehydrating the actor, is it possible that the message with sequence number 177 is silently "poisoning" your state? By that I mean, is it possible that 177 throws an exception each time it is replayed?

Third question: are you awaiting after every persist? If not, it could be a race condition.
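
For context, this is roughly what I mean by awaiting every persist; a minimal sketch assuming the spawnPersistent API shown in the nact README, with made-up actor and message names:

```js
const { start, configurePersistence, spawnPersistent } = require('nact');
const { PostgresPersistenceEngine } = require('nact-persistence-postgres');

// Illustrative setup; the connection string is a placeholder.
const system = start(
  configurePersistence(new PostgresPersistenceEngine(process.env.DATABASE_URL))
);

const sku = spawnPersistent(
  system,
  async (state = { quantity: 0 }, msg, ctx) => {
    if (msg.type === 'ADJUST_QUANTITY') {
      // Don't re-journal events while they are being replayed.
      if (!ctx.recovering) {
        // Awaiting ensures the event gets its sequence_nr before the next
        // message is processed; firing and forgetting can race and reuse
        // a sequence number.
        await ctx.persist(msg);
      }
      return { ...state, quantity: state.quantity + msg.delta };
    }
    return state;
  },
  'skus A36OZVWF9SSJ7P', // persistence key from the error above
  'sku-actor'
);
```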

It could be an nact bug, but it may also be something strange going on in your application-level code. Or perhaps it is an intersection of the two. (I know, I know, I've enumerated every possibility here.)

Note: I definitely don't want to dismiss this issue without more information. If it is a bug, this is very high priority, as it could cause production outages to our aggregates.

ellis commented 6 years ago

Thanks for the link to the code.

1) Yes, there were events with sequence numbers 1-177 in the db.

2) Yes, it's definitely possible that event 177 caused an exception during rehydration. Unfortunately, I deleted the row, so I can't tell anymore :(

3) Yes, I just did a global search, and all calls to ctx.persist are awaited.

I'm just guessing, but based on what you wrote, could the error be caused as follows?

Run 1: the actor persists event 177, but the operation handling that message crashes before it completes.

Run 2: on restart, rehydration replays the journal; event 177 throws again, so the actor's in-memory sequence counter never advances to 177, and the next ctx.persist is assigned sequence_nr = 177, colliding with the row that's already in the database.

ncthbrt commented 6 years ago

That is a very plausible scenario @ellis. I think even Martin Fowler mentioned that it happened to him in an event-driven system he worked on.

Do you have any custom supervision policy set up? This line https://github.com/ncthbrt/nact/blob/master/lib/persistence/persistent-actor.js#L81 indicates that if an exception is thrown during rehydration, it's passed into the error handler, which means the rest of the events would not be processed. Strictly speaking this is a bug: rehydration shouldn't behave that way, and it would lead to exactly the kind of colliding keys you saw. I'm not sure what the correct behaviour in this scenario should be, however.

ellis commented 6 years ago

@ncthbrt I don't have any supervision policy set up.

Would it be conceivable to 1) drop the sequence_nr column, 2) change the schema so that ordering auto-increments, and 3) use ordering instead of sequence_nr in the code? In addition to being a breaking change, I can imagine it would be a complicated and tedious change to your design... I just thought I'd throw out the idea because it would guarantee that such conflicts couldn't happen anymore, so at least one wouldn't need to think about how to prevent them, and maybe it'd be a valuable simplification in the long run.

ncthbrt commented 6 years ago

The table structure was copied from Akka, so I'd be a bit hesitant to change it:

Snapshotting relies on offsets to work correctly. You could switch to using ordering instead, but how would that work if you aren't using Postgres? Not all databases support the idea of a global ordering, so for those you'd need some sort of sequence number anyway. That's OK, but it has the downside of moving well-tested code (retrieving the latest snapshot and determining which events to replay based on it) out of the persistence layer, and that logic would then have to be duplicated across every storage implementation.
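
To illustrate what I mean (hypothetical function and method names, not the actual engine interface): the recovery path is essentially "load the latest snapshot, then replay only the events after it", and that "after" is expressed with the per-actor sequence number, which any storage engine can provide:

```js
// Hypothetical sketch of the recovery path; the engine methods shown here
// are illustrative, not the real nact persistence-engine interface.
async function recover(engine, persistenceKey) {
  // A snapshot records the sequence number at which it was taken.
  const snapshot = await engine.latestSnapshot(persistenceKey);
  const fromSequenceNr = snapshot ? snapshot.sequenceNumber + 1 : 1;

  // Only events after the snapshot need to be replayed. This works the
  // same way regardless of which database sits underneath, because it
  // only needs a per-actor sequence number, not a global ordering.
  const events = await engine.events(persistenceKey, fromSequenceNr);
  return { snapshot, events };
}
```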

Another potential issue is that something went wrong in your application code. Would you not rather have it crash than have it continue writing when it's possibly in an inconsistent state?

I agree that nact needs to help the developer defend against the kind of issue you experienced, but the right way to deal with this is to acknowledge that a previously persisted message can "poison" state, and then to provide mechanisms to recover from that specific failure mode. Just merrily continuing isn't a strategy I think would suit everyone (though it may work for some use cases).

ellis commented 6 years ago

Those are some very good points @ncthbrt. I'll give it more thought then.

ncthbrt commented 6 years ago

If you come up with any ideas, please let me know. 🙂

ncthbrt commented 6 years ago

@ellis I'm busy working on a fix which will enable you to specify a supervision policy for persistent actors. If you resume on error while recovering, it'll use the correct sequence numbers.
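
For anyone who hits this later, the shape of the fix is roughly the following; a sketch assuming nact's onCrash supervision option, with illustrative names:

```js
// Sketch only: a persistent actor with a supervision policy that resumes when
// a replayed event throws, so rehydration continues past the bad event and
// the in-memory counter still reaches the journal's last sequence_nr.
const sku = spawnPersistent(
  system,
  async (state = {}, msg, ctx) => {
    // ... normal message handling, with awaited ctx.persist(msg) ...
    return state;
  },
  'skus A36OZVWF9SSJ7P',
  'sku-actor',
  { onCrash: (msg, error, ctx) => ctx.resume }
);
```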

ncthbrt commented 6 years ago

I'm closing this issue as I've released v7.0.2 of nact, which allows for error recovery when rehydrating.

ellis commented 6 years ago

@ncthbrt Cool, good to know.