jappeace / jappeaceApplication

a website
https://jappieklooster.nl
GNU General Public License v3.0

https://jappie.me/the-peculiar-event-sourced-deadlock.html #8

Open utterances-bot opened 1 year ago

utterances-bot commented 1 year ago

The peculiar event sourced deadlock / Jappie

https://jappie.me/the-peculiar-event-sourced-deadlock.html

azaretsky commented 1 year ago

> Postgres’ auto increment sidesteps the transaction, which was quite shocking to me

I think the whole reason for the non-transactional behaviour of sequences in PostgreSQL is to avoid exactly the single-row-locking-on-update scenario that caused the bug. A transaction can obtain the next value immediately, at any point before it commits, but so can every other transaction currently in progress. That value can therefore never be rolled back, because subsequent values may already have been used by other transactions that have since committed.
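A minimal psql sketch of this behaviour (the sequence name event_id_seq is made up, and the returned numbers are illustrative):

```sql
CREATE SEQUENCE event_id_seq;

BEGIN;
SELECT nextval('event_id_seq');  -- returns 1
ROLLBACK;                        -- the transaction is undone...

SELECT nextval('event_id_seq');  -- ...but returns 2: the value 1 was consumed for good
```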

azaretsky commented 1 year ago

> The graphs showed high CPU load while processing packs, so maybe the rest of the system was being deprioritized somehow.

I would go so far as to say that a single long-running CPU-hungry job never gets in the way of all the other tasks running concurrently: the CPU is quite easily shareable, and kernel schedulers are pretty good at distributing the load uniformly. So it's very fortunate you didn't waste your time investigating this possibility :)

The arithmetic is very simple. Suppose we normally process lots of small requests, each using 100% CPU but taking only 0.25 s on average. Now let's start a batch-processing job that uses 100% CPU and runs for 2 h. For simplicity, assume that the small requests arrive sequentially, so at any time we have at most two tasks running in parallel. If a request arrives while the heavy batch job is executing, each of them gets 50% of the CPU, i.e. runs at half speed. So processing a small request takes 0.5 s instead of 0.25 s, but it certainly won't be delayed for two hours.
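Put as a formula (a sketch assuming an ideally fair scheduler and a single core): a task needing $t_{\text{cpu}}$ seconds of CPU while sharing the core with $k - 1$ other CPU-bound tasks takes roughly

$$t_{\text{wall}} \approx k \cdot t_{\text{cpu}} = 2 \times 0.25\,\text{s} = 0.5\,\text{s}.$$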

High CPU load may still affect real-time processing tasks, e.g. cause noticeable jitter in live audio streaming. But these problems can quite often be overcome with nice/renice (and ionice -c3, my favourite for making long-running disk-heavy tasks behave).

azaretsky commented 1 year ago

Also, thanks for a nice write-up on event sourcing :)

I was wondering why you use a separate table for event applications instead of adding two nullable attributes to the event table (let's call them application_id for event_applied.id and applied for event_applied.created). But then I realised that if one needs to replay the history, it is much faster to simply TRUNCATE event_applied than to run UPDATE event SET application_id = NULL, applied = NULL on the whole table, which would also lock every row of what is probably the busiest and largest table in the system. A sketch of the two options follows.
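A rough SQL sketch of the two options (the table and column names follow the hypothetical naming above, and an event table with a bigint primary key id is assumed; the post's actual schema may differ):

```sql
-- Option A: a separate application-tracking table.
-- Replaying history only needs: TRUNCATE event_applied;
CREATE TABLE event_applied (
    id       bigserial PRIMARY KEY,
    event_id bigint NOT NULL REFERENCES event (id),
    created  timestamptz NOT NULL DEFAULT now()
);

-- Option B: nullable attributes on the event table itself.
-- Replaying history would need a full-table, row-locking update:
--   UPDATE event SET application_id = NULL, applied = NULL;
ALTER TABLE event
    ADD COLUMN application_id bigint,
    ADD COLUMN applied timestamptz;
```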

jappeace commented 1 year ago

Another reason for creating a separate table is that we shouldn't modify our events unless there is no other option. The idea is that you track all events and let them live in your database forever, unmodified. Sometimes you do have to modify them because of bugs, for example if you forgot to record an id, so reprojection becomes impossible because relationships break. Modification is then necessary, which is fine, but it shouldn't be routine. Creating a separate table that tracks these applications keeps such modifications from becoming routine.