sequinstream / sequin

Postgres CDC to streams and queues like Kafka, SQS, HTTP endpoints, and more
https://sequinstream.com
MIT License
526 stars 16 forks source link

Separate `seq` from `commit_lsn` #458

Open acco opened 1 month ago

acco commented 1 month ago

Relatively high priority, as we're "collapsing" events that happen to the same row inside of a transaction, which is undesirable.

Right now, we have an issue upserting to sequin_events tables: seq is set to the commit_lsn. But commit_lsn + record_pk is not unique — a record can be updated multiple times inside of a commit_lsn (a single transaction).

We store commit_lsn in a few places:

For consumer_records, it's OK to flatten multiple "touches" of a row into one record (mostly — if the last thing that happens is a delete, we actually want the delete to win).

For wal_events and sequin_events, instead of just a commit_lsn, what we really want is {commit_lsn, commit_index} — i.e., a combination of both the commit_lsn as well as the index of the specific event inside the commit. To get that, we'd simply have Replication count events received after a transaction is started, tagging each WAL event with its index. The order is (commit_lsn, index) asc

It might be too complicated to do sort by everywhere by commit_lsn, index when pulling from sequin_events? So we could keep seq, turning it into an auto-incrementing bigserial.

Migration

In terms of migration path for sequin_events, we'd:

  1. Have the WAL Pipeline Server migrate destination schema on boot if needed. We'd add commit_lsn (and perhaps commit_index if desired) as null: true to the table (we'll keep null: true indefinitely)
  2. Convert seq to bigserial, have the sequence start at whatever max(seq) currently is.
  3. For new sequin_events table, instruct to add commit_lsn null: false

Update on schemas

RTLS commented 1 month ago

Overall looks good.

Have the WAL Pipeline Server migrate destination schema on boot if needed. We'd add commit_lsn (and perhaps commit_index if desired) as null: true to the table (we'll keep null: true indefinitely)

Why do sequin_events tables need either of commit_lsn or commit_index columns? Can we not just use a bigserial seq column? The WAL Pipeline Server, as a single threaded writer, will ensure that seq is properly ordered for reads.

Basically I think we just have to:

  1. add commit_index to wal_events and write this as the index of messages in each transaction from Replication
  2. order by commit_lsn, commit_index when shuffling events from wal_events to sequin_events tables
RTLS commented 1 month ago

We need commit_lsn and commit_index from replication slot -> wal events -> sequin events for de-duplication in case we replay events from the replication slot