kibae / pg-logical-replication

PostgreSQL Logical Replication client for node.js
https://www.npmjs.com/package/pg-logical-replication
MIT License
95 stars 22 forks source link

Transactions info (time) in unit events ? #19

Open cyrilchapon opened 1 year ago

cyrilchapon commented 1 year ago

Hi,

I'm implementing a CDC use-case and found your library which is pretty cool (thanks for this 🙂).

Reading #17 + some Debezium posts saying they deprecated wal2json usage for pgoutput; I came to the conclusion I'd better go with the later.

I have tested both though, and one thing I liked with wal2json (v2) is the fact we have access to the commit-time of the transaction directly inside unit event (insert, update, ...)

Which is apparently not the case with pgoutput.

So my question is simple :

Is the fact a pgoutput DDL event (insert, update, delete) with PgOutPutPlugin due to

If the later, I'd try to PR this; but could you please give me some clues where and how to do so ?

Thanks 🙂

kibae commented 1 year ago

Hello, @cyrilchapon 😃

Obviously wal2json has the advantage of handling transactions as a block. However, wal2json is very slow or doesn't work well when a large number of records change. In this case, pgoutput is very useful.

When using the pgoutput plugin, the start time of the transaction(commitTime) is delivered. https://github.com/kibae/pg-logical-replication/blob/9b191c5c4e0e9e953537c933fc331ae711e77587/src/output-plugins/pgoutput/pgoutput-parser.ts#L54-L65 I think it could be used while holding this time value as a member variable until commit.

@mastermatt How about this idea?

mastermatt commented 1 year ago

This pushes past my knowledge, notably around the fact that I've only ever needed to work with features available with protocol version 1. The issue I see off the bat, is how to do you correlate other message events to the correct BEGIN event.

@cyrilchapon you said DDL events, but I just to make sure we're on the same page, pgoutput does not include DDL events you'll get DML (insert, update, delete, etc.).

It seems the BEGIN event has a transaction ID, then subsequent DML events can reference that ID. From the docs:

Xid of the transaction (only present for streamed transactions). This field is available since protocol version 2.

As I've stated, I've never worked with streamed transactions in this context and I don't know the gotchas. I also don't understand how the BEGIN event can have a commit timestamp if the transaction is being streamed. One fear I'd have with having this lib owning the cache of the xid is knowing when to clear the cache. If it's streaming transactions, how guaranteed is it that the commit event will eventually be processed? and if not, is that a memory leak?

tl;dr: I don't know. But I'm curious, if anyone figures it out.