noctarius / timescaledb-event-streamer

timescaledb-event-streamer is a command line program to create a stream of CDC (Change Data Capture) TimescaleDB Hypertable events from a PostgreSQL installation running the TimescaleDB extension.
Apache License 2.0

Do I need to write a plugin to pull changes to another database? #148

Closed thalesmaoa closed 8 months ago

thalesmaoa commented 9 months ago

Hi @noctarius,

Firstly, congratulations on your outstanding work. I recently came here on @jonatasdp's recommendation in the Timescale Community, and I've been diving into the documentation and code you provided. While it's been enlightening, I'm encountering some challenges in grasping certain aspects.

By default, replication seems to work with PostgreSQL. However, my requirement is to clone a database across different locations, and I can't use backup tools, only online streaming. Complicating matters, the standby cloud server cannot connect to the primary due to various internet provider security restrictions; the connection needs to originate from the primary towards the cloud.

Being relatively new to PostgreSQL, I'm navigating through unfamiliar territory, so please bear with me. I've observed that the available sinks don't seem to support updating an external database. Is my understanding correct? If so, does this mean I'll have to develop a new plugin to achieve this functionality?

noctarius commented 9 months ago

Hey @thalesmaoa! Sorry for the late answer and thank you very much for the kind words 😊

Writing to a database isn't directly implemented (though it'd be totally possible to write a plugin for that). The reason is that the queuing would be missing: there are situations where the target database isn't available and you'd have to queue up messages. That's why a queue (like Kafka or NATS JetStream) should sit in between.

A potential way around that issue could be to kill the process if the target database becomes unavailable, but that could have other side effects, and I'm not sure I'd be a fan of that.

The general issue is that you need to make sure that your queries are executed in order, otherwise you might run into consistency issues between source and target.
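
For illustration, here's a minimal sketch (not part of timescaledb-event-streamer) of what such an in-between consumer could look like in Go: a single durable JetStream pull consumer that only acknowledges an event after it has been applied to the target database, so events simply queue up in the stream whenever the target is unavailable. The subject, durable name, and connection strings are placeholders.

```go
package main

import (
	"database/sql"
	"log"

	_ "github.com/lib/pq" // PostgreSQL driver for the target database
	"github.com/nats-io/nats.go"
)

func main() {
	// Placeholder DSN for the cloud-side target database.
	db, err := sql.Open("postgres", "postgres://user:pass@cloud-host/target?sslmode=require")
	if err != nil {
		log.Fatal(err)
	}

	nc, err := nats.Connect("nats://localhost:4222")
	if err != nil {
		log.Fatal(err)
	}
	js, err := nc.JetStream()
	if err != nil {
		log.Fatal(err)
	}

	// A single durable pull consumer processes the stream strictly in
	// order; "cdc.>" and "cloud-applier" are hypothetical names.
	sub, err := js.PullSubscribe("cdc.>", "cloud-applier")
	if err != nil {
		log.Fatal(err)
	}

	for {
		msgs, err := sub.Fetch(1) // one event at a time keeps ordering trivial
		if err != nil {
			continue // e.g. fetch timeout; just poll again
		}
		for _, msg := range msgs {
			if err := applyEvent(db, msg.Data); err != nil {
				// Target unavailable or statement failed: don't ack, the
				// event stays queued in JetStream and is redelivered.
				log.Println("apply failed, will retry:", err)
				continue
			}
			_ = msg.Ack() // ack only after the write succeeded
		}
	}
}

// applyEvent turns one captured CDC event into a statement against the
// target database (decoding is sketched further down in this thread).
func applyEvent(db *sql.DB, event []byte) error {
	// ... decode the event, build and execute the matching SQL ...
	return nil
}
```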

The way I see your issue, I think there are two possible solutions:

thalesmaoa commented 8 months ago

Hi @noctarius,

Hope this message finds you well. I appreciate your previous guidance, and I've been diving deeper into my studies over the past month. Now, I'm finding things clearer.

I've decided to explore Kafka and to develop a small application to manage incoming messages. After following your suggestion, though, NATS JetStream seems like the ideal choice for my requirements.

However, there's one thing I'm still trying to grasp related to Timescale and chunks.

As per my understanding, when using a pub/sub program with NATS, I'll need to replay the queries on the cloud side. For instance, if a message is published on "sensor.temperature", a subscribed client will read the message contents and issue a new `INSERT INTO sensor_temperature VALUES (msg_time, msg_value)` on the cloud database.
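
To make that concrete, here's a rough sketch of such a subscriber in Go, assuming (hypothetically) that the payload is a flat JSON object with `msg_time`/`msg_value` fields and that the cloud table is called `sensor_temperature`; it would plug into a poll loop like the one sketched earlier in this thread:

```go
package main

import (
	"database/sql"
	"encoding/json"
	"log"
	"time"

	"github.com/nats-io/nats.go"
)

// drainTemperature replays one batch of "sensor.temperature" events as
// plain INSERTs on the cloud database. Field, table, and durable names
// are hypothetical.
func drainTemperature(js nats.JetStreamContext, db *sql.DB) error {
	sub, err := js.PullSubscribe("sensor.temperature", "cloud-writer")
	if err != nil {
		return err
	}
	msgs, err := sub.Fetch(10)
	if err != nil {
		return err
	}
	for _, msg := range msgs {
		var ev struct {
			Time  time.Time `json:"msg_time"`
			Value float64   `json:"msg_value"`
		}
		if err := json.Unmarshal(msg.Data, &ev); err != nil {
			log.Println("skipping malformed event:", err)
			_ = msg.Ack() // don't let a poison message block the stream
			continue
		}
		// From here it's a regular parameterized INSERT; the cloud-side
		// TimescaleDB creates and manages its own chunks.
		if _, err := db.Exec(
			`INSERT INTO sensor_temperature ("time", value) VALUES ($1, $2)`,
			ev.Time, ev.Value); err != nil {
			return err // leave unacked so JetStream redelivers
		}
		if err := msg.Ack(); err != nil {
			return err
		}
	}
	return nil
}
```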

If my understanding is correct, the cloud will manage its own chunks, and essentially, I'm only utilizing the CDC, right?

Thanks again for your help!


noctarius commented 8 months ago

Yes, that's basically how it works. You'd capture the event, create an insert/update/delete statement according to the retrieved event message, and run that against the cloud database.
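
A hedged sketch of that mapping in Go, assuming a Debezium-style envelope with an `op` field (`"c"`/`"u"`/`"d"` for create/update/delete) and `before`/`after` row images; the actual event layout depends on the configured sink format, and the table and column names below are made up:

```go
package main

import (
	"database/sql"
	"encoding/json"
	"fmt"
)

// cdcEvent assumes a Debezium-style envelope; treat the field layout
// as a sketch, not the definitive wire format.
type cdcEvent struct {
	Payload struct {
		Op     string                 `json:"op"` // "c"=create, "u"=update, "d"=delete
		Before map[string]interface{} `json:"before"`
		After  map[string]interface{} `json:"after"`
	} `json:"payload"`
}

// applyEvent replays one captured event against the cloud database,
// simplified here to a hypothetical two-column hypertable. Note that
// JSON decoding yields strings and float64s; PostgreSQL casts them on
// the way in, which is good enough for a sketch.
func applyEvent(db *sql.DB, raw []byte) error {
	var ev cdcEvent
	if err := json.Unmarshal(raw, &ev); err != nil {
		return err
	}
	switch ev.Payload.Op {
	case "c": // INSERT on the source => INSERT on the target
		_, err := db.Exec(
			`INSERT INTO sensor_temperature ("time", value) VALUES ($1, $2)`,
			ev.Payload.After["time"], ev.Payload.After["value"])
		return err
	case "u": // UPDATE: match on the old key, write the new values
		_, err := db.Exec(
			`UPDATE sensor_temperature SET value = $1 WHERE "time" = $2`,
			ev.Payload.After["value"], ev.Payload.Before["time"])
		return err
	case "d": // DELETE: match on the old key
		_, err := db.Exec(
			`DELETE FROM sensor_temperature WHERE "time" = $1`,
			ev.Payload.Before["time"])
		return err
	default:
		return fmt.Errorf("unhandled op %q", ev.Payload.Op)
	}
}
```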

thalesmaoa commented 8 months ago

Thx! That solves it.