vectordotdev / vector

A high-performance observability data pipeline.
https://vector.dev
Mozilla Public License 2.0

CDC tool for databases #20179

Open AbstractiveNord opened 3 months ago

AbstractiveNord commented 3 months ago

A note for the community

Could Vector augment or replace tools like Debezium? Many sources, transforms, and sinks already exist, and it would be nice to have a ready-to-use database CDC source in Vector. A PostgreSQL CDC source would be a good first example, because PostgreSQL is highly popular.

Use Cases

CDC is widely used for data processing, especially in microservice architectures. For example, data from PostgreSQL may need to be full-text indexed with ElasticSearch/OpenSearch/ManticoreSearch/etc., passed into Kafka pipelines, or delivered to a DWH. Vector already supports search engines and MQs as sinks, so Vector-based CDC looks like a good idea.

Attempted Solutions

Debezium

Proposal

No response

References

Version

0.36.1

jszwedko commented 3 months ago

Thanks @AbstractiveNord, that is interesting. Vector is fundamentally a tool for processing observability data so I'm not sure satisfying the use-cases that Debezium seems to be targeting would be in scope though. It seems like it is meant for general event processing?

AbstractiveNord commented 3 months ago

On the one hand, yes, it's a little out of scope. On the other hand, Vector already implements much of what is required, such as sinks and temporary buffers, and it is a battle-tested and highly popular tool, so forking the Vector project seems pointless. Also, if Vector supported CDC, logs could probably be enriched with data taken directly from business events, not just from other logs.

Let's say we have a typical microservice architecture with PostgreSQL, Kafka, and some microservices. An input event is written to a PostgreSQL table and then goes to a Kafka queue. With CDC support, Vector could generate a log record itself, based on the fact that the event successfully moved from PG to Kafka. Even in that example, CDC support in Vector can be useful for observability too.
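To make that example concrete, here is a minimal Python sketch of the idea, not a Vector API. Everything in it is an assumption for illustration: the `ChangeEvent` shape, the `events` table, and a `status` column that the service sets to `published` once the event has reached Kafka.

```python
from dataclasses import dataclass
from typing import Any, Dict, Optional


@dataclass
class ChangeEvent:
    """Hypothetical record-level change, as a CDC source might emit it."""
    table: str
    op: str  # "insert", "update", or "delete"
    before: Optional[Dict[str, Any]]  # row before the change, if any
    after: Optional[Dict[str, Any]]   # row after the change, if any


def to_log_record(change: ChangeEvent) -> Optional[Dict[str, Any]]:
    """Return a log record when a row in the (assumed) `events` table
    transitions to status "published"; otherwise return None."""
    if change.table != "events" or change.op != "update":
        return None
    before = change.before or {}
    after = change.after or {}
    if before.get("status") != "published" and after.get("status") == "published":
        return {
            "message": f"event {after['id']} moved from PostgreSQL to Kafka",
            "event_id": after["id"],
            "source": "cdc",
        }
    return None


change = ChangeEvent(
    table="events",
    op="update",
    before={"id": 42, "status": "pending"},
    after={"id": 42, "status": "published"},
)
record = to_log_record(change)
print(record)
```

Changes to other tables, or updates that don't flip the status, produce no log record, so only the PG-to-Kafka handoff shows up in the observability pipeline.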

AbstractiveNord commented 3 months ago

I see that source as highly similar to the file-based source, just adapted to WAL segments. Feel free to correct me; I may be wrong about that.
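A rough Python sketch of the analogy I have in mind: a reader that remembers how far it has read and only emits records past that point. The `Checkpoint` class and `read_new` function are hypothetical names, and the integer position stands in for a byte offset in a file or a log position in a WAL. This is not how Vector's file source or PostgreSQL's WAL is actually implemented.

```python
from typing import Iterable, Iterator, Tuple

# (position, payload): position stands in for a file offset or a WAL position.
Record = Tuple[int, str]


class Checkpoint:
    """Remembers the last position that was handed downstream."""

    def __init__(self, position: int = 0) -> None:
        self.position = position


def read_new(records: Iterable[Record], checkpoint: Checkpoint) -> Iterator[str]:
    """Yield payloads located after the checkpoint, advancing it as we go."""
    for position, payload in records:
        if position <= checkpoint.position:
            continue  # already delivered on an earlier pass
        yield payload
        # Advance only after the consumer has taken the payload.
        checkpoint.position = position


wal = [
    (1, "insert users id=1"),
    (2, "update users id=1"),
    (3, "delete users id=1"),
]

cp = Checkpoint()
first = list(read_new(wal[:2], cp))  # the log holds two records so far
second = list(read_new(wal, cp))     # a third record was appended later
print(first, second, cp.position)
```

The second pass skips the two records that were already delivered, which is the same resume-from-position behaviour you would want from a log file tailer.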

jszwedko commented 3 months ago

Thanks for the additional thoughts! The file source primarily exists to read logs written by applications from files rather than reading business events for processing, but I can see what you are saying about Vector being mostly fit for this use-case with minor improvements. I'm just wary of Vector's use-cases becoming too broad and its core functionality suffering for it. If we had the ability to have source plugins I think this would be a good candidate for that 🙂

For other readers, since I was unfamiliar, CDC is "change data capture": capturing record level changes in a database.

AbstractiveNord commented 3 months ago

Yes, in general; I'm just not sure that CDC would cause Vector to become too broad. In fact, it would be very useful: it mostly fits observability use cases, as pointed out above; it could bring additional popularity to Vector as a candidate Debezium alternative; it's written in Rust; etc.