RFC: VStream client reference implementation

derekperkins commented 1 week ago

VStream: the hidden superpower

We all use VStream primitives via their targeted implementations with VReplication and friends, but there is limited usage of VStream natively. At the other end of the spectrum, there are very specific implementations using VStream:

I think there are many great use cases for VStream, but it is intimidating, with a lot of edge cases that need to be handled correctly. The best example we have today is the local example, which doesn't show much more than how to start a stream and print the events. All the primitives exist to translate the events into usable Go data structures, but they aren't well documented. If it were easier to get started, I believe VStream usage would increase significantly.

Use Cases

There are lots of things that can be done by consuming the vstream, but the most common use case is syncing changes to a data warehouse. Today, that is complicated by:

1. MySQL replication is widely supported as a data sync mechanism, but Vitess makes this generally infeasible (see below)
2. Even if we did manage to convert a VStream into a standard MySQL replication stream for compatibility with the tools mentioned in item 1, those still often require additional infra, like Kafka, Flink, etc.
3. Debezium is great, but it is a very heavy dependency that requires Kafka or a similar message queue, which significantly increases cost and complexity. It's also written in Java, which may not be as accessible to Vitess users more familiar with Go

Solution

Teams already have Vitess expertise, so lean into that. I propose adding a reference implementation of a client framework for consuming VStream.

Goals

Provide a simple, production-ready, semi-opinionated way to consume vstream without needing deep expertise
Translate vstream events back into Go structs
Be a starting point for anyone interested in implementing vstream, like a custom data warehouse connector
Have pervasive comments / documentation about edge cases, gotchas, and good patterns

Non-Goals

Require any special server or access to Vitess internals
Be necessary to use vstream
Target any specific data warehouse or use case

One way to think about this is as a pluggable Materialize stream. In fact, that was one route I had considered for connecting to a data warehouse - to materialize rows into an unsharded MySQL instance, then expose that replication stream to one of the aforementioned CDC consumers. That still didn't solve the issue of added cost and overhead, and it also required extra MySQL infrastructure to write rows. Instead of Materialize writing rows to the database, the idea here is to use a VStream framework to convert binlogs into the same Go structs that represent the db rows, which can then be programmatically exported anywhere.
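As a rough illustration of what "converting rows into Go structs" could look like, here is a minimal, stdlib-only sketch. It is not the code from the PR: the `vstream` struct tag, the `Customer` type, and the `scanRow` helper are all hypothetical, and column values are assumed to arrive as plain strings.

```go
package main

import (
	"fmt"
	"reflect"
	"strconv"
)

// Customer is a hypothetical struct mirroring one table's rows.
// The `vstream` tag name is an assumption made for this sketch.
type Customer struct {
	ID    int64  `vstream:"customer_id"`
	Email string `vstream:"email"`
}

// scanRow copies column values into the struct that dst points to,
// matching columns to fields by the `vstream` struct tag.
func scanRow(columns, values []string, dst any) error {
	if len(columns) != len(values) {
		return fmt.Errorf("got %d columns but %d values", len(columns), len(values))
	}
	v := reflect.ValueOf(dst)
	if v.Kind() != reflect.Pointer || v.Elem().Kind() != reflect.Struct {
		return fmt.Errorf("dst must be a pointer to a struct")
	}
	v = v.Elem()
	t := v.Type()

	// Index the struct fields by tag once per call.
	byTag := make(map[string]int, t.NumField())
	for i := 0; i < t.NumField(); i++ {
		if tag := t.Field(i).Tag.Get("vstream"); tag != "" {
			byTag[tag] = i
		}
	}

	for i, col := range columns {
		idx, ok := byTag[col]
		if !ok {
			continue // ignore columns the struct does not ask for
		}
		f := v.Field(idx)
		switch f.Kind() {
		case reflect.String:
			f.SetString(values[i])
		case reflect.Int64:
			n, err := strconv.ParseInt(values[i], 10, 64)
			if err != nil {
				return fmt.Errorf("column %q: %w", col, err)
			}
			f.SetInt(n)
		default:
			return fmt.Errorf("column %q: unsupported kind %s", col, f.Kind())
		}
	}
	return nil
}

func main() {
	var c Customer
	err := scanRow(
		[]string{"customer_id", "email"},
		[]string{"42", "a@example.com"},
		&c,
	)
	fmt.Println(c, err) // prints "{42 a@example.com} <nil>"
}
```

A real implementation would also need to handle NULLs, the full range of MySQL column types, and schema changes mid-stream, none of which this sketch attempts.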

Open Questions

Where should this live?
- The current examples directory feels too limiting
- https://github.com/vitessio/contrib feels abandoned, and is bad for discovery
- https://github.com/vitessio/vitess/tree/main/go/vt/vstreamclient seems the best to me, but also isn't quite the same as the other packages
How should state be managed? I think state should be stored in Vitess itself, using a user-supplied keyspace / table. Since this isn't supposed to be privileged, it would live with their tables, not in _vt

Implementation

I have already built a working implementation. I started by just fleshing out the local vstream example, but it quickly ballooned in complexity past what should be expected of an example. Usage is as shown in the tests. Feedback is very much appreciated.

https://github.com/vitessio/vitess/pull/17222

Related Resources

cc @mattlord

rohit-nayak-ps commented 4 days ago

This is a really good idea. I agree that there’s significant untapped potential for using VStream, likely due to its non-trivial barrier to entry. This implementation could also serve as a reference point for others to fork and adapt to their specific needs. I’m excited to review it once you’re done.

One concern I have is about where we locate this: if it becomes part of a core Vitess package rather than being treated as an example or contribution, the responsibility for maintenance, enhancements, and bug fixes will fall on the maintainers. We’re already stretched supporting the operator, Java packages etc. From previous experience, early adopters are likely to encounter edge cases, bugs, or have requirements that VStream supports but may not yet be implemented in this client.

I am also not sure about storing state in _vt. First, we may need that state to be "multi-tenant", since there might be multiple vstream client instances in a cluster. The state schema requirements might vary from use-case to use-case. The data also won't get automatically sharded in case of a Reshard, etc.

Let’s discuss the concept, the implementation details, and the best place for it within the project in next week’s monthly meeting.

derekperkins commented 3 days ago

This implementation could also serve as a reference point for others to fork and adapt to their specific needs

That's exactly what I'm hoping for. For example, using this to start shipping data to BigQuery using their CDC feature is probably less than 100 lines of code.

if it becomes part of a core Vitess package rather than being treated as an example or contribution, the responsibility for maintenance, enhancements, and bug fixes will fall on the maintainers

Agreed, I really couldn't find a great place for it. I plan on using this extensively internally, so I can support it. The key to me, wherever it lands, is to be very clear about the goals and non-goals. It isn't supposed to be the end-all, most feature-rich, only way to consume vstream. I think it can generically handle ~80% of cases, since all it does is copy from vstream into Go structs, and what happens with that data is handled in the callback functions. For more specific / performance-critical cases, the expectation would be to just use this as a starting point, or if implementing in another language, make it easier to copy the flows.

I am also not sure about storing state in _vt

Also agreed, I don't think it should be stored there. The current design has the user provide a keyspace/table where the state gets placed. For this to be a good reference framework for other implementations, it shouldn't have any access to non-public Vitess code/state.

https://github.com/vitessio/vitess/pull/17222/commits/829c980fbfed28f0d7141efcfdfb657ad85e6719#diff-46114495962c95fc3ac7f3fe35f92269700a681c162b09ab337c0838dbc144d8R50-R66

The state schema requirements might vary from use-case to use-case

Can you talk a little more about this? I'm hoping that can be managed by this framework, and the schema is pretty simple, given that I'm storing the vgtid as a JSON blob. This is one of my largest outstanding questions, which is why the state functions aren't fully fleshed out. My naive expectation is that most users would have a single table in an unsharded keyspace, with one row per stream holding that stream's full state.

https://github.com/vitessio/vitess/pull/17222/commits/829c980fbfed28f0d7141efcfdfb657ad85e6719#diff-a56f649048575b76ecb941f1220c142841375000dba82914745c9549e2b0c82fR324-R337
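For readers who want a picture of the "one row per stream, position stored as a JSON blob" idea, here is a minimal sketch. The `ShardPosition` and `StreamState` types are hypothetical stand-ins loosely modeled on the per-shard positions a vgtid carries, and the `vstream_state` table and its column names are assumptions, not the schema used in the PR.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// ShardPosition is a hypothetical stand-in for one shard's position.
type ShardPosition struct {
	Keyspace string `json:"keyspace"`
	Shard    string `json:"shard"`
	GTID     string `json:"gtid"`
}

// StreamState is the full state for one named stream. The name is the
// row key, so it is kept out of the JSON blob.
type StreamState struct {
	Name      string          `json:"-"`
	Positions []ShardPosition `json:"positions"`
}

// upsertSQL targets a hypothetical user-supplied state table with one
// row per stream.
const upsertSQL = `INSERT INTO vstream_state (name, latest_vgtid)
VALUES (?, ?)
ON DUPLICATE KEY UPDATE latest_vgtid = VALUES(latest_vgtid)`

// encodeState serializes the positions into the blob stored in the row.
func encodeState(s StreamState) (string, error) {
	b, err := json.Marshal(s)
	if err != nil {
		return "", fmt.Errorf("encoding state for %q: %w", s.Name, err)
	}
	return string(b), nil
}

// decodeState rebuilds the state from a stored blob when resuming.
func decodeState(name, blob string) (StreamState, error) {
	s := StreamState{Name: name}
	if err := json.Unmarshal([]byte(blob), &s); err != nil {
		return StreamState{}, fmt.Errorf("decoding state for %q: %w", name, err)
	}
	return s, nil
}

func main() {
	state := StreamState{
		Name: "warehouse-sync",
		Positions: []ShardPosition{
			{Keyspace: "customer", Shard: "-80", GTID: "pos-a"},
			{Keyspace: "customer", Shard: "80-", GTID: "pos-b"},
		},
	}
	blob, err := encodeState(state)
	if err != nil {
		panic(err)
	}
	// The blob would be bound as the second parameter of upsertSQL.
	fmt.Println(blob)

	restored, err := decodeState(state.Name, blob)
	fmt.Println(len(restored.Positions), err)
}
```

Because the blob holds every shard's position, a single row can describe a stream over a sharded keyspace, which is what would keep the table itself simple enough to live in an unsharded keyspace.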
