Open derekperkins opened 1 week ago
This is a really good idea. I agree that there’s significant untapped potential for using VStream, likely due to its non-trivial barrier to entry. This implementation could also serve as a reference point for others to fork and adapt to their specific needs. I’m excited to review it once you’re done.
One concern I have is about where we locate this: if it becomes part of a core Vitess package rather than being treated as an example or contribution, the responsibility for maintenance, enhancements, and bug fixes will fall on the maintainers. We’re already stretched supporting the operator, Java packages etc. From previous experience, early adopters are likely to encounter edge cases, bugs, or have requirements that VStream supports but may not yet be implemented in this client.
I am also not sure about storing state in _vt
. Firstly we may need those to be "multi-tenant" since there might be multiple vstream client instances in a cluster. The state schema requirements might vary from use-case to use-case. The data won't get automatically sharded in case of a Reshard
. etc ...
Let’s discuss both the concept, implementation details and the best place for it within the project in next week’s monthly meeting.
This implementation could also serve as a reference point for others to fork and adapt to their specific needs
That's exactly what I'm hoping for. For example, using this to start shipping data to BigQuery using their CDC feature is probably less than 100 lines of code.
if it becomes part of a core Vitess package rather than being treated as an example or contribution, the responsibility for maintenance, enhancements, and bug fixes will fall on the maintainers
Agreed, I really couldn't find a great place for it. I plan on using this extensively internally, so I can support it. The key to me, wherever it lands, is to be very clear about the goals and non-goals. It isn't supposed to be the end-all, most feature-rich, only way to consume vstream. I think it can generically handle ~80% of cases, since all it does is copy from vstream into Go structs, and what happens with that data is handled in the callback functions. For more specific / performance-critical cases, the expectation would be to just use this as a starting point, or if implementing in another language, make it easier to copy the flows.
I am also not sure about storing state in
_vt
Also agreed, I don't think it should be stored there. The current design has the user provide a keyspace/table where the state gets placed. For this to be a good reference framework for other implementations, it shouldn't have any access to non-public Vitess code/state.
The state schema requirements might vary from use-case to use-case
Can you talk a little more about this? I'm hoping that can be managed by this framework, and the schema is pretty simple, given that I'm storing the vgtid as a json blob. This is one of my largest outstanding questions, which is why the state functions aren't fully fleshed out. My naive expectation is that for most users, there would be a single table in an unsharded keyspace, with a single row holding the full state for each stream.
VStream: the hidden superpower
We all use VStream primitives via their targeted implementations with VReplication and friends, but there is limited usage of VStream natively. At the other end of the spectrum, there are very specific implementations using VStream:
I think there are so many great use cases for VStream, but it is intimidating, with a lot of edge cases that need to be handled correctly. The best example we have today is in the local example, that doesn't show much more than how to get it started and print the events. Obviously all the primitives exist to translate the events into usable Go data structures, but those aren't well documented. If it were easier to get started, I believe VStream usage would increase significantly.
Use Cases
There are lots of things that can be done by consuming the vstream, but the most common use case is syncing changes to a data warehouse. Today, that is complicated by:
Solution
Teams already have Vitess expertise, so lean into that. I propose adding a reference implementation of a client framework for consuming VStream.
Goals
Non-Goals
One way to think about this is like an pluggable
Materialize
stream. In fact, that was one route I had considered for connecting to a data warehouse - to materialize rows into an unsharded MySQL instance, then expose that replication stream to one of the aforementioned CDC consumers. That still didn't solve the issue of added cost and overhead, while also requiring extra MySQL infrastructure to write rows. Instead ofMaterialize
writing rows to the database, the idea here is to use a VStream framework to convert binlogs into the same Go structs that represent the db rows, but that can be programmatically exported anywhere.Open Questions
_vt
Implementation
I have already built a working implementation. I started just fleshing out the local vstream example, but it quickly ballooned in complexity past what should be expected in an example. Usage is as shown in the tests. Feedback very appreciated.
Related Resources
cc @mattlord