Closed: rohit-nayak-ps closed this issue 4 months ago
Hi @rohit-nayak-ps, this feature looks interesting to me and I would like to work on it as an LFX mentee. Can you give me a brief idea of the prerequisites to get started with this issue?
@TheRealSibasishBehera, good to hear that you are interested. I have added initial notes about the prerequisites at the head of this issue description, as well as links to the mentee application procedures. Let us know if you need more information or clarifications.
That'll be a great addition to Vitess. Going through the description, it looks like it matches my skills. I'll apply for it.
Hello @rohit-nayak-ps, I know the basics of Go and am currently going through the resources you added above to get familiar with the project. Very excited to contribute to it as a Linux Foundation mentee for the upcoming spring term.
@rohit-nayak-ps
Since this seems to be a project needing cloud resources, how will a mentee run the tests during development? What are the pre-requisites for learning how to use the platform?
Will we also have to develop a UI for configuring the benchmark, like setting the number of shards, number of streams, etc.?
- Since this seems to be a project needing cloud resources, how will a mentee run the tests during development? What are the pre-requisites for learning how to use the platform?
Good question. We will start by building a local adapter so that we can run the different Vitess components in Docker on your local machine. We can run it with a small amount of data so that we don't need a lot of local computing power. Once we have a working local setup, we will provide cloud resources.
Pre-requisites are mentioned at the top of this issue. If you have any specific questions, feel free to ask.
- Will we also have to develop a UI for configuring the benchmark, like setting the number of shards, number of streams, etc.?
There is no plan for a configuration UI. However, we do have plans for a UI to view the results of benchmark runs. That is not in the scope of the initial LFX project, though of course people are welcome to work on it as well if they have the time.
Hey @rohit-nayak-ps, I am interested in this LFX mentorship for the spring term. I have the skill set needed to implement the goals mentioned in the description.
Excited to be a part of this, as it would be my first mentorship program.
We have decided not to pursue this at the moment since it will take significant resources to build and maintain.
Feature Description
VReplication is a core component in Vitess. Production Vitess clusters regularly depend on workflows like Resharding, MoveTables and Materialize, as well as the VStream API. This puts VReplication on the critical path. While we have good unit and e2e test coverage, we do not measure performance. Also, some failures are not easy to reproduce in local tests: reparenting operations, transient network and database failures, connection and memory leaks, etc.
We propose creating a framework that allows defining test cases for different VReplication workflows, runs them at partial scale, validates the results, and potentially stores benchmark output.
In the rest of this document we outline specific goals, challenges that need to be addressed, and a proposed implementation architecture.
Practical Aspects
The tests will be fairly expensive in terms of CPU time and number of instances. Hence we will not run them on demand (like arewefastyet does, for example). It is likely that we will, at least initially, run on private infrastructure (until and unless we get free infrastructure from the CNCF or another source). Tests will be run periodically, say every week, to catch performance and functionality regressions. They can also be run on specific PRs that are expected to improve or impact performance.
Specific Goals
Testing
We will run long-running workflows (~hours) on different cluster configurations, with intermittent reparents and simulated common failures, on non-trivial data sizes and different table schemas. These are not intended to be comprehensive functionality tests but smoke tests for curated cluster and data configurations and specific workflows. The aim is to catch and surface existing bugs and regressions.
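To make these scenarios repeatable, the failure injection could be expressed as a simple schedule that the test driver replays against the cluster. The sketch below is illustrative only: the type names, fault kinds and timings are assumptions, not a finalized design.

```go
package vrepbench

import "time"

// FaultKind enumerates the disruptions a long-running test could inject;
// the list below is illustrative, drawn from the failures mentioned above.
type FaultKind string

const (
	PlannedReparent  FaultKind = "planned_reparent"
	NetworkPartition FaultKind = "network_partition"
	MySQLRestart     FaultKind = "mysql_restart"
)

// Fault schedules one disruption at a fixed offset into the run.
type Fault struct {
	After  time.Duration // offset from the start of the workflow
	Kind   FaultKind
	Target string // e.g. the tablet alias or cell the fault is applied to
}

// exampleSchedule is a hypothetical plan for a multi-hour MoveTables smoke
// test: reparent during the copy phase, then briefly partition the network
// while the workflow is in the running (replication) phase.
var exampleSchedule = []Fault{
	{After: 30 * time.Minute, Kind: PlannedReparent, Target: "zone1-101"},
	{After: 90 * time.Minute, Kind: NetworkPartition, Target: "zone1"},
}
```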
Benchmarks
For some of the test configurations we will publish performance results (such as rows per second, GiB per second, CPU and memory usage, etc.). These will act as reference benchmarks for the community to get an idea of the approximate sizing required for Vitess clusters and of how long workflows will take to run.
Note that this will only be an indication: actual performance is highly dependent on the nature of the data, network configuration, underlying hardware, etc.
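As a rough illustration of how the headline numbers could be derived, the sketch below computes rows per second and GiB per second from raw counters collected during a run. The type and field names are hypothetical; the real values would come from VReplication's own stats.

```go
package vrepbench

import "time"

// Measures is a hypothetical shape for the raw counters a benchmark run
// would collect before deriving the published reference numbers.
type Measures struct {
	RowsCopied     int64
	BytesCopied    int64
	WallClock      time.Duration
	PeakCPUPercent float64
	PeakRSSBytes   int64
}

// RowsPerSecond derives the headline row throughput for the run.
func (m Measures) RowsPerSecond() float64 {
	if m.WallClock <= 0 {
		return 0
	}
	return float64(m.RowsCopied) / m.WallClock.Seconds()
}

// GiBPerSecond derives the headline byte throughput for the run.
func (m Measures) GiBPerSecond() float64 {
	if m.WallClock <= 0 {
		return 0
	}
	return float64(m.BytesCopied) / (1 << 30) / m.WallClock.Seconds()
}
```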
Non-goals
This framework is NOT intended to replace unit and e2e tests in Vitess. In particular, these tests will NOT run for every PR or push.
Implementation
Workflow Configurations
Approach
Benchmark Measures
Each benchmark run should also attach the full configuration for the test, including the schema, and all VReplication-related metrics.
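One possible shape for a stored run record, assuming the raw metrics are scraped from the tablets and kept alongside the exact configuration that produced them (all names below are illustrative, not a finalized schema):

```go
package vrepbench

import "time"

// Run is an illustrative record for one benchmark execution. The results are
// never stored without the configuration that produced them, so every
// published number can be traced back to its exact setup.
type Run struct {
	StartedAt    time.Time
	VitessCommit string             // git SHA the tested binaries were built from
	ConfigHCL    string             // full DSL configuration used for this run
	SchemaDDL    string             // CREATE TABLE statements of the tables under test
	VReplMetrics map[string]int64   // raw vreplication counters scraped from the tablets
	Derived      map[string]float64 // headline numbers, e.g. "rows_per_second"
}
```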
Proposed Benchmark Configs
Implementation Artifacts
- Initial data file for the huge/large table. We can base this on TPC-C datasets.
- Data populator for generating streaming data.
- The DSL specification. The current thinking is to do this in HCL, since it is highly customizable and well maintained (see the sketch after this list).
- DSL parser.
- Driver that runs tests based on the DSL configurations.
- Backend adapter: first a Docker adapter for local development, followed by an adapter for AWS EC2.
- Result storage backends: YML / PlanetScale.
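To make the DSL and adapter items above more concrete, here is a minimal sketch of how the driver might bind an HCL test definition to Go structs using github.com/hashicorp/hcl/v2/hclsimple and hand the result to a backend adapter. The block and attribute names, and the adapter interface, are assumptions for illustration, not the finalized design.

```go
package vrepbench

import (
	"context"
	"fmt"

	"github.com/hashicorp/hcl/v2/hclsimple"
)

// Workflow describes one VReplication workflow to exercise; the labels and
// attributes here are a hypothetical DSL schema, not a finalized one.
type Workflow struct {
	Name         string   `hcl:"name,label"`
	Type         string   `hcl:"type"` // e.g. "MoveTables", "Reshard", "Materialize"
	SourceShards int      `hcl:"source_shards"`
	TargetShards int      `hcl:"target_shards"`
	Streams      int      `hcl:"streams,optional"`
	Tables       []string `hcl:"tables,optional"`
}

// BenchmarkConfig is the top-level test definition the driver consumes.
type BenchmarkConfig struct {
	Dataset   string     `hcl:"dataset"`  // e.g. path to the TPC-C-derived seed data
	Duration  string     `hcl:"duration"` // e.g. "2h"
	Workflows []Workflow `hcl:"workflow,block"`
}

// BackendAdapter abstracts where the test cluster runs: a local Docker
// adapter first, an AWS EC2 adapter later.
type BackendAdapter interface {
	Provision(ctx context.Context, cfg *BenchmarkConfig) error
	Teardown(ctx context.Context) error
}

// LoadConfig parses an HCL test definition into the structs above.
func LoadConfig(path string) (*BenchmarkConfig, error) {
	var cfg BenchmarkConfig
	if err := hclsimple.DecodeFile(path, nil, &cfg); err != nil {
		return nil, fmt.Errorf("parsing benchmark DSL %s: %w", path, err)
	}
	return &cfg, nil
}
```

A matching test definition would then be an .hcl file with a top-level `dataset` and `duration` plus one `workflow "name" { ... }` block per workflow under test, though the exact shape of the DSL remains to be specified.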
[ ] #13009
[ ] #13011