Closed: rohit-nayak-ps closed this issue 4 months ago
Hi @rohit-nayak-ps, this feature looks interesting to me and I would like to work on it as an LFX mentee. Can you give me a brief idea of the prerequisites to get started with this issue?
@TheRealSibasishBehera, good to hear that you are interested. I have added initial notes about the prerequisites at the head of this issue description, as well as links to the mentee application procedures. Let us know if you need more information or clarifications.
That'll be a great addition to Vitess. Going through the description, it looks like it matches my skills. I'll apply for it.
Hello @rohit-nayak-ps, I know the basics of Go and am currently going through the resources you added above to get familiar with the project. Very excited to contribute to it as a Linux Foundation mentee for the upcoming spring term.
@rohit-nayak-ps
Since this seems to be a project needing cloud resources, how will a mentee run the tests during development? What are the pre-requisites for learning how to use the platform?
Will we also have to develop a UI for configuring the benchmark, like setting the number of shards, number of streams, etc.?
- Since this seems to be a project needing cloud resources, how will a mentee run the tests during development? What are the pre-requisites for learning how to use the platform?
Good question. We will start by building a local adapter so that we can run the different Vitess components in Docker on your local machine. We can run it with a small amount of data so that we don't need a lot of local computing power. Once we have a working local setup, we will provide cloud resources.
Pre-requisites are mentioned at the top of this issue. If you have any specific questions, feel free to ask.
- Will we also have to develop a UI for configuring the benchmark, like setting the number of shards, number of streams, etc.?
There is no plan for a configuration UI. However, we do have plans for a UI to view the results of benchmark runs. That is not in the scope of the initial LFX project, though of course people are welcome to work on it as well if they have the time.
Hey @rohit-nayak-ps, I am interested in this LFX mentorship for the spring term. I have the skill set needed to implement the goals mentioned in the description.
Excited to be a part of this, as it would be my first mentorship program.
We have decided not to pursue this at the moment since it will take significant resources to build and maintain.
Feature Description
VReplication is a core component in Vitess. Production Vitess clusters regularly depend on workflows like Resharding, MoveTables and Materialize, as well as the VStream API. This puts VReplication on the critical path. While we have good unit and e2e test coverage, we do not measure performance. Also, some failures are not easy to reproduce in local tests: reparenting operations, transient network and database failures, connection and memory leaks, etc.
We propose creating a framework that allows defining test cases for different VReplication workflows, runs them at partial scale, validates the results, and potentially stores benchmark output.
In the rest of this document we outline specific goals, challenges that need to be addressed, and a proposed implementation architecture.
Practical Aspects
The tests will be fairly expensive in terms of CPU time and number of instances. Hence we will not run them on demand (like arewefastyet does, for example). It is likely that we will, at least initially, run on private infrastructure (until and unless we get free infrastructure from the CNCF or another source). Tests will be run periodically, say every week, to catch performance and functionality regressions. They can also be run on specific PRs that are expected to improve or impact performance.
Specific Goals
Testing
We will run long-running workflows (~hours) on different cluster configurations, with intermittent reparents and simulated common failures, on non-trivial data sizes and different table schemas. These are not intended to be comprehensive functionality tests but smoke tests for curated cluster and data configurations and specific workflows. The aim is to catch and surface existing bugs and regressions.
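To make these scenarios repeatable, the failure injection could be expressed as a simple schedule that the test driver replays against the cluster. The sketch below is illustrative only: the type names, fault kinds and timings are assumptions, not a finalized design.

```go
package vrepbench

import "time"

// FaultKind enumerates the disruptions a long-running test could inject;
// the list below is illustrative, drawn from the failures mentioned above.
type FaultKind string

const (
	PlannedReparent  FaultKind = "planned_reparent"
	NetworkPartition FaultKind = "network_partition"
	MySQLRestart     FaultKind = "mysql_restart"
)

// Fault schedules one disruption at a fixed offset into the run.
type Fault struct {
	After  time.Duration // offset from the start of the workflow
	Kind   FaultKind
	Target string // e.g. the tablet alias or cell the fault is applied to
}

// exampleSchedule is a hypothetical plan for a multi-hour MoveTables smoke
// test: reparent during the copy phase, then briefly partition the network
// while the workflow is in the running (replication) phase.
var exampleSchedule = []Fault{
	{After: 30 * time.Minute, Kind: PlannedReparent, Target: "zone1-101"},
	{After: 90 * time.Minute, Kind: NetworkPartition, Target: "zone1"},
}
```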
Benchmarks
For some of the test configurations we will publish performance results (such as rows per second, GiB per second, CPU and memory usage, etc.). These will act as reference benchmarks for the community to get an idea of the approximate sizing required for Vitess clusters and of how long workflows will take to run.
Note that this will only be an indication: actual performance is highly dependent on the nature of the data, network configuration, underlying hardware, etc.
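As a rough illustration of how the headline numbers could be derived, the sketch below computes rows per second and GiB per second from raw counters collected during a run. The type and field names are hypothetical; the real values would come from VReplication's own stats.

```go
package vrepbench

import "time"

// Measures is a hypothetical shape for the raw counters a benchmark run
// would collect before deriving the published reference numbers.
type Measures struct {
	RowsCopied     int64
	BytesCopied    int64
	WallClock      time.Duration
	PeakCPUPercent float64
	PeakRSSBytes   int64
}

// RowsPerSecond derives the headline row throughput for the run.
func (m Measures) RowsPerSecond() float64 {
	if m.WallClock <= 0 {
		return 0
	}
	return float64(m.RowsCopied) / m.WallClock.Seconds()
}

// GiBPerSecond derives the headline byte throughput for the run.
func (m Measures) GiBPerSecond() float64 {
	if m.WallClock <= 0 {
		return 0
	}
	return float64(m.BytesCopied) / (1 << 30) / m.WallClock.Seconds()
}
```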
Non-goals
This framework is NOT intended to replace unit and e2e tests in Vitess. In particular, these tests will NOT run for every PR or push.
Implementation
Workflow Configurations
Approach
Benchmark Measures
Each benchmark run should also attach the full configuration for the test, including the schema, and all VReplication-related metrics.
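One possible shape for a stored run record, assuming the raw metrics are scraped from the tablets and kept alongside the exact configuration that produced them (all names below are illustrative, not a finalized schema):

```go
package vrepbench

import "time"

// Run is an illustrative record for one benchmark execution. The results are
// never stored without the configuration that produced them, so every
// published number can be traced back to its exact setup.
type Run struct {
	StartedAt    time.Time
	VitessCommit string             // git SHA the tested binaries were built from
	ConfigHCL    string             // full DSL configuration used for this run
	SchemaDDL    string             // CREATE TABLE statements of the tables under test
	VReplMetrics map[string]int64   // raw vreplication counters scraped from the tablets
	Derived      map[string]float64 // headline numbers, e.g. "rows_per_second"
}
```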
Proposed Benchmark Configs
Implementation Artifacts
- Initial data file for the huge/large table. We can base this on TPC-C datasets.
- Data populator for generating streaming data.
- The DSL specification. The current thinking is to do this in HCL, since it is highly customizable and well maintained (see the sketch after this list).
- DSL parser.
- Driver that runs tests based on the DSL configurations.
- Backend adapter: first a Docker adapter for local development, followed by an adapter for AWS EC2.
- Result storage backends: YML / PlanetScale.
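To make the DSL and adapter items above more concrete, here is a minimal sketch of how the driver might bind an HCL test definition to Go structs using github.com/hashicorp/hcl/v2/hclsimple and hand the result to a backend adapter. The block and attribute names, and the adapter interface, are assumptions for illustration, not the finalized design.

```go
package vrepbench

import (
	"context"
	"fmt"

	"github.com/hashicorp/hcl/v2/hclsimple"
)

// Workflow describes one VReplication workflow to exercise; the labels and
// attributes here are a hypothetical DSL schema, not a finalized one.
type Workflow struct {
	Name         string   `hcl:"name,label"`
	Type         string   `hcl:"type"` // e.g. "MoveTables", "Reshard", "Materialize"
	SourceShards int      `hcl:"source_shards"`
	TargetShards int      `hcl:"target_shards"`
	Streams      int      `hcl:"streams,optional"`
	Tables       []string `hcl:"tables,optional"`
}

// BenchmarkConfig is the top-level test definition the driver consumes.
type BenchmarkConfig struct {
	Dataset   string     `hcl:"dataset"`  // e.g. path to the TPC-C-derived seed data
	Duration  string     `hcl:"duration"` // e.g. "2h"
	Workflows []Workflow `hcl:"workflow,block"`
}

// BackendAdapter abstracts where the test cluster runs: a local Docker
// adapter first, an AWS EC2 adapter later.
type BackendAdapter interface {
	Provision(ctx context.Context, cfg *BenchmarkConfig) error
	Teardown(ctx context.Context) error
}

// LoadConfig parses an HCL test definition into the structs above.
func LoadConfig(path string) (*BenchmarkConfig, error) {
	var cfg BenchmarkConfig
	if err := hclsimple.DecodeFile(path, nil, &cfg); err != nil {
		return nil, fmt.Errorf("parsing benchmark DSL %s: %w", path, err)
	}
	return &cfg, nil
}
```

A matching test definition would then be an .hcl file with a top-level `dataset` and `duration` plus one `workflow "name" { ... }` block per workflow under test, though the exact shape of the DSL remains to be specified.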
[ ] #13009
[ ] #13011