trinodb / trino

Official repository of Trino, the distributed SQL query engine for big data, formerly known as PrestoSQL (https://trino.io)
Apache License 2.0

Automated stress tests #38

Open kokosing opened 5 years ago

kokosing commented 5 years ago

@martint, @electrum, @dain Can you share your experience with stress testing? What do you think needs to be tested:

Are there any companies that would participate in this so we could get "real" data and queries, or even hardware?

I was thinking about spinning up a cluster (or clusters) on AWS and running the above frequently (like once a week). It would be good to do this from the beginning, because right now we don't have many such tests, so it should be easy to automate. Then we could iteratively improve it.

sopel39 commented 5 years ago

I was thinking about spinning up a cluster (or clusters) on AWS and running the above frequently (like once a week).

Per release should be good enough, as those clusters can get expensive (even with spot instances). Anyway, we should probably have release branches so that the verification process doesn't halt code development for a long time.

kokosing commented 5 years ago

Anyway, we should probably have release branches

That would depend on how much time release verification takes. If it took 2 weeks, then OK; if 2 days, then I would keep it as it is. Release branches have their own cons.

sopel39 commented 5 years ago

Release branches have their own cons.

True, but as the community gets larger it will be harder to enforce code freezes. I guess a GitHub bot could help here.

martint commented 5 years ago

I'd like to expand this to include general improvements to verification and test coverage. There are a few things we could do:

krakov commented 5 years ago

I can share a bit about the testing environment we use to test our connector (at Varada), focusing on things that have the potential to be used standalone.

But first, to your point @martint, I see two ways to go IMHO:

  1. Identify testing code that organizations can adapt / release under the Apache license to Presto, and add it to a regression suite that runs nightly / pre-release, moving the maintenance burden to the Presto maintainers.

  2. Identify tests that organizations run but cannot release (due to the reasons you mentioned), and have those organizations run them to pre-qualify releases.

The second model, as I see it, is not very sustainable for release testing: organizational priorities and the availability to do that pre-qualification can change over time; no visibility into what is done by each organization makes it hard to achieve high standards of coverage; and no reproducibility outside the internal environment makes triaging and solving bugs hard, especially for intermittent bugs or performance degradations. So one way to enable this direction could be to open a new issue that focuses on improving just the automated failure reporting: building a tool that is easy to integrate and allows any external testing framework to report as much as possible on failures in a standardized way.
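To make that last point concrete, here is a minimal sketch of what such a standardized failure report could look like, serialized as JSON so any framework could emit it. The field names, values, and the use of Jackson are my own assumptions for illustration; nothing like this exists in the Presto codebase today.

```java
// Hypothetical sketch of a standardized failure report that any external test
// framework could emit and upload alongside a failed run. All field names and
// values are illustrative placeholders.
import com.fasterxml.jackson.databind.ObjectMapper;

import java.util.LinkedHashMap;
import java.util.Map;

public class FailureReportSketch
{
    public static String buildReport() throws Exception
    {
        Map<String, Object> report = new LinkedHashMap<>();
        report.put("prestoVersion", "308");                   // version under test
        report.put("environment", "aws-r4.4xlarge-x8");       // free-form cluster description
        report.put("testSuite", "tpcds-sf1000-parquet-s3");   // which suite / dataset
        report.put("query", "SELECT count(*) FROM store_sales");
        report.put("failureType", "WRONG_RESULT");            // e.g. WRONG_RESULT, QUERY_FAILURE, PERF_REGRESSION
        report.put("expected", "2879987999");
        report.put("actual", "2879987998");
        report.put("queryId", "20190305_123456_00042_abcde"); // for correlating with server logs

        return new ObjectMapper().writerWithDefaultPrettyPrinter().writeValueAsString(report);
    }

    public static void main(String[] args) throws Exception
    {
        System.out.println(buildReport());
    }
}
```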

krakov commented 5 years ago

Here are some of the things we do in QA that might have the potential to be released, in one way or another, to run as part of Presto (ignoring for now how much effort each would take) in the first model (code that goes into master and runs pre-release).

Interested to hear your takes on what would be useful:

  1. TPC-DS data at scale factors 10, 1000 (1TB) and 10000 (10TB), as Parquet on S3, and we will add more scale factors as we go. Compatible with the presto-benchto TPC-DS queries. We run those queries against S3 through our connector in various cluster configurations, but I believe this can be adapted to test presto-hive as well.

  2. Geospatial data at a few scale factors - we have a dataset that represents ride sharing in San Francisco (about 5B rows covering 25M rides, anonymized data we can share, based on open source data, in Parquet format on S3). The test runs a set of interesting / complex queries on the data - some of them use geofences, some use JOINs, some are more classic BI. Even without the queries, this is a good dataset for testing.

  3. "Matrix" test - a test the randomly generates data and queries, loads the data both to PostgreSQL and to Presto, runs the queries against both and compares results between the databases. To catch correctness errors / random rare bugs. Some of the framework (the random data generation part) is very standalone, some requires some work to decouple. That said, the type of queries we generate might be biased to catch bugs in our code, so not that useful. If someone wants to take on a project of building generic queries to generate we can try to partner. It might also find bugs in Postgres :-)

  4. "Performance" test - a test that uses a flattened TPC-DS table of ~10B rows / ~30 columns and a set of simple queries. Runs every night in same environment(s), and all the hisotric results for each query are collected. Catches various performance degradations - both major ones (today is 50% worse than yesterday) but also gradual trends.

  5. Tableau framework - Tableau has open sourced a testing framework (https://github.com/tableau/connector-plugin-sdk/tree/master/tdvt) under an MIT license, and it can be used to catch BI-related regressions. By itself it relies on using Tableau to generate queries, which is not open source, is not a good fit for automated testing, and for sure can't be added to Apache Presto release testing. Our approach is to decouple the tests from the framework - put the verification data in Parquet/CSV files on S3, and build a JSON file that describes all the SQL queries (~700) the framework generates and the expected result for each one. This is collected by running TDVT and recording its interactions into the JSON. Our internal code uses the JSON as input to run, but I kind of guess this approach can be adapted to other verification frameworks that just run SQL and compare the results. Note that this requires some maintenance - each time Tableau releases a new version of the verification we update the JSON file.
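For the "Matrix" test in item 3, here is a minimal sketch of the differential idea, assuming both engines expose the same generated table over JDBC. The JDBC URLs, table name, and query are hypothetical placeholders, not code from our framework, and the Presto side assumes a configured postgresql catalog.

```java
// Minimal differential-testing sketch: run the same query against PostgreSQL
// and Presto over JDBC and compare the result sets. All connection details and
// the query are illustrative placeholders.
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;
import java.util.ArrayList;
import java.util.List;

public class MatrixTestSketch
{
    public static void main(String[] args) throws Exception
    {
        // In the real framework this query (and the underlying table) would be randomly generated.
        String query = "SELECT col_a, count(*) FROM random_table GROUP BY col_a ORDER BY col_a";

        List<List<Object>> postgresRows = run("jdbc:postgresql://localhost:5432/test?user=test", query);
        List<List<Object>> prestoRows = run("jdbc:presto://localhost:8080/postgresql/public?user=test", query);

        if (!postgresRows.equals(prestoRows)) {
            System.err.println("MISMATCH for query: " + query);
            System.err.println("postgres: " + postgresRows);
            System.err.println("presto:   " + prestoRows);
        }
    }

    private static List<List<Object>> run(String jdbcUrl, String query) throws SQLException
    {
        List<List<Object>> rows = new ArrayList<>();
        try (Connection connection = DriverManager.getConnection(jdbcUrl);
                Statement statement = connection.createStatement();
                ResultSet resultSet = statement.executeQuery(query)) {
            int columns = resultSet.getMetaData().getColumnCount();
            while (resultSet.next()) {
                List<Object> row = new ArrayList<>();
                for (int i = 1; i <= columns; i++) {
                    row.add(resultSet.getObject(i));
                }
                rows.add(row);
            }
        }
        return rows;
    }
}
```

Exact equality is the naive comparison; in practice the results need normalization (numeric types, decimal scale, NULL ordering) before comparing, and that is where most of the real work in such a framework goes.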
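And for item 4, a minimal sketch of how nightly results could be checked for both sudden and gradual regressions, assuming each query's historical wall-clock times are kept in order from oldest to newest. The thresholds and the trend heuristic are illustrative, not the ones we actually use.

```java
// Illustrative regression check over a query's nightly runtime history.
import java.util.List;

public class PerformanceTrendSketch
{
    // historySeconds is ordered oldest-first; todaySeconds is the latest measurement.
    public static boolean isRegression(List<Double> historySeconds, double todaySeconds)
    {
        // Major regression: today is much slower than the median of the recorded history.
        double median = historySeconds.stream()
                .sorted()
                .skip(historySeconds.size() / 2)
                .findFirst()
                .orElseThrow();
        if (todaySeconds > 1.2 * median) {
            return true;
        }

        // Gradual trend: today is noticeably slower than the average of the oldest runs,
        // even if no single night-over-night change was large.
        double oldAverage = historySeconds.subList(0, Math.min(7, historySeconds.size())).stream()
                .mapToDouble(Double::doubleValue)
                .average()
                .orElseThrow();
        return todaySeconds > 1.3 * oldAverage;
    }

    public static void main(String[] args)
    {
        List<Double> history = List.of(10.1, 10.3, 9.9, 10.2, 10.0, 10.4, 10.1);
        System.out.println(isRegression(history, 12.7)); // true: well above the median
        System.out.println(isRegression(history, 10.2)); // false: within normal variation
    }
}
```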