discussion: how to acquire a consistent performance testing environment

lmatz commented 1 year ago

Recently, the performance testing pipeline, which runs once a day, has shown that the performance of a query can fluctuate a lot (a 3x difference) from one day to the next.

But when I run the query manually on an EC2 machine that is manually started, the performance is stable across multiple executions.

We first guessed that the requested resources were being allocated on different instance types (with different AWS network bandwidth limits or different CPU types). However, after @huangjw806 and @cyliu0 helped look into the CloudTrail EC2 events, we found that performance can be both good and bad on the same machine.
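One way to check the "good and bad on the same machine" observation is to group run durations by instance and compare the spread within each machine. This is only a minimal sketch, assuming the run records are available as (instance ID, duration) pairs. The instance IDs and durations below are made-up placeholders, not data from our pipeline.

```python
from collections import defaultdict


def spread_by_instance(runs):
    """Return {instance_id: slowest / fastest duration} for each instance.

    `runs` is an iterable of (instance_id, duration_seconds) pairs.
    A ratio near 1.0 means the machine behaved consistently; a ratio
    near 3.0 would match the 3x fluctuation reported by the pipeline.
    """
    durations = defaultdict(list)
    for instance_id, seconds in runs:
        durations[instance_id].append(seconds)
    return {
        instance_id: max(values) / min(values)
        for instance_id, values in durations.items()
    }


# Hypothetical records: placeholder IDs and durations for illustration only.
example_runs = [
    ("i-aaaa", 100.0),
    ("i-aaaa", 300.0),
    ("i-bbbb", 120.0),
    ("i-bbbb", 126.0),
]

for instance_id, ratio in sorted(spread_by_instance(example_runs).items()):
    print(f"{instance_id}: slowest/fastest = {ratio:.2f}")
```

If the ratio is large within a single instance, the instance type alone cannot explain the fluctuation.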

Since the requested resources are 8C 16G and the machine is either a c5.9xlarge (36C 72G) or a c5a.8xlarge (32C 64G), we now suspect that the machine may be running something else concurrently, which makes the performance unstable.

So I wonder whether there is some mechanism to acquire a consistent environment, e.g. by occupying a machine exclusively? Or a way to verify the cause of the inconsistency?
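To verify the "something else is running concurrently" suspicion, one option might be to snapshot the node-wide CPU counters before and after a benchmark run and see how busy the whole machine was. A rough sketch, assuming we can read `/proc/stat` on the Linux node itself. The two sample snapshots are invented values for illustration, and the function names are hypothetical.

```python
def parse_cpu_line(stat_text):
    """Parse the aggregate `cpu` line of /proc/stat into (busy, total) ticks."""
    for line in stat_text.splitlines():
        fields = line.split()
        if fields and fields[0] == "cpu":
            # Columns: user, nice, system, idle, iowait, irq, softirq, steal.
            # guest and guest_nice are skipped: they are already counted
            # inside user and nice.
            ticks = [int(value) for value in fields[1:9]]
            idle = ticks[3] + ticks[4]  # idle + iowait
            total = sum(ticks)
            return total - idle, total
    raise ValueError("no aggregate 'cpu' line found")


def busy_fraction(before_text, after_text):
    """Fraction of node-wide CPU time spent busy between two snapshots."""
    busy_before, total_before = parse_cpu_line(before_text)
    busy_after, total_after = parse_cpu_line(after_text)
    elapsed = total_after - total_before
    if elapsed <= 0:
        raise ValueError("snapshots must be taken in order, some time apart")
    return (busy_after - busy_before) / elapsed


# Invented snapshots; on a real node, read the text from /proc/stat instead.
before = "cpu  1000 0 500 8000 100 0 0 0 0 0\n"
after = "cpu  3000 0 1000 9000 100 0 0 100 0 0\n"

print(f"node busy fraction: {busy_fraction(before, after):.2f}")
```

Assuming the 8C limit is actually enforced, the benchmark alone can account for at most 8/32 = 25% of a 32-vCPU node, so a busy fraction well above that during a slow run would point to other workloads on the same machine.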

cc: @huangjw806 @cyliu0 correct me if I am wrong.

lmatz commented 1 year ago

Since the performance testing pipeline is not much different from an average user (both are looking for stable, reproducible performance), it seems that the mechanism above would be a general feature that can also benefit Cloud users?

cyliu0 commented 1 year ago

> Since the requested resources are 8C 16G and the machine is either a c5.9xlarge (36C 72G) or a c5a.8xlarge (32C 64G), we now suspect that the machine may be running something else concurrently, which makes the performance unstable.

In addition, we need a dedicated, exclusive, and consistent RisingWave cluster for performance testing, which means the cluster should always be deployed on the same type of dedicated EC2 instance. We don't want clusters on different EC2 types, which may have different CPU frequencies, like c5.9xlarge and c5a.8xlarge. Users won't accept this either if they get different performance for the same money. But I'm wondering whether we need to move this discussion to risingwave-cloud? @lmatz

c5: C5 and C5d 12xlarge, 24xlarge, and metal instance sizes feature custom 2nd generation Intel Xeon Scalable Processors (Cascade Lake 8275CL) with a sustained all-core Turbo frequency of 3.6 GHz and a single-core Turbo frequency of up to 3.9 GHz.

c5a: 2nd generation AMD EPYC 7002 series processors (AMD EPYC 7R32) running at frequencies up to 3.3 GHz.

arkbriar commented 1 year ago

@lmatz @cyliu0 These are all good points, but I agree with @cyliu0 that we should move the discussion to the cloud, because the operator's main concern isn't where the deployment is located but how the deployment works. In other words, this is about how to use the operator rather than how to develop it. Through our CRD, the operator already provides the flexibility to choose where to deploy and to set additional resource limits, such as the bandwidth of network devices on AWS, using node selectors, annotations, and other fields. So a stable and reliable benchmark environment becomes an infra/platform problem. Actually, there's a project that aims to reduce cloud costs but has several related work items. Let me open an issue on our Linear and discuss it more there.
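To make the node selector point concrete, here is a rough sketch of the standard Kubernetes scheduling fields one might generate for a benchmark deployment. It only builds generic pod-level and container-level fields as a dict; where exactly they belong in the RisingWave CRD is not shown here. The `dedicated=benchmark` taint and the function name are hypothetical, while `node.kubernetes.io/instance-type` is the well-known Kubernetes node label.

```python
import json


def dedicated_benchmark_constraints(instance_type, cpu, memory):
    """Build generic Kubernetes scheduling fields for a benchmark pod.

    nodeSelector and tolerations are pod-level fields; resources is a
    container-level field. They are grouped in one dict only for display.
    """
    resources = {"cpu": cpu, "memory": memory}
    return {
        # Pin the pod to one instance type so CPU models do not vary.
        "nodeSelector": {"node.kubernetes.io/instance-type": instance_type},
        # Tolerate a taint that keeps other workloads off the benchmark
        # nodes. The key and value are hypothetical; the nodes would have
        # to be tainted accordingly by whoever manages the cluster.
        "tolerations": [
            {
                "key": "dedicated",
                "operator": "Equal",
                "value": "benchmark",
                "effect": "NoSchedule",
            }
        ],
        # Requests equal to limits (for every container in the pod) is
        # what gives the pod the Guaranteed QoS class.
        "resources": {"requests": dict(resources), "limits": dict(resources)},
    }


constraints = dedicated_benchmark_constraints("c5.9xlarge", "8", "16Gi")
print(json.dumps(constraints, indent=2))
```

Note that a node selector plus a toleration only keeps other pods away if the nodes are tainted and nothing else tolerates that taint, which is why this ends up being a platform question rather than an operator one.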

lmatz commented 1 year ago

Thanks, sure, let's discuss in the context of cloud

arkbriar commented 1 year ago

Let's move the discussion here: https://linear.app/risingwave-labs/issue/CLOUD-869/consistent-performance-testing-environment

risingwavelabs / risingwave-operator #348