amosomokpo opened 2 weeks ago
Do we need bare metal Nodes with Intel RDT and AMD QoS support as part of the kube cluster?
The Telegraf RDT plugin requires RDT support.
To your question about bare metal (vs VMs), I'm not 100% sure we need bare metal. Some hypervisors might allow access to RDT if you provision a whole physical CPU in your VM.
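Whichever way we go, the test cluster needs a way to tell which nodes actually expose RDT. As a rough sketch (assuming node-feature-discovery is deployed and that its cpu-rdt.* feature labels are published on capable nodes, both of which should be verified), a smoke-test pod could be pinned to RDT-capable nodes like this:

```yaml
# Sketch only: assumes node-feature-discovery (NFD) is running and labels
# RDT-capable nodes; check the label names against the deployed NFD version.
apiVersion: v1
kind: Pod
metadata:
  name: rdt-smoke-test
spec:
  nodeSelector:
    feature.node.kubernetes.io/cpu-rdt.RDTMON: "true"   # RDT monitoring (CMT/MBM)
  restartPolicy: Never
  containers:
    - name: check-cpuinfo
      image: busybox
      # RDT/QoS support shows up as cqm*/rdt_a flags in /proc/cpuinfo when the
      # hardware feature is visible to the guest OS.
      command: ["sh", "-c", "grep -E 'rdt_a|cqm' /proc/cpuinfo > /dev/null && echo 'RDT flags visible'"]
```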
I'm assuming it's also a matter of cost, i.e. whether relatively cheap bare-metal VMs are available... For benchmarking and testing we can use clouds with fewer services and weaker reliability guarantees.
Also, the test framework might need to be executed in a CI/CD pipeline.
Yes, that would be ideal! The question is whether we can spin up bare-metal VMs per run. I know the GCP console allows spinning up nodes with a timeout. I think it might be worth a search for a GitHub action that supports spinning up VMs for testing.
Seems like we need to get a short list of cloud providers to run the tests on, and an infra to spin those up ad-hoc (and the framework would need to support whatever provider we choose?)
Let's set up GitHub Actions and run the build/test/deploy cycle, assuming the InfluxDB collector is part of the CI. This will let us flesh out the integration test framework, maybe initially with integration tests around the InfluxDB collector, and then add support to the test framework for the memory collector.
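A minimal sketch of that workflow, assuming a Go build and an InfluxDB service container for the integration tests (the file path, toolchain version, image tag, and the INFLUXDB_URL variable are all placeholders, not decisions):

```yaml
# .github/workflows/ci.yaml (sketch). Job names, Go version, and the InfluxDB
# image tag are placeholders to adjust for the repo.
name: build-test
on: [push, pull_request]
jobs:
  integration:
    runs-on: ubuntu-latest
    services:
      influxdb:
        image: influxdb:2.7
        ports:
          - 8086:8086
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-go@v5
        with:
          go-version: "1.22"
      - name: Build
        run: go build ./...
      - name: Integration tests against the InfluxDB service container
        env:
          INFLUXDB_URL: http://localhost:8086   # hypothetical env var consumed by the tests
        run: go test -tags=integration ./...
```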
Agreed! Seeing what Telegraf has should give us a good understanding of what's missing and how to gather that.
Great! I will create a GitHub issue to investigate support for Intel RDT across cloud providers, in both VMs and bare metal. Maybe ask in the Telegraf Slack channel whether they have recommendations or documentation handy on which cloud providers support that plugin today, and extend the list from there. I would add the list I come up with to our wiki.
Depending on the target cloud, we should look into a couple of GitHub Actions: a Terraform action that can both provision and reclaim the test nodes, or more cloud-specific actions (https://github.com/google-github-actions/setup-gcloud); a rough workflow sketch follows below. We also need an eventing/webhook trigger to listen for events and kick off the VM/bare-metal reclamation workflow. See https://docs.github.com/en/actions/writing-workflows/choosing-when-your-workflow-runs/events-that-trigger-workflows
Either way, the CI/CD pipeline deserves its own issue, too. I will create one.
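A rough sketch of the provision/test/reclaim workflow using those actions (the Terraform directory, secret name, and make target are assumptions; a scheduled or webhook trigger could replace workflow_dispatch later):

```yaml
# Sketch of a provision -> test -> reclaim workflow. Paths, secret names, and
# the make target are placeholders.
name: rdt-integration
on:
  workflow_dispatch:   # manual trigger for now; an event/webhook trigger can be added later
jobs:
  provision-test-reclaim:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: google-github-actions/auth@v2
        with:
          credentials_json: ${{ secrets.GCP_CREDENTIALS }}   # placeholder secret name
      - uses: google-github-actions/setup-gcloud@v2
      - uses: hashicorp/setup-terraform@v3
      - name: Provision test nodes
        run: |
          terraform -chdir=infra/test-cluster init
          terraform -chdir=infra/test-cluster apply -auto-approve
      - name: Run integration tests
        run: make integration-test   # placeholder target
      - name: Reclaim nodes even if tests fail
        if: always()
        run: terraform -chdir=infra/test-cluster destroy -auto-approve
```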
We should also create a separate issue to document specific integration test cases that should run once the cluster is up with collectors, starting with the Telegraf collector. Two separate integration cases could cover the different CPU vendors (Intel RDT and AMD QoS). Then, up the stack...
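One way to express the two vendor cases in CI is a job matrix that runs the same test suite against Intel and AMD node pools; the node-pool labels and make target below are illustrative only:

```yaml
# Illustrative matrix: node-pool labels and the make target are assumptions.
jobs:
  vendor-integration:
    strategy:
      matrix:
        include:
          - vendor: intel   # nodes with Intel RDT (resctrl)
            node_selector: cpu-vendor=intel
          - vendor: amd     # nodes with AMD QoS (also surfaced via resctrl)
            node_selector: cpu-vendor=amd
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Run vendor-specific integration tests
        run: make integration-test VENDOR=${{ matrix.vendor }} NODE_SELECTOR=${{ matrix.node_selector }}
```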
Many observability deployments use GCP's microservices-demo as a demonstration workload, and I think that would be a good start.
There are more workloads in the so-called DeathStarBench (paper, website, repo). Prof. @delimitrou also did research into cache and memory-bandwidth noisy neighbors in papers such as iBench and PARTIES.
If we are able to run Microservices-demo to test, I think we'd be off to a good start.
cc @delimitrou if you have an opinion on the best workload to show memory-bandwidth and cache noisy neighbors
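If microservices-demo ends up as the workload under test, deploying it is a single step from the published manifests; the manifest path below is taken from the microservices-demo README and should be verified against the current repo layout:

```yaml
# Steps that could be added to the integration workflow; verify the manifest path.
- name: Deploy microservices-demo as the workload under test
  run: |
    kubectl apply -f https://raw.githubusercontent.com/GoogleCloudPlatform/microservices-demo/main/release/kubernetes-manifests.yaml
    kubectl rollout status deployment/frontend --timeout=5m
```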
A benchmark suite that generates load on the memory subsystem of nodes in a Kubernetes cluster. The suite should create metrics and establish baselines for memory noisy neighbor detection. Then, it should show improvements to noisy neighbor detection using the memory collector.
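As a first building block for that suite, a simple noisy neighbor could be a Kubernetes Job running stress-ng's memory-bandwidth and cache stressors alongside the workload under test; the image, stressor mix, and sizing are assumptions to tune against the baseline runs:

```yaml
# Sketch of a memory-bandwidth noisy neighbor. Image tag, stressor counts, and
# duration are placeholders to tune once baselines exist.
apiVersion: batch/v1
kind: Job
metadata:
  name: memory-noisy-neighbor
spec:
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: stress-ng
          image: ghcr.io/colinianking/stress-ng:latest   # placeholder image
          # --stream exercises memory bandwidth (STREAM-like); --cache thrashes the LLC.
          args: ["--stream", "4", "--cache", "4", "--timeout", "300s", "--metrics-brief"]
          resources:
            requests:
              cpu: "4"
              memory: 2Gi
```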