googleforgames / agones

Dedicated Game Server Hosting and Scaling for Multiplayer Games on Kubernetes
https://agones.dev
Apache License 2.0

Infrastructure for Agones performance reporting #573

Closed jkowalski closed 1 year ago

jkowalski commented 5 years ago

We need a solution for gathering and publishing/analyzing Agones performance over time to identify trends and performance regressions.

We need to solve 4 aspects of it:

  1. How to write performance tests and instrument them for performance data collection?
  2. How to store the results long-term?
  3. How to present the performance results (to be able to deep-dive into a test run, compare two runs, and see long-term trends)?
  4. How to run all of this periodically in a clean environment so we have comparable data over time?

Here's one proposal:

pm7h commented 5 years ago

Summarizing discussions on the design.

Architecture

The test infrastructure has three main components: test framework, storage and visualization, and the system under test. The test framework provides necessary modules for writing load and performance tests and is also responsible for generating test traffic. The storage and visualization component stores test results and provides graphs for single test runs, comparison of multiple runs, and trends over time. The system under test is determined by scenarios and configurations to be tested.

For better separation of the system under test from the framework, the test framework along with the storage and visualization components will be hosted on a separate cluster which will be running continuously. Load and performance tests will run against a test cluster which is created according to user-defined configurations. For each test run a new cluster is created and later destroyed. The separation of the system under test from the framework allows community-provided cluster configurations and tests to run against them.
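
To make the "user-defined configurations" a bit more concrete, here is a minimal Go sketch of what a per-run configuration could look like. All type and field names below are hypothetical placeholders, not part of the agreed design:

```go
// Hypothetical sketch only: names and fields are illustrative, not an agreed schema.
package perfconfig

// ClusterSpec describes the ephemeral test cluster created for a single run.
type ClusterSpec struct {
	Name              string // e.g. "agones-perf-<run-id>"
	Zone              string // e.g. "us-west1-c"
	MachineType       string // e.g. "n1-standard-4"
	NodeCount         int
	KubernetesVersion string
}

// TestRunConfig ties a cluster spec to the Agones version under test and the
// scenario that the framework cluster will drive against it.
type TestRunConfig struct {
	Cluster       ClusterSpec
	AgonesVersion string // release (or commit) to install on the test cluster
	Scenario      string // e.g. "fleet-scaling", "allocation-load"
	ResultsBucket string // GCS bucket the run writes its results into
}
```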

Test Run Configuration

Single-run tests can be executed on a custom cluster which is destroyed after the run. For continuous runs, we will use Cloud Build to run different test scenarios. Test results will be in a format appropriate for the storage and visualization components and will be stored in a GCS bucket.
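
As a sketch of the "stored in a GCS bucket" step (the bucket and object naming here are assumptions, not the agreed layout), a small Go helper using the cloud.google.com/go/storage client could look roughly like this:

```go
package results

import (
	"context"
	"fmt"
	"os"

	"cloud.google.com/go/storage"
)

// UploadResults copies a local results JSON file into a GCS bucket so the
// storage and visualization component can pick it up later. The bucket and
// object naming is a placeholder, not the agreed layout.
func UploadResults(ctx context.Context, bucket, object, localPath string) error {
	data, err := os.ReadFile(localPath)
	if err != nil {
		return fmt.Errorf("reading %s: %w", localPath, err)
	}

	client, err := storage.NewClient(ctx)
	if err != nil {
		return fmt.Errorf("creating storage client: %w", err)
	}
	defer client.Close()

	w := client.Bucket(bucket).Object(object).NewWriter(ctx)
	w.ContentType = "application/json"
	if _, err := w.Write(data); err != nil {
		return fmt.Errorf("writing object: %w", err)
	}
	// Close flushes the upload and returns any final error.
	return w.Close()
}
```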

Code Structure

markmandel commented 5 years ago

Some initial thoughts and questions:

  1. Some useful GCP products might be: Cloud Scheduler to set run schedules, Cloud Functions for some orchestration/glue (now we can do them in Go!), and I like the idea of running the tests in Cloud Build :+1: we just might need a GCF to fire it off (a rough sketch follows this list).
  2. I was originally thinking different nodepools (rather than separate clusters), but the more I think about this, I don't think that will line up with our goals:
    1. As we get more tests running in parallel, there is more K8s API traffic on the master -- more than we'd likely have originally.
      1. There does end up being more work to manage k8s cluster keys, but that's not an impossible problem. There must be an API we can leverage (I can't seem to find it right now).
  3. We can use deployment manager to template what our clusters look like.
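
For point 1 above, a rough sketch of the "GCF to fire it off" idea. The project ID, function name, and build step below are placeholders, not a worked-out pipeline; an HTTP-triggered Cloud Function in Go, invoked by Cloud Scheduler, could submit a Cloud Build job along these lines:

```go
package perftrigger

import (
	"fmt"
	"net/http"

	cloudbuild "google.golang.org/api/cloudbuild/v1"
)

// StartPerfRun is an HTTP-triggered Cloud Function (which Cloud Scheduler can
// call on a cron schedule) that submits a Cloud Build job to run the
// performance tests. The project ID and build step below are placeholders.
func StartPerfRun(w http.ResponseWriter, r *http.Request) {
	ctx := r.Context()

	svc, err := cloudbuild.NewService(ctx)
	if err != nil {
		http.Error(w, fmt.Sprintf("cloudbuild client: %v", err), http.StatusInternalServerError)
		return
	}

	build := &cloudbuild.Build{
		Steps: []*cloudbuild.BuildStep{{
			// Placeholder step: a real pipeline would check out Agones,
			// create the test cluster, run the perf targets and upload results.
			Name: "gcr.io/cloud-builders/gcloud",
			Args: []string{"version"},
		}},
	}

	op, err := svc.Projects.Builds.Create("my-perf-project", build).Do()
	if err != nil {
		http.Error(w, fmt.Sprintf("submitting build: %v", err), http.StatusInternalServerError)
		return
	}
	fmt.Fprintf(w, "started build operation %s\n", op.Name)
}
```

Cloud Scheduler can then hit the function's HTTPS endpoint on whatever cadence we want for the continuous runs.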

But otherwise, I think this all makes sense.

@Kuqd -- I expect you have opinions. WDYT?

cyriltovena commented 5 years ago

I would definitely create a new cluster for each run. Cloud Scheduler is a smart idea; we could definitely leverage that.

Now, I don't think the e2e suite is enough to run the performance tests. It's a good start, but I think we want a test that applies load, probably using Fortio or Locust to create a lot of allocations concurrently.

I definitely need to investigate Fortio more. But from what I can read from @jkowalski and @pm7h, we're on the right track.

Lastly, how big must that cluster be? We should aim to reach at least 250 allocations/sec.

pm7h commented 5 years ago

@Kuqd We were discussing load generation today. From what we understand, Fortio does not provide a load generation framework, but we can simulate that by creating multiple parallel goroutines in the test. This is basically what Locust does using Python's gevent.

In summary, Locust provides a better load test framework since it generates load and is lightweight. It also allows you to write code for your tests. However, the available visualizations are not great, since everything I have seen is time-series based.

I'm interested to know why you think e2e is not enough for performance/load tests. What do you see as the downsides if we generate a large number of goroutines and use that for load generation? That way we could take advantage of Fortio's graphing capabilities.

cyriltovena commented 5 years ago

Yes, you're right, and I think our test plan is not that complicated. With Go we will definitely be more flexible.

Should we write the load test using the same e2e framework? Do you think it would be worthwhile to make it a function of our e2e test suite?

We could definitely run that 4 times a day in the current e2e GKE cluster and replace the GitHub PR CI with e2e on kind (Docker).

cyriltovena commented 5 years ago

https://github.com/GoogleCloudPlatform/agones/blob/master/build/Makefile#L226

Should we use that target?

pm7h commented 5 years ago

This target is good for stress testing fleet scaling, but for testing allocations I would add another argument that says how many concurrent calls we should make. The test would then start a separate goroutine for each. This basically simulates what Locust does with Python's gevent. Does that make sense?
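
A minimal sketch of that shape, assuming a hypothetical flag name and an allocate() stand-in for whatever the e2e framework actually exposes (whether this lives in the e2e suite or behind the Makefile target is a separate question):

```go
package main

import (
	"flag"
	"log"
	"sort"
	"sync"
	"time"
)

var concurrentAllocations = flag.Int("concurrent-allocations", 50,
	"number of parallel allocation calls to make")

// allocate stands in for the real e2e-framework call that creates a
// GameServerAllocation against the test cluster.
func allocate() error {
	time.Sleep(10 * time.Millisecond) // placeholder for the real API call
	return nil
}

func main() {
	flag.Parse()

	n := *concurrentAllocations
	latencies := make([]time.Duration, n)
	var wg sync.WaitGroup

	// One goroutine per allocation call, mirroring what Locust does with
	// gevent greenlets.
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			start := time.Now()
			if err := allocate(); err != nil {
				log.Printf("allocation %d failed: %v", i, err)
			}
			latencies[i] = time.Since(start)
		}(i)
	}
	wg.Wait()

	// Crude percentile readout; a real run would emit Fortio-formatted JSON.
	sort.Slice(latencies, func(a, b int) bool { return latencies[a] < latencies[b] })
	log.Printf("p50=%v p95=%v max=%v",
		latencies[n*50/100], latencies[n*95/100], latencies[n-1])
}
```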

markmandel commented 5 years ago

@ilkercelikyilmaz - did you already write something like the above for #536 ?

pm7h commented 5 years ago

If #536 doesn't have a load parameter, I can write something similar to what we have for fleet_test and the target mentioned in https://github.com/GoogleCloudPlatform/agones/issues/573#issuecomment-464948872. And then we can emit metrics in Fortio format.

ilkercelikyilmaz commented 5 years ago

I implemented something using the e2e framework for my own testing.

@pm7h I can show you what I implemented tomorrow and you can decide how you want to implement the load test for allocation.

aLekSer commented 4 years ago

I spent some time gathering what we already have for completing this task:

1) make stress-test-e2e - creates a Fortio-formatted JSON file with results (different percentiles and QPS).

2) There is also @ilkercelikyilmaz's continuous, hours-long running test, which can be found here and which helps to find memory leaks, goroutine leaks, and performance changes over long periods of operation.

For stress-test-e2e, the results were uploaded into a GCS bucket and we can view them in Fortio:

fortio server -sync https://storage.googleapis.com/fortio-sync-2

There we can compare results from different versions. So what we don't have is a single script to:

  1. create cluster
  2. install Agones of a specific version
  3. run stress-test-e2e
  4. upload the data into the bucket and use that bucket as an input for the Fortio server.
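
A sketch of what that single script could look like, written here as a small Go wrapper around the CLI tools (cluster name, zone, Agones version, Helm flags, results file name, and bucket are all placeholders; a shell script or Cloud Build config would work just as well):

```go
package main

import (
	"log"
	"os"
	"os/exec"
)

// run executes a command, streaming its output, and returns any error.
func run(name string, args ...string) error {
	cmd := exec.Command(name, args...)
	cmd.Stdout, cmd.Stderr = os.Stdout, os.Stderr
	return cmd.Run()
}

func main() {
	cluster := "agones-perf"         // placeholder cluster name
	zone := "us-west1-c"             // placeholder zone
	version := "1.6.0"               // placeholder Agones version
	bucket := "gs://my-perf-results" // placeholder results bucket

	steps := [][]string{
		// 1. create the throwaway test cluster (gcloud also updates kubeconfig)
		{"gcloud", "container", "clusters", "create", cluster, "--zone", zone},
		// 2. install Agones of a specific version (Helm 3 style flags assumed)
		{"helm", "repo", "add", "agones", "https://agones.dev/chart/stable"},
		{"helm", "install", "agones", "agones/agones",
			"--namespace", "agones-system", "--create-namespace", "--version", version},
		// 3. run the existing stress target
		{"make", "stress-test-e2e"},
		// 4. push the results JSON where `fortio server -sync` can read it
		{"gsutil", "cp", "stress-results.json", bucket + "/"},
	}

	for _, s := range steps {
		if err := run(s[0], s[1:]...); err != nil {
			log.Printf("step %v failed: %v", s, err)
			break
		}
	}

	// Tear the cluster down even if a step failed.
	if err := run("gcloud", "container", "clusters", "delete", cluster,
		"--zone", zone, "--quiet"); err != nil {
		log.Printf("cluster delete failed: %v", err)
	}
}
```
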
aLekSer commented 4 years ago

There is an option to run this stress-test-e2e in a Prow job. An example of how Istio uses Fortio in a Prow job can be found here: https://github.com/istio/istio/wiki/Working-with-Prow https://prow.istio.io/?job=daily-performance-benchmark https://prow.istio.io/view/gcs/istio-prow/logs/daily-performance-benchmark/151 In the full log you can see that Fortio is set up in these 14-hour-long benchmark tests:

+ setup_fortio_and_prometheus
+ setup_metrics
++ kubectl get services -n twopods-istio fortioclient -o 'jsonpath={.status.loadBalancer.ingress[0].ip}'

https://github.com/istio/tools/blob/master/perf/benchmark/run_benchmark_job.sh

aLekSer commented 4 years ago

There was an opinion that we should use https://mako.dev/ for performance testing. Here is one example of how to use it: https://github.com/knative/serving/commit/a0a32a7895445f9c71b0be9a8c2c0d1b52d75c99

aLekSer commented 4 years ago

Made a request to create a benchmark for the project: https://github.com/google/mako/issues/9

roberthbailey commented 1 year ago

Since this hasn't been updated in over 2 years, I've marked it as stale.

markmandel commented 1 year ago

Let's close this, and we can restart it if necessary - possibly with different profiling tools.