mozilla-services / contile

This is the back-end server for the Mozilla Tile Service (MTS)
https://mozilla-services.github.io/contile/
Mozilla Public License 2.0

Investigate stage load testing failures #230

Closed data-sync-user closed 3 years ago

data-sync-user commented 3 years ago

Recent load testing results show a high level of failures that do not occur on production.

These may be caused by the k8s cluster sizing itself to accommodate the load. In my previous experience (syncstorage's k8s cluster), some initial errors are expected (see the Sync load test history), but these seem higher.

Let's double-check that these errors happen only at the start of the load test and that the stage k8s cluster is sizing itself correctly. If these errors are inevitable, let's also consider how we can more easily verify the stage environment.
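One way to check whether the errors are confined to startup is to bucket failures by elapsed time. The following is only a rough sketch: the `(seconds_since_start, succeeded)` sample format, the function names, and the default window and startup lengths are assumptions for illustration, not something the load tester is known to produce.

```python
from collections import Counter


def failures_per_window(samples, window_seconds=60):
    """Count failed requests in each fixed-size time window.

    `samples` is an iterable of (seconds_since_start, succeeded) pairs.
    This format is a hypothetical stand-in for real load test output.
    Returns a dict mapping window index -> number of failures.
    """
    counts = Counter()
    for elapsed, succeeded in samples:
        if not succeeded:
            counts[int(elapsed // window_seconds)] += 1
    return dict(counts)


def only_fails_at_startup(samples, startup_seconds=120):
    """Return True if every failure happened before `startup_seconds`."""
    return all(
        succeeded or elapsed < startup_seconds
        for elapsed, succeeded in samples
    )
```

If `only_fails_at_startup` returns `False`, the per-window counts show where the later failures cluster, which may help separate cluster sizing effects from other problems.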

We may need to adjust our expected failure rate and/or pre-warm the cluster (e.g. by running a short load test for sizing, followed by the lengthier load test).
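If the warm-up and the main test run back to back, the failure rate could be computed over the main phase only, so that errors during cluster sizing do not count against the result. A minimal sketch, assuming each request is recorded as a `(seconds_since_start, succeeded)` pair and that the warm-up length is known in advance (both are assumptions, as is the function name):

```python
def failure_rate_after_warmup(samples, warmup_seconds):
    """Return the failure percentage for requests sent after the warm-up.

    `samples` is an iterable of (seconds_since_start, succeeded) pairs,
    a hypothetical format. Requests sent before `warmup_seconds` are
    ignored entirely.
    """
    measured = [ok for elapsed, ok in samples if elapsed >= warmup_seconds]
    if not measured:
        raise ValueError("no requests recorded after the warm-up period")
    failures = sum(1 for ok in measured if not ok)
    return 100.0 * failures / len(measured)
```

For example, with a 60-second warm-up, two early failures are dropped and only the requests from second 60 onward contribute to the percentage.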

┆Issue is synchronized with this Jiraserver Task

data-sync-user commented 3 years ago

➤ Ankita Shrivastava commented:

Thank you Philip Jenvey for filing this!

data-sync-user commented 3 years ago

➤ Ankita Shrivastava commented:

Philip Jenvey, Jon Buckley: Let's decide on an acceptable failure percentage, because the ideal numbers (failure < 0.001% = pass, failure > 0.001% = fail) seem too aspirational for now, and we haven't achieved them yet.
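For illustration, the pass/fail rule could be expressed with the threshold as a parameter, so that a more realistic value can be swapped in once one is agreed. This is only a sketch: the function name and inputs are hypothetical, and treating a rate of exactly the threshold as a failure is an assumption, since the rule above does not cover that case.

```python
def load_test_verdict(total_requests, failed_requests, max_failure_percent=0.001):
    """Return "pass" or "fail" for a load test run.

    The default threshold of 0.001% is the "ideal" number from this thread.
    Assumption: a failure rate exactly equal to the threshold counts as a
    fail, because the stated rule leaves that case undefined.
    """
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    failure_percent = 100.0 * failed_requests / total_requests
    return "pass" if failure_percent < max_failure_percent else "fail"
```

With the default threshold, 5 failures in 1,000,000 requests (0.0005%) passes, while 20 failures (0.002%) fails. Passing a larger `max_failure_percent` relaxes the rule without changing the code.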

pjenvey commented 3 years ago

This is no longer an issue on stage. It's possibly due to a couple of changes to the cluster:

@ashrivastava-qa has also seen some unrelated high failure rates, but these appear to be due to a contile-loadtester environment issue.