Here is a draft of what the options can look like:

| name | required | description | type | default |
| --- | --- | --- | --- | --- |
| baselineEndpoint | no | baseline endpoint | string | na |
| candidateEndpoint | yes | candidate endpoint | string | na |
| mirroringWeight | no | percentage of traffic to be mirrored | integer, 0-100 | 100 |
| prometheusUrl | yes | Prometheus endpoint | string | na |
| promLabels.xyz | yes | Prometheus labels for the candidate | string | na |
| SLOs.xyz | no | SLOs to be validated | float | na |
| loops | no | number of times metrics are collected | integer | 3 |
| minutesBetweenLoops | no | number of minutes between each loop | integer | 1 |
| hoursBetweenLoops | no | number of hours between each loop | integer | na |
| daysBetweenLoops | no | number of days between each loop | integer | na |
| schedule | no | CronJob schedule | string | na |
Note: throw an error if more than one of the following is set: minutesBetweenLoops, hoursBetweenLoops, daysBetweenLoops, schedule.
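For illustration only, here is a minimal sketch of how the chart could enforce that rule at render time. It assumes Helm values named exactly as in the table above and is not the implemented behavior:

```yaml
{{- /* Hypothetical guard: count how many interval-related options are set */ -}}
{{- $set := 0 -}}
{{- range list .Values.minutesBetweenLoops .Values.hoursBetweenLoops .Values.daysBetweenLoops .Values.schedule -}}
{{- if . }}{{ $set = add $set 1 }}{{ end -}}
{{- end -}}
{{- if gt $set 1 -}}
{{- fail "set at most one of: minutesBetweenLoops, hoursBetweenLoops, daysBetweenLoops, schedule" -}}
{{- end -}}
```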
Example using some labels:

```shell
iter8 k launch istio-mirroring \
  --set baselineEndpoint=... \
  --set candidateEndpoint=... \
  --set mirroringWeight=50 \
  --set promLabels.abc=... \
  --set promLabels.xyz=... \
  --set SLOs.istio-prom/error-rate=0 \
  --set SLOs.istio-prom/latency-mean=50 \
  --set SLOs.istio-prom/latency-p90=100 \
  --set SLOs.istio-prom/latency-p'97\.5'=200 \
  --set loops=10 \
  --set timeBetweenLoops=60
```
Questions:

1. Are multiple loops necessary as a first step? If so, is a loop about the experiment as a whole? Or a subset of tasks? When we consider tasks like readiness and notification, I suspect that it should be a subset of tasks.
2. Are both a baseline endpoint and a candidate endpoint required? Can we think of an experiment as testing the quality of the candidate endpoint? Does it necessarily need to be a comparison to the baseline? Agree that we probably do want to do comparisons. But is it intrinsic?
3. Why might we need minutesBetweenLoops, hoursBetweenLoops, and daysBetweenLoops? Can't we just use a timeBetweenLoops notion that can express all of these? For example, 3d4h8m4s. This is pretty standard in Go.
> Are multiple loops necessary as a first step? If so, is a loop about the experiment as a whole? Or a subset of tasks? When we consider tasks like readiness and notification, I suspect that it should be a subset of tasks.

I was originally thinking of it as the number of times we query the database for metrics, which determines the overall length of the experiment. I have not considered looping with other tasks. What is your opinion?
> Are both a baseline endpoint and a candidate endpoint required? Can we think of an experiment as testing the quality of the candidate endpoint? Does it necessarily need to be a comparison to the baseline?

Yes, I can see us just testing a candidate endpoint. In that case, I will make baseline not required.
> Why might we need minutesBetweenLoops, hoursBetweenLoops, and daysBetweenLoops? Can't we just use a timeBetweenLoops notion that can express all of these? For example, 3d4h8m4s. This is pretty standard in Go.

The logic behind these options is to allow a user to quickly create an experiment while covering the majority of use cases. For example, a user can spin up a mirroring experiment over the course of 30 minutes, or 12 hours, or over a week by using any one of these options. They are not intended to be used together for something like 3d4h8m4s. Furthermore, timeBetweenLoops would require the user to know the format of the time that we want, which granted may not be a huge obstacle, but it will also lead to additional complexity like input validation and parsing the input into a CronJob schedule expression when we really just want something simple. Thoughts?
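For illustration, a sketch of how these convenience values might be turned into a CronJob schedule inside the template, so the user never writes a cron expression themselves. The value names follow the table above; the mapping itself is an assumption, not the implemented behavior:

```yaml
{{- /* Hypothetical mapping from convenience values to a cron expression */ -}}
{{- if .Values.schedule }}
schedule: {{ .Values.schedule | quote }}
{{- else if .Values.minutesBetweenLoops }}
schedule: {{ printf "*/%d * * * *" (int .Values.minutesBetweenLoops) | quote }}
{{- else if .Values.hoursBetweenLoops }}
schedule: {{ printf "0 */%d * * *" (int .Values.hoursBetweenLoops) | quote }}
{{- else if .Values.daysBetweenLoops }}
schedule: {{ printf "0 0 */%d * *" (int .Values.daysBetweenLoops) | quote }}
{{- end }}
```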
> Are multiple loops necessary as a first step? If so, is a loop about the experiment as a whole? Or a subset of tasks? When we consider tasks like readiness and notification, I suspect that it should be a subset of tasks.

The user may want to mirror traffic over a period of a day, simply because they aren't receiving enough traffic, and it takes a day to collect enough. It is of course nice to be able to update the metrics and SLO validation status periodically within the experiment time period, so that up-to-date information is available through iter8 k report.
> Are both a baseline endpoint and a candidate endpoint required? Can we think of an experiment as testing the quality of the candidate endpoint? Does it necessarily need to be a comparison to the baseline?

Perhaps baseline and candidate are misnomers, and come with baggage of past Iter8 usage. Perhaps we should use the nomenclature in this article. That would be source and target. I believe networking information about both source and target is needed in order to set up any mirroring experiment. Networking information and metrics information may have common elements, but they serve different purposes in this experiment.
> The logic behind these options is to allow a user to quickly create an experiment while covering the majority of use cases. For example, a user can spin up a mirroring experiment over the course of 30 minutes, or 12 hours, or over a week by using any one of these options. They are not intended to be used together for something like 3d4h8m4s. Furthermore, timeBetweenLoops would require the user to know the format of the time that we want, which granted may not be a huge obstacle, but it will also lead to additional complexity like input validation and parsing the input into a CronJob schedule expression when we really just want something simple. Thoughts?

My personal preference is to stick to the CronJob schedule format and provide simple examples. We can give easy examples for once every thirty minutes (`*/30 * * * *`), once every six hours (`0 */6 * * *`), once every two days (`0 0 */2 * *`), and so on. In other words, keep the template values simple (from a dev perspective), but provide illustrative examples from the end-user perspective. I think the Kubernetes CronJob is a concept that is now familiar to the Kubernetes community. Also, other documentation besides what we provide is available, not to mention resources like https://crontab.guru, which we can refer people to for playing further with the schedule.

Having said the above, we can certainly consider minutesBetweenLoops, hoursBetweenLoops, and daysBetweenLoops in version 2 of the mirroring experiment chart, and not necessarily in version 1 (just to keep things simple in the MVP).
Some mirroring VS examples:

```shell
kubectl apply -f - <<EOF
apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: httpbin
spec:
  hosts:
  - httpbin
  http:
  - route:
    - destination:
        host: httpbin
        subset: v1
      weight: 100
    mirror:
      host: httpbin
      subset: v2
    mirrorPercentage:
      value: 100.0
EOF
```
https://istiobyexample.dev/traffic-mirroring/

```yaml
apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: encode-mirror
spec:
  hosts:
  - encode
  http:
  - route:
    - destination:
        host: encode
        subset: prod
      weight: 100
    mirror:
      host: encode
      subset: test
```
https://tech.trivago.com/post/2020-06-10-crossclustertrafficmirroringwithistio/

```yaml
apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: httpbin-virtualservice
  namespace: frontend
spec:
  hosts:
  - httpbin
  http:
  - route:
    - destination:
        host: httpbin
        port:
          number: 8000
      weight: 100
    mirror:
      host: httpin.frontend.stage.eu.trv.cloud
      port:
        number: 80
    mirror_percent: 100
```
```yaml
apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: web-mirror-vs
spec:
  hosts:
  - web-service.default.svc.cluster.local
  http:
  - route:
    - destination:
        host: web-service.default.svc.cluster.local
        subset: v1
      weight: 100
    mirror:
      host: web-service.default.svc.cluster.local
      subset: v2
    mirror_percent: 100
```
https://betterprogramming.pub/traffic-mirroring-in-kubernetes-using-istio-dad0976b4e1

```shell
$ kubectl apply -f - <<EOF
apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: nginx
spec:
  hosts:
  - nginx
  http:
  - route:
    - destination:
        host: nginx
        subset: v1
      weight: 100
---
apiVersion: networking.istio.io/v1alpha3
kind: DestinationRule
metadata:
  name: nginx
spec:
  host: nginx
  subsets:
  - name: v1
    labels:
      version: v1
  - name: v2
    labels:
      version: v2
EOF
```
Multi-cluster traffic mirroring: https://piotrminkowski.com/2021/07/12/multicluster-traffic-mirroring-with-istio-and-kind/

```yaml
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: callme-service-route
spec:
  hosts:
  - callme-service
  http:
  - route:
    - destination:
        host: callme-service
        subset: v1
      weight: 100
    mirror:
      host: callme-service
      subset: v2
    mirrorPercentage:
      value: 50.
```
Reading the above comments suggests that the notion of loop is very specific. The intent here seems to be to loop a single task. Furthermore, the reason for the looping does not seem to be intrinsic to the solution. Rather, it is driven by a desire to be able to provide "recent" feedback to the user. Since we don't know what "recent" means, we are suggesting a configurable loop/time option.
In this context, I suggest we don't want to expose to the user the notion of loop/time. Alternatives might be to update the feedback when iter8 report is called, or when it has been a long time since the last update. I confess I don't know how to do these and they might be impractical. However, I wanted to suggest them as alternatives because they might be better targets and match user need better.
Instead of specifying a loop count and an interval, an overall experiment duration might be provided. Or introduce the notion that the experiment does not end until it is deleted.

Should we decide on using a time, we already use the Go notion of time for intervals, delays, etc. in other tasks. I see no reason to change this. I think cron time is overkill and more complicated than it needs to be. I agree that users aren't going to pick weird combinations of time, but there is no need to do something different either.
> Perhaps baseline and candidate are misnomers, and come with baggage of past Iter8 usage. Perhaps we should use the nomenclature in this article. That would be source and target.

I agree to a change in nomenclature, though I might even prefer something like source and mirror.
> The intent here seems to be to loop a single task.

Loop through two tasks -- collect metrics, and validate SLOs.
> Furthermore, the reason for the looping does not seem to be intrinsic to the solution. Rather, it is driven by a desire to be able to provide "recent" feedback to the user.

The desire is to accomplish what the CRD model of looping accomplished for us in Iter8 v0.7, without actually needing to write extra looping code -- we can simply piggyback on the existing Kubernetes CronJob mechanism, which does this heavy lifting already. Previously, the CRD + controller model used to update the experiment status. Now the new model will simply update the result secret. That's the only difference in terms of the implementation.

From the end-user perspective, the looping is conceptually identical, though the fact that we're using a CronJob might mean slight variations in how we expose the looping options.

I also agree that using the cron schedule is more complex than simply asking for a time between loops or a total duration; but maybe it is an OK starting point for an MVP implementation of the mirroring experiment?... Because, with a cron schedule, we don't have to write a single line of extra code (in the template) to parse it further.
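To make the "no extra parsing" point concrete, here is a minimal sketch of passing the user-supplied schedule straight into the CronJob. The `cron.schedule` value name is taken from the commands later in this thread; the default and the simplified name are assumptions:

```yaml
# Sketch of the CronJob fragment: the cron expression is passed through verbatim,
# so the chart never parses or converts time values.
apiVersion: batch/v1
kind: CronJob
metadata:
  name: {{ .Release.Name }}-job
spec:
  schedule: {{ .Values.cron.schedule | default "*/1 * * * *" | quote }}
```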
It would be nice to see the output of the following commands:

```shell
# make sure the ./charts folder has an istio-mirroring folder underneath it

# update dependencies for istio-mirroring
helm dependency update charts/istio-mirroring

# generated manifests will be imperfect; release info would be missing; still useful
helm template charts/istio-mirroring --set # whatever needs to be set
```

Producing the above output involves creating a cron-job template in the iter8lib chart, using that to create the istio-mirroring chart, and making sure that there are no syntax or other errors so that the template command works.
I have drafted up some templates for the mirroring experiments here but I have some design questions.

You can try them out by using:

```shell
git clone https://github.com/Alan-Cha/iter8.git iter8-alan
cd iter8-alan
git checkout mirroring-temp
cd charts
helm dependency update istio-mirroring
helm template istio-mirroring/ --set url=http://127.0.0.1/get --set destination_workload=myapp --set destination_workload_namespace=default
```

which should produce the following:
```yaml
# Source: istio-mirroring/templates/k8s.yaml
apiVersion: v1
kind: Secret
metadata:
  name: RELEASE-NAME-1-spec
stringData:
  experiment.yaml: |
    - task: collect-metrics-database
      with:
        versionInfo:
        - destination_workload: myapp
          destination_workload_namespace: default
        # TODO: Should we make this more generic using the below?
        # null
    # task: validate service level objectives for app using
    # the metrics collected in an earlier task
    - task: assess-app-versions
      with:
        SLOs:
        - metric: istio/error-rate
          upperLimit: 0
  metrics.yaml: |
    url: prom-url.xyz/api/v1/query
    provider: Istio
    method: GET
    # Inputs for the template:
    # app string
    # chart string
    # connection_security_policy string
    # destination_app string
    # destination_canonical_revision string
    # destination_canonical_service string
    # destination_cluster string
    # destination_principal string
    # destination_service string
    # destination_service_name string
    # destination_service_namespace string
    # destination_version string
    # heritage string
    # install_operator_istio_io_owning_resource string
    # instance string
    # istio string
    # istio_io_rev string
    # job string
    # namespace string
    # operator_istio_io_component string
    # pod string
    # pod_template_hash string
    # release string
    # request_protocol string
    # response_code string
    # response_flags string
    # service_istio_io_canonical_name string
    # service_istio_io_canonical_revision string
    # sidecar_istio_io_inject string
    # source_app string
    # source_canonical_revision string
    # source_canonical_service string
    # source_cluster string
    # source_principal string
    # source_version string
    # source_workload string
    # source_workload_namespace string
    #
    # Inputs for the metrics (output of template):
    # destination_workload string
    # destination_workload_namespace string
    # StartingTime int64 (UNIX time stamp)
    #
    # Note: ElapsedTime is produced by Iter8
    metrics:
    - name: request-count
      type: counter
      description: |
        Number of requests
      params:
      - name: query
        value: |
          sum(last_over_time(istio_requests_total{
            reporter='source',
            {{- if .destination_workload }}
            destination_workload="{{.destination_workload}}",
            {{- end }}
            {{- if .destination_workload_namespace }}
            destination_workload_namespace="{{.destination_workload_namespace}}",
            {{- end }}
          }[{{.ElapsedTime}}s])) or on() vector(0)
      jqExpression: .data.result[0].value[1]
    - name: error-count
      type: counter
      description: |
        Number of non-successful requests
      params:
      - name: query
        value: |
          sum(last_over_time(istio_requests_total{
            response_code=~'5..',
            reporter='source',
            {{- if .destination_workload }}
            destination_workload="{{.destination_workload}}",
            {{- end }}
            {{- if .destination_workload_namespace }}
            destination_workload_namespace="{{.destination_workload_namespace}}",
            {{- end }}
          }[{{.ElapsedTime}}s])) or on() vector(0)
      jqExpression: .data.result[0].value[1]
    - name: error-rate
      type: gauge
      description: |
        Percentage of non-successful requests
      params:
      - name: query
        value: |
          sum(last_over_time(istio_requests_total{
            response_code=~'5..',
            reporter='source',
            {{- if .destination_workload }}
            destination_workload="{{.destination_workload}}",
            {{- end }}
            {{- if .destination_workload_namespace }}
            destination_workload_namespace="{{.destination_workload_namespace}}",
            {{- end }}
          }[{{.ElapsedTime}}s])) or on() vector(0)/sum(last_over_time(istio_requests_total{
            reporter='source',
            {{- if .destination_workload }}
            destination_workload="{{.destination_workload}}",
            {{- end }}
            {{- if .destination_workload_namespace }}
            destination_workload_namespace="{{.destination_workload_namespace}}",
            {{- end }}
          }[{{.ElapsedTime}}s])) or on() vector(0)
      jqExpression: .data.result.[0].value.[1]
    - name: le500ms-latency-percentile
      type: gauge
      description: |
        Less than 500 ms latency
      params:
      - name: query
        value: |
          sum(last_over_time(istio_request_duration_milliseconds_bucket{
            le='500',
            reporter='source',
            {{- if .destination_workload }}
            destination_workload="{{.destination_workload}}",
            {{- end }}
            {{- if .destination_workload_namespace }}
            destination_workload_namespace="{{.destination_workload_namespace}}",
            {{- end }}
          }[{{.ElapsedTime}}s])) or on() vector(0)/sum(last_over_time(istio_request_duration_milliseconds_bucket{
            le='+Inf',
            reporter='source',
            {{- if .destination_workload }}
            destination_workload="{{.destination_workload}}",
            {{- end }}
            {{- if .destination_workload_namespace }}
            destination_workload_namespace="{{.destination_workload_namespace}}",
            {{- end }}
          }[{{.ElapsedTime}}s])) or on() vector(0)
      jqExpression: .data.result[0].value[1]
    - name: mean-latency
      type: gauge
      description: |
        Mean latency
      params:
      - name: query
        value: |
          sum(last_over_time(istio_request_duration_milliseconds_sum{
            reporter='source',
            {{- if .destination_workload }}
            destination_workload="{{.destination_workload}}",
            {{- end }}
            {{- if .destination_workload_namespace }}
            destination_workload_namespace="{{.destination_workload_namespace}}",
            {{- end }}
          }[{{.ElapsedTime}}s])) or on() vector(0)/sum(last_over_time(istio_requests_total{
            reporter='source',
            {{- if .destination_workload }}
            destination_workload="{{.destination_workload}}",
            {{- end }}
            {{- if .destination_workload_namespace }}
            destination_workload_namespace="{{.destination_workload_namespace}}",
            {{- end }}
          }[{{.ElapsedTime}}s])) or on() vector(0)
      jqExpression: .data.result[0].value[1]
---
# Source: istio-mirroring/templates/k8s.yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: RELEASE-NAME-1-spec-role
rules:
- apiGroups: [""]
  resources: ["secrets"]
  resourceNames: ["RELEASE-NAME-1-spec"]
  verbs: ["get"]
---
# Source: istio-mirroring/templates/k8s.yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: RELEASE-NAME-1-result-role
rules:
- apiGroups: [""]
  resources: ["secrets"]
  resourceNames: ["RELEASE-NAME-1-result"]
  verbs: ["create", "get", "update"]
---
# Source: istio-mirroring/templates/k8s.yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: RELEASE-NAME-1-spec-rolebinding
subjects:
- kind: ServiceAccount
  name: default
roleRef:
  kind: Role
  name: RELEASE-NAME-1-spec-role
  apiGroup: rbac.authorization.k8s.io
---
# Source: istio-mirroring/templates/k8s.yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: RELEASE-NAME-1-result-rolebinding
subjects:
- kind: ServiceAccount
  name: default
roleRef:
  kind: Role
  name: RELEASE-NAME-1-result-role
  apiGroup: rbac.authorization.k8s.io
---
# Source: istio-mirroring/templates/k8s.yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: RELEASE-NAME-1-job
spec:
  schedule: "*/1 * * * *"
  jobTemplate:
    spec:
      template:
        spec:
          containers:
          - name: iter8
            image: iter8-tools/iter8:0.9
            imagePullPolicy: Always
            command:
            - "/bin/sh"
            - "-c"
            - |
              iter8 k run --namespace default --group RELEASE-NAME --revision 1
          restartPolicy: Never
      backoffLimit: 0
```
There are some design questions that need to be answered. For example, I have created a number of Istio-specific templates such as _k-spec-secret-istio.tpl and _task-istio.tpl. Is it possible to create something more generic? Is istio.metrics a good name for _istio.metrics.tpl as well?
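For what it's worth, one way to make the pieces more reusable is a named template that provider-specific charts include. A rough sketch follows; the istio.metrics name and the metric.url value come from this thread, while the file layout and everything else are illustrative assumptions:

```yaml
{{- /* _istio.metrics.tpl (sketch): define the provider-specific metrics spec once */}}
{{- define "istio.metrics" -}}
url: {{ .Values.metric.url }}
provider: Istio
method: GET
{{- end }}

{{- /* elsewhere, e.g. in the spec-secret template, pull it in: */}}
metrics.yaml: |
{{ include "istio.metrics" . | indent 2 }}
```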
As far as Istio is concerned ... there is a source and a mirror version of the app ...

As far as the Iter8 experiment is concerned, there is only app, which corresponds to the mirrored version.
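To tie this back to the VS examples above, here is a minimal annotated fragment; the host and subset names are placeholders, not values from the chart:

```yaml
http:
- route:
  - destination:
      host: myapp          # "source" version: serves the live traffic
      subset: prod
    weight: 100
  mirror:
    host: myapp            # "mirror" version: receives the mirrored traffic;
    subset: candidate      # this is the "app" from the Iter8 experiment's point of view
  mirrorPercentage:
    value: 100.0
```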
```shell
helm template charts/istio-mirroring/ \
  --set metric.url=http://127.0.0.1/get \
  --set metric.labels.destination_workload=myapp \
  --set metric.labels.destination_workload_namespace=default \
  --set cron.schedule="1/1 * * * *" \
  --set SLOs.istio/error-rate=0
```

The above is good for dev purposes. In reality, the user will invoke the above as follows:

```shell
iter8 k launch -c istio-mirroring/ \
  --set metric.url=http://127.0.0.1/get \
  --set metric.labels.destination_workload=myapp \
  --set metric.labels.destination_workload_namespace=default \
  --set cron.schedule="1/1 * * * *" \
  --set SLOs.istio/error-rate=0
```
- Forbid concurrency through concurrencyPolicy in the CronJob template (please refer to the CronJob API); see the sketch below.
- loops is not supported by CronJob execution (though it is supported for CronJob storage).
- Update the number of loops in the experiment result object.
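A small sketch of the first item; concurrencyPolicy is a standard field of the Kubernetes CronJob API, while the surrounding names simply mirror the generated manifest above:

```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: RELEASE-NAME-1-job
spec:
  schedule: "*/1 * * * *"
  # do not start a new loop while the previous one is still running
  concurrencyPolicy: Forbid
  jobTemplate:
    spec:
      backoffLimit: 0
```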
@Alan-Cha At some point (perhaps after the CNCF presentation), we may want to consider the following changes ...
**Is your feature request related to a problem? Please describe.**
We have a task that supports reading metrics from DBs, and these metrics can be further used for SLO validation in the assess task. We want to use these features to enable a traffic mirroring + SLO validation experiment.

**Describe the solution you'd like**
The above experiment should support, minimally, the following values.