iter8-tools / iter8

Kubernetes release optimizer
https://iter8.tools
Apache License 2.0

Design the Istio mirroring + SLO validation experiment #1201

Closed sriumcp closed 2 years ago

sriumcp commented 2 years ago

Is your feature request related to a problem? Please describe. We have a task that supports reading metrics from DBs, and these metrics can be further used for SLO validation in the assess task. We want to use these features to enable a traffic mirroring + SLO validation experiment.

Describe the solution you'd like

iter8 k launch Istio-mirroring ...

The above experiment should minimally support the following values (a rough values sketch follows the list).

  1. baseline and candidate end-points
  2. weights
  3. prom URL (and maybe auth for the prom URL) in order to read the metrics
  4. prometheus labels for candidate services that are needed to query the metrics
  5. SLOs
  6. Number of experiment loops + time interval between each loop (in other words, duration of the experiment)
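
A hypothetical values sketch for these (names and values here are illustrative, not a concrete proposal):

baselineEndpoint: http://baseline.default.svc.cluster.local
candidateEndpoint: http://candidate.default.svc.cluster.local
# percentage of traffic mirrored to the candidate
mirroringWeight: 100
# Prometheus endpoint (plus any auth) used to read metrics
prometheusUrl: http://prometheus.istio-system.svc.cluster.local:9090
# Prometheus labels identifying the candidate
promLabels:
  destination_workload: candidate
  destination_workload_namespace: default
# SLOs to validate
SLOs:
  istio/error-rate: 0
# loops and interval between loops (i.e., experiment duration)
loops: 3
minutesBetweenLoops: 1
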
Alan-Cha commented 2 years ago

Here is a draft of what the options can look like

| name | required | description | type | default |
| ---- | -------- | ----------- | ---- | ------- |
| baselineEndpoint | no | baseline endpoint | string | na |
| candidateEndpoint | yes | candidate endpoint | string | na |
| mirroringWeight | no | percentage of traffic to be mirrored | integer, 0-100 | 100 |
| prometheusUrl | yes | Prometheus endpoint | string | na |
| promLabels.xyz | yes | Prometheus labels for the candidate | string | na |
| SLOs.xyz | no | SLOs to be validated | float | na |
| loops | no | number of times metrics are collected | integer | 3 |
| minutesBetweenLoops | no | number of minutes between each loop | integer | 1 |
| hoursBetweenLoops | no | number of hours between each loop | integer | na |
| daysBetweenLoops | no | number of days between each loop | integer | na |
| schedule | no | CronJob schedule | string | na |

Note: throw an error if more than one of the following is set: minutesBetweenLoops, hoursBetweenLoops, daysBetweenLoops, schedule
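
A minimal sketch of how the chart could enforce this, assuming these value names and using Helm's built-in fail function (where this lives in the templates is illustrative):

{{- /* error out if more than one interval option is set */ -}}
{{- $count := 0 -}}
{{- range list .Values.minutesBetweenLoops .Values.hoursBetweenLoops .Values.daysBetweenLoops .Values.schedule -}}
{{- if . }}{{- $count = add1 $count -}}{{- end -}}
{{- end -}}
{{- if gt $count 1 -}}
{{- fail "set at most one of minutesBetweenLoops, hoursBetweenLoops, daysBetweenLoops, and schedule" -}}
{{- end -}}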


Example using some labels:

iter8 k launch istio-mirroring \
--set baselineEndpoint=... \
--set candidateEndpoint=... \
--set mirroringWeight=50 \
--set promLabels.abc=... \
--set promLabels.xyz=... \
--set SLOs.istio-prom/error-rate=0 \
--set SLOs.istio-prom/latency-mean=50 \
--set SLOs.istio-prom/latency-p90=100 \
--set SLOs.istio-prom/latency-p'97\.5'=200 \
--set loops=10 \
--set timeBetweenLoops=60

Questions:

kalantar commented 2 years ago

Are multiple loops necessary as a first step? If so, is a loop about the experiment as a whole? Or a subset of tasks? When we consider tasks like readiness and notification, I suspect that it should be a subset of tasks.

kalantar commented 2 years ago

Are both a baseline endpoint and a candidate endpoint required? Can we think of an experiment as testing the quality of the candidate endpoint? Does it necessarily need to be a comparison to the baseline? Agree that we probably do want to do comparisons. But is it intrinsic?

kalantar commented 2 years ago

Why might we need minutesBetweenLoops, hoursBetweenLoops, and daysBetweenLoops? Can't we just use a timeBetweenLoops notion that can express all of these? For example, 3d4h8m4s. This is pretty standard in Go.

Alan-Cha commented 2 years ago

Are multiple loops necessary as a first step? If so, is a loop about the experiment as a whole? Or a subset of tasks? When we consider tasks like readiness and notification, I suspect that it should be a subset of tasks.

I was originally thinking of it as the number of times we query the database for metrics, which (together with the time between loops) determines the overall length of the experiment. I have not considered looping with other tasks. What is your opinion?

Are both a baseline endpoint and a candidate endpoint required? Can we think of an experiment as testing the quality of the candidate endpoint. Does it necessarily need to be a comparison to the baseline?

Yes, I can see us just testing a candidate endpoint. In that case, I will make baseline not required.

Why might we need minutesBetweenLoops, hoursBetweenLoops, and daysBetweenLoops? Can't we just use a timeBetweenLoops notion that can express all of these. For example, 3d4h8m4s. This is pretty standard in go.

The logic behind these options is to allow a user to quickly create an experiment while covering the majority of use cases. For example, a user can spin up a mirroring experiment over the course of 30 minutes, or 12 hours, or a week by using any one of these options. They are not intended to be used together for something like 3d4h8m4s. Furthermore, timeBetweenLoops would require the user to know the time format we expect, which granted may not be a huge obstacle, but it would also lead to additional complexity like input validation and parsing the input into a CronJob schedule expression, when we really just want something simple. Thoughts?
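
For illustration, a rough sketch of the kind of translation the chart would then do internally (option names as above; this is not an agreed design):

{{- /* derive a CronJob schedule from whichever interval option is set */ -}}
{{- if .Values.minutesBetweenLoops }}
schedule: {{ printf "*/%d * * * *" (int .Values.minutesBetweenLoops) | quote }}
{{- else if .Values.hoursBetweenLoops }}
schedule: {{ printf "0 */%d * * *" (int .Values.hoursBetweenLoops) | quote }}
{{- else if .Values.daysBetweenLoops }}
schedule: {{ printf "0 0 */%d * *" (int .Values.daysBetweenLoops) | quote }}
{{- else }}
schedule: {{ .Values.schedule | default "*/1 * * * *" | quote }}
{{- end }}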

sriumcp commented 2 years ago

Are multiple loops necessary as a first step? If so, is a loop about the experiment as a whole? Or a subset of tasks? When we consider tasks like readiness and notification, I suspect that it should be a subset of tasks.

The user may want to mirror traffic over a period of a day, simply because they aren't receiving enough traffic, and it takes a day to collect enough. It is of course nice to be able to update the metrics and the SLO validation status periodically within the experiment time period, so that up-to-date information is available through iter8 k report.

Are both a baseline endpoint and a candidate endpoint required? Can we think of an experiment as testing the quality of the candidate endpoint. Does it necessarily need to be a comparison to the baseline?

Perhaps baseline and candidate are misnomers, and come with baggage of past Iter8 usage. Perhaps we should use the nomenclature in this article. That would be source and target. I believe networking information about both source and target is needed in order to set up any mirroring experiment. Networking information and metrics information may have common elements, but serve different purposes in this experiment.

The logic behind these options is to allow a user to quickly create an experiment while covering the majority of use cases. For example, a user can spin up a mirroring experiment over the course of 30 minutes, or 12 hours, or a week by using any one of these options. They are not intended to be used together for something like 3d4h8m4s. Furthermore, timeBetweenLoops would require the user to know the time format we expect, which granted may not be a huge obstacle, but it would also lead to additional complexity like input validation and parsing the input into a CronJob schedule expression, when we really just want something simple. Thoughts?

My personal preference is to stick to the cronjob schedule format and provide simple examples. We can give easy examples for once every thirty minutes */30 * * * *, once every six hours 0 */6 * * *, once every two days 0 0 */2 * *, and so on. In other words, keep the template values simple (from a dev perspective), but provide illustrative examples from the end-user perspective. I think the Kubernetes CronJob is a concept that is now familiar to the Kubernetes community. Also, other documentation besides what we provide is available, not to mention resources like https://crontab.guru, which we can refer people to for playing further with the schedule.

Having said the above, we can certainly consider minutesBetweenLoops, hoursBetweenLoops, and daysBetweenLoops, in version 2 of the mirroring experiment chart, and not necessarily in version 1 (just to keep things simple in the MVP).
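
For example, the chart could simply document a cron.schedule value (the value name matches the examples later in this thread) along with a few presets:

# illustrative values fragment
cron:
  # run the metrics-collection + SLO-validation loop once every thirty minutes
  schedule: "*/30 * * * *"
  # other easy presets to document:
  #   "0 */6 * * *"   -> once every six hours
  #   "0 0 */2 * *"   -> once every two days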

sriumcp commented 2 years ago

Some mirroring VS examples:

kubectl apply -f - <<EOF
apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: httpbin
spec:
  hosts:
    - httpbin
  http:
  - route:
    - destination:
        host: httpbin
        subset: v1
      weight: 100
    mirror:
      host: httpbin
      subset: v2
    mirrorPercentage:
      value: 100.0
EOF
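
For reference, a variant of the same illustrative httpbin VirtualService that mirrors only half of the traffic (corresponding to a mirroringWeight of 50):

kubectl apply -f - <<EOF
apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: httpbin
spec:
  hosts:
    - httpbin
  http:
  - route:
    - destination:
        host: httpbin
        subset: v1
      weight: 100
    mirror:
      host: httpbin
      subset: v2
    mirrorPercentage:
      value: 50.0
EOF
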
kalantar commented 2 years ago

Reading the above comments suggests that the notion of loop is very specific. The intent here seems to be to loop a single task. Furthermore, the reason for the looping does not seem to be intrinsic to the solution. Rather, it is driven by a desire to be able to provide "recent" feedback to the user. Since we don't know what recent means, we are suggesting a user-specified loop/interval.

In this context, I suggest we don't want to expose to the user the notion of loop/time. Alternatives might be

I confess I don't know how to do these and they might be impractical. However, I wanted to suggest them as alternatives because they might be better targets and match user needs better.

Instead of specifying a loop count and an interval, an overall experiment duration might be provided. Or introduce the notion that the experiment does not end until it is deleted.

Should we decide on using a time, we already use the Go notion of duration for intervals, delays, etc. in other tasks. I see no reason to change this. I think cron time is overkill and more complicated than it needs to be. I agree that users aren't going to pick weird combinations of time, but there is no need to do something different either.
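
For illustration, the Go-duration style would look roughly like this in values (the name is hypothetical). Note that Go's standard duration parser accepts units up to hours (e.g. 30m, 12h, 72h), so a d unit as in 3d4h8m4s would need custom handling:

# hypothetical values fragment using a Go-style duration
timeBetweenLoops: 30m   # or 12h, 72h, ...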

kalantar commented 2 years ago

Perhaps baseline and candidate are misnomers, and come with baggage of past Iter8 usage. Perhaps we should use the nomenclature in this article. That would be source and target.

I agree to a change in nomenclature. Though I might even prefer something like source and mirror.

sriumcp commented 2 years ago

The intent here seems to be to loop a single task.

Loop through two tasks -- collect metrics, and validate SLOs.

Furthermore, the reason for the looping does not seem to be intrinsic to the solution. Rather, it is driven by a desire to be able to provide "recent" feedback to the user.

The desire is to accomplish what the CRD model of looping accomplished for us in Iter8 v0.7, without actually needing to write extra looping code -- we can simply piggyback on the existing Kubernetes CronJob mechanism, which already does this heavy lifting. Previously, the CRD + controller model used to update the experiment status. Now the new model will simply update the result secret. That's the only difference in terms of the implementation.

From the end-user perspective, conceptually the looping is identical, though the fact that we're using a cronjob might mean slight variations in how we expose the looping options.

I also agree that using the cron schedule is more complex than simply asking for a time between loops/total duration; but maybe it is an OK starting point for an MVP implementation of the mirroring experiment?... Because, with a cron schedule, we don't have to write a single line of extra code (in the template) to parse it further.

sriumcp commented 2 years ago

It would be nice to see the output of the following commands ....

# make sure ./charts folder has istio-mirroring folder underneath it

# update dependencies for the istio-mirroring chart
helm dependency update charts/istio-mirroring

# generated manifests will be imperfect; release info would be missing; still useful
helm template charts/istio-mirroring --set # whatever needs to be set

Producing the above output involves creating a cron-job template in the iter8lib chart, using that to create the istio-mirroring chart, and making sure there are no syntax or other errors so that the template command works.
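
A rough sketch of what such a cron-job named template in iter8lib could look like (the define name and value paths are illustrative; the body mirrors the generated CronJob shown further down in this thread):

{{- define "iter8lib.cronjob" }}
apiVersion: batch/v1
kind: CronJob
metadata:
  name: {{ .Release.Name }}-{{ .Release.Revision }}-job
spec:
  schedule: {{ .Values.cron.schedule | quote }}
  jobTemplate:
    spec:
      template:
        spec:
          containers:
          - name: iter8
            image: iter8-tools/iter8:0.9
            imagePullPolicy: Always
            command: ["/bin/sh", "-c"]
            args:
            - iter8 k run --namespace {{ .Release.Namespace }} --group {{ .Release.Name }} --revision {{ .Release.Revision }}
          restartPolicy: Never
      backoffLimit: 0
{{- end }}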

Alan-Cha commented 2 years ago

I have drafted up some templates for the mirroring experiments here but I have some design questions.

You can try them out by using:

git clone https://github.com/Alan-Cha/iter8.git iter8-alan
cd iter8-alan
git checkout mirroring-temp
cd charts
helm dependency update istio-mirroring
helm template istio-mirroring/ --set url=http://127.0.0.1/get --set destination_workload=myapp --set destination_workload_namespace=default

which should produce the following:

# Source: istio-mirroring/templates/k8s.yaml
apiVersion: v1
kind: Secret
metadata:
  name: RELEASE-NAME-1-spec
stringData:
  experiment.yaml: |
    - task: collect-metrics-database
      with:
        versionInfo:
        - destination_workload: myapp
          destination_workload_namespace: default

          # TODO: Should we make this more generic using the below?
          #         null
    # task: validate service level objectives for app using
    # the metrics collected in an earlier task
    - task: assess-app-versions
      with:
        SLOs:
        - metric: istio/error-rate
          upperLimit: 0

  metrics.yaml: |
url: prom-url.xyz/api/v1/query
provider: Istio
method: GET
# Inputs for the template:
#   app                                       string
#   chart                                     string
#   connection_security_policy                string
#   destination_app                           string
#   destination_canonical_revision            string
#   destination_canonical_service             string
#   destination_cluster                       string
#   destination_principal                     string
#   destination_service                       string
#   destination_service_name                  string
#   destination_service_namespace             string
#   destination_version                       string
#   heritage                                  string
#   install_operator_istio_io_owning_resource string
#   instance                                  string
#   istio                                     string
#   istio_io_rev                              string
#   job                                       string
#   namespace                                 string
#   operator_istio_io_component               string
#   pod                                       string
#   pod_template_hash                         string
#   release                                   string
#   request_protocol                          string
#   response_code                             string
#   response_flags                            string
#   service_istio_io_canonical_name           string
#   service_istio_io_canonical_revision       string
#   sidecar_istio_io_inject                   string
#   source_app                                string
#   source_canonical_revision                 string
#   source_canonical_service                  string
#   source_cluster                            string
#   source_principal                          string
#   source_version                            string
#   source_workload                           string
#   source_workload_namespace                 string
#
# Inputs for the metrics (output of template):
#   destination_workload                      string
#   destination_workload_namespace            string
#   StartingTime                 int64 (UNIX time stamp)
#
# Note: ElapsedTime is produced by Iter8
metrics:
- name: request-count
  type: counter
  description: |
    Number of requests
  params:
  - name: query
    value: |
      sum(last_over_time(istio_requests_total{
        reporter='source',
        {{- if .destination_workload }}
          destination_workload="{{.destination_workload}}",
        {{- end }}
        {{- if .destination_workload_namespace }}
          destination_workload_namespace="{{.destination_workload_namespace}}",
        {{- end }}
      }[{{.ElapsedTime}}s])) or on() vector(0)
  jqExpression: .data.result[0].value[1]
- name: error-count
  type: counter
  description: |
    Number of non-successful requests
  params:
  - name: query
    value: |
      sum(last_over_time(istio_requests_total{
        response_code=~'5..',
        reporter='source',
        {{- if .destination_workload }}
          destination_workload="{{.destination_workload}}",
        {{- end }}
        {{- if .destination_workload_namespace }}
          destination_workload_namespace="{{.destination_workload_namespace}}",
        {{- end }}
      }[{{.ElapsedTime}}s])) or on() vector(0)
  jqExpression: .data.result[0].value[1]
- name: error-rate
  type: gauge
  description: |
    Percentage of non-successful requests
  params:
  - name: query
    value: |
      sum(last_over_time(istio_requests_total{
        response_code=~'5..',
        reporter='source',
        {{- if .destination_workload }}
          destination_workload="{{.destination_workload}}",
        {{- end }}
        {{- if .destination_workload_namespace }}
          destination_workload_namespace="{{.destination_workload_namespace}}",
        {{- end }}
      }[{{.ElapsedTime}}s])) or on() vector(0)/sum(last_over_time(istio_requests_total{
        reporter='source',
        {{- if .destination_workload }}
          destination_workload="{{.destination_workload}}",
        {{- end }}
        {{- if .destination_workload_namespace }}
          destination_workload_namespace="{{.destination_workload_namespace}}",
        {{- end }}
      }[{{.ElapsedTime}}s])) or on() vector(0)
  jqExpression: .data.result[0].value[1]
- name: le500ms-latency-percentile
  type: gauge
  description: |
    Less than 500 ms latency
  params:
  - name: query
    value: |
      sum(last_over_time(istio_request_duration_milliseconds_bucket{
        le='500',
        reporter='source',
        {{- if .destination_workload }}
          destination_workload="{{.destination_workload}}",
        {{- end }}
        {{- if .destination_workload_namespace }}
          destination_workload_namespace="{{.destination_workload_namespace}}",
        {{- end }}
      }[{{.ElapsedTime}}s])) or on() vector(0)/sum(last_over_time(istio_request_duration_milliseconds_bucket{
        le='+Inf',
        reporter='source',
        {{- if .destination_workload }}
          destination_workload="{{.destination_workload}}",
        {{- end }}
        {{- if .destination_workload_namespace }}
          destination_workload_namespace="{{.destination_workload_namespace}}",
        {{- end }}
      }[{{.ElapsedTime}}s])) or on() vector(0)
  jqExpression: .data.result[0].value[1]
- name: mean-latency
  type: gauge
  description: |
    Mean latency
  params:
  - name: query
    value: |
      sum(last_over_time(istio_request_duration_milliseconds_sum{
        reporter='source',
        {{- if .destination_workload }}
          destination_workload="{{.destination_workload}}",
        {{- end }}
        {{- if .destination_workload_namespace }}
          destination_workload_namespace="{{.destination_workload_namespace}}",
        {{- end }}
      }[{{.ElapsedTime}}s])) or on() vector(0)/sum(last_over_time(istio_requests_total{
        reporter='source',
        {{- if .destination_workload }}
          destination_workload="{{.destination_workload}}",
        {{- end }}
        {{- if .destination_workload_namespace }}
          destination_workload_namespace="{{.destination_workload_namespace}}",
        {{- end }}
      }[{{.ElapsedTime}}s])) or on() vector(0)
  jqExpression: .data.result[0].value[1]
---
# Source: istio-mirroring/templates/k8s.yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: RELEASE-NAME-1-spec-role
rules:
- apiGroups: [""]
  resources: ["secrets"]
  resourceNames: ["RELEASE-NAME-1-spec"]
  verbs: ["get"]
---
# Source: istio-mirroring/templates/k8s.yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: RELEASE-NAME-1-result-role
rules:
- apiGroups: [""]
  resources: ["secrets"]
  resourceNames: ["RELEASE-NAME-1-result"]
  verbs: ["create", "get", "update"]
---
# Source: istio-mirroring/templates/k8s.yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: RELEASE-NAME-1-spec-rolebinding
subjects:
- kind: ServiceAccount
  name: default
roleRef:
  kind: Role
  name: RELEASE-NAME-1-spec-role
  apiGroup: rbac.authorization.k8s.io
---
# Source: istio-mirroring/templates/k8s.yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: RELEASE-NAME-1-result-rolebinding
subjects:
- kind: ServiceAccount
  name: default
roleRef:
  kind: Role
  name: RELEASE-NAME-1-result-role
  apiGroup: rbac.authorization.k8s.io
---
# Source: istio-mirroring/templates/k8s.yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: RELEASE-NAME-1-job
spec:
  schedule: "*/1 * * * *"
  jobTemplate:
    spec:
      template:
        spec:
          containers:
          - name: iter8
            image: iter8-tools/iter8:0.9
            imagePullPolicy: Always
            command:
            - "/bin/sh"
            - "-c"
            - |
              iter8 k run --namespace default --group RELEASE-NAME --revision 1
          restartPolicy: Never
      backoffLimit: 0

There are some design questions that need to be answered. For example, I have created a number of istio-specific templates such as _k-spec-secret-istio.tpl and _task-istio.tpl. Is it possible to create something more generic? Is istio.metrics a good name for the _istio.metrics.tpl as well?

sriumcp commented 2 years ago
  1. User already followed Istio's mirroring tutorial to set up mirroring
  2. User already followed Istio's Prometheus tutorial to add the prom add-on
  3. Iter8 experiment focuses on SLO validation using Prom metrics ...

As far as Istio is concerned ... there are a source and a mirror version of the app ...

As far as Iter8 experiment is concerned, there is only app, which corresponds to the mirrored version.

helm template charts/istio-mirroring/ \
--set metric.url=http://127.0.0.1/get \
--set metric.labels.destination_workload=myapp \
--set metric.labels.destination_workload_namespace=default \
--set cron.schedule="1/1 * * * *" \
--set SLOs.istio/error-rate=0

The above is good for dev purposes. In reality, the user will invoke the above as follows ...

iter8 k launch -c istio-mirroring/ \
--set metric.url=http://127.0.0.1/get \
--set metric.labels.destination_workload=myapp \
--set metric.labels.destination_workload_namespace=default \
--set cron.schedule="1/1 * * * *" \
--set SLOs.istio/error-rate=0

  1. Forbid concurrency through concurrencyPolicy in the CronJob template (please refer to the CronJob API); see the sketch below.
  2. Limiting the number of loops is not supported by CronJob execution (though history limits are supported for CronJob storage).
  3. Update the number of loops in the experiment result object.
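
A minimal sketch of how the first two points could look in the CronJob spec, building on the generated manifest above (concurrencyPolicy and the history limits are standard batch/v1 CronJob fields; the values are illustrative):

apiVersion: batch/v1
kind: CronJob
metadata:
  name: RELEASE-NAME-1-job
spec:
  schedule: "*/1 * * * *"
  # do not start a new loop while the previous one is still running
  concurrencyPolicy: Forbid
  # a CronJob cannot cap the number of executions (loops); it can only
  # limit how many finished Jobs it stores
  successfulJobsHistoryLimit: 3
  failedJobsHistoryLimit: 1
  jobTemplate:
    spec:
      template:
        spec:
          containers:
          - name: iter8
            image: iter8-tools/iter8:0.9
            command: ["/bin/sh", "-c"]
            args:
            - iter8 k run --namespace default --group RELEASE-NAME --revision 1
          restartPolicy: Never
      backoffLimit: 0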


sriumcp commented 2 years ago

@Alan-Cha At some point (perhaps after the CNCF presentation), we may want to consider the following changes ...

  1. Rename _task_rbac to _k_task_rbac
  2. Move versionInfo to metrics (simplifies helm chart just a little bit…) — authors of custom metrics do not need to understand the db_task template + all the metrics info will be in one place.