scp756-221 / term-project-cloudriven

term-project-cloudriven created by GitHub Classroom

Understand Gatling/Grafana/Prometheus - Andrew #9

Closed beinfluential88 closed 2 years ago

beinfluential88 commented 2 years ago

In order for the team to come up with a concrete plan, we need to understand how Gatling is used for load testing and how Grafana and Prometheus allow us to monitor performance/load metrics.

beinfluential88 commented 2 years ago

Procedure to implement Gatling

Objective: To monitor the performance of our system and detect imminent system failures early, we plan to implement the following tools for system analysis.

  1. Gatling: A load-testing tool that mimics the behavior of a user. (A heavier load can be applied to the system by increasing the number of synthetic users.)
  2. Grafana: A dashboard tool for real-time monitoring of system performance. Grafana queries Prometheus to retrieve the relevant metrics, and these queries can be adjusted for different analyses.
  3. Prometheus: A system for gathering and storing metrics in a time-series database.

Prerequisites

  1. The AWS EKS cluster should be up and running; it will serve the four microservices: db (database service), s1 (user service), s2 (music service), and s3 (playlist service).
     a. Start a fresh EKS cluster: make -f eks.mak start

  2. Ensure AWS DynamoDB is initialized. The tables must be available for db (the database service) to serve the other three microservices (s1/s2/s3):
     aws dynamodb list-tables

  3. Provision the cluster. This includes:
     a. Create a namespace within the cluster in which the applications will be placed:
        kubectl create ns c756ns
        kubectl config set-context --current --namespace=c756ns
     b. Provision the Kubernetes cluster. This includes installing Istio, installing the Prometheus stack (by calling obs.mak recursively), and deploying and monitoring the four microservices:
        make -f k8s.mak provision

  4. Get the Grafana URL, with which we can access the dashboard: make -f k8s.mak grafana-url
     • User: admin
     • Password: prom-operator
     Note: The hostname is obtained from the istio-system namespace:
     kubectl get -n istio-system svc/grafana-ingress -o jsonpath="{.status.loadBalancer.ingress[0]['ip','hostname']}"
     Parameters: 1: path to kubectl 2: namespace 3: the resource to query (typically an svc)

How does Gatling work?

1. We will be using a Gatling Docker image that allows us to create and apply synthetic load to our system: ghcr.io/scp-2021-jan-cmpt-756/gatling:3.4.2

2. Scenarios of user behavior are defined in “ReadTables.scala”.
   a. The package name is defined; it will be used to trigger a Gatling instance later: package proj756
   b. The imports required for Gatling to work:
      import scala.concurrent.duration._
      import io.gatling.core.Predef._
      import io.gatling.http.Predef._
   c. “Utility” object:
      envVarToInt("USERS", 1) – utility to get an Int from an environment variable (e.g., the number of users is set on the “docker container run” command).
      envVar("CLUSTER_IP", "127.0.0.1") – utility to get a string from an environment variable (e.g., the cluster IP is set on the “docker container run” command).
   d. “RMusic” and “RUser” objects – scenarios to be tested against the respective services, sending an HTTP GET request with a {UUID} continuously, once every second.
      Note: eager() loads the whole data set into memory before the simulation starts, saving disk access at runtime. random() randomly picks an entry in the sequence. circular() goes back to the top of the sequence once the end is reached.
   e. “RUserVarying” and “RMusicVarying” objects – scenarios to be tested against the respective services, sending an HTTP GET request with a {UUID} continuously, with varying intervals between calls. (Each interval is randomly selected between 1 and 60 seconds.)
   f. “ReadTablesSim” class – inherits from the “Simulation” class and defines the HTTP protocol for the simulations (e.g., the cluster IP is read from the environment variables set on the “docker container run” command).
   g. “ReadUserSim” and “ReadMusicSim” classes – called directly by the “docker container run” command; they inject independent users (as defined in the “RUser” and “RMusic” objects) via the HTTP protocol defined in the “ReadTablesSim” class.
   h. “ReadBothVaryingSim” class – called directly by the “docker container run” command; it injects concurrent users (as defined in the “RMusicVarying” and “RUserVarying” objects) via the HTTP protocol defined in the “ReadTablesSim” class.
      Note: There are two types of workload models for injection, open vs. closed.
      • Closed systems, where you control the concurrent number of users. Closed systems cap the number of concurrent users: at full capacity, a new user can effectively enter the system only once another exits.
      • Open systems, where you control the arrival rate of users. Open systems have no control over the number of concurrent users: users keep arriving even when the application has trouble serving them.
      Note: For the closed model, there are two injection methods:
      • constantConcurrentUsers(nbUsers).during(duration): inject so that the number of concurrent users in the system stays constant.
      • rampConcurrentUsers(fromNbUsers).to(toNbUsers).during(duration): inject so that the number of concurrent users in the system ramps linearly from one number to another.
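The closed vs. open distinction above can be sketched in a few lines of plain Python (not Gatling's Scala DSL; all numbers below are made up for illustration):

```python
def closed_model_concurrency(n_users, duration_s):
    """Closed model: concurrency is capped. A new user enters only when
    another exits, so concurrency stays pinned at n_users."""
    return [n_users] * duration_s

def open_model_concurrency(arrival_rate, service_time_s, duration_s):
    """Open model: users arrive at a fixed rate regardless of how many are
    already in the system. Concurrency at each second is the number of
    arrivals still being served."""
    concurrency = []
    in_flight = []  # departure times of users currently in the system
    for t in range(duration_s):
        in_flight = [d for d in in_flight if d > t]   # finished users leave
        for _ in range(arrival_rate):                 # new users arrive anyway
            in_flight.append(t + service_time_s)
        concurrency.append(len(in_flight))
    return concurrency

closed = closed_model_concurrency(10, 5)                                   # [10, 10, 10, 10, 10]
open_ = open_model_concurrency(arrival_rate=2, service_time_s=3, duration_s=5)  # [2, 4, 6, 6, 6]
```

Note how in the open model concurrency keeps climbing until service completions catch up with arrivals, which is exactly the situation that exposes an overloaded service.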

3. Create a script that will trigger Gatling (e.g., gatling-1-music.sh):

docker container run --detach --rm \
  -v ${PWD}/gatling/results:/opt/gatling/results \
  -v ${PWD}/gatling:/opt/gatling/user-files \
  -v ${PWD}/gatling/target:/opt/gatling/target \
  -e CLUSTER_IP=`tools/getip.sh kubectl istio-system svc/istio-ingressgateway` \
  -e USERS=1 \
  -e SIM_NAME=ReadMusicSim \
  --label gatling \
  ghcr.io/scp-2021-jan-cmpt-756/gatling:3.4.2 \
  -s proj756.ReadMusicSim

4. To list the Gatling containers currently running: tools/list-gatling.sh

5. To stop all the Gatling containers: tools/kill-gatling.sh

beinfluential88 commented 2 years ago

Prometheus Basics - Time Series

  1. We can query Prometheus directly without Grafana.
  2. We can output metrics to Prometheus.

Prometheus Basics – Two Fundamental Roles

  1. First, it gathers and records metrics in a time-series database (TSDB), which includes special compression techniques optimized for this type of data.
  2. Second, it supports queries against that database. It features a query language, PromQL, that meets the specific needs of time series data.

Prometheus Technical Details

  1. The set of metrics available from a given container is determined by that container, not Prometheus.
  2. The set of metrics available from our three microservices is defined by the Python client library we use, the Python Prometheus Flask exporter. We may define new metrics for our term project.
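As a sketch of how a new metric might be defined with the Prometheus Flask exporter: the route, metric name, and handler below are hypothetical examples for our term project, not code from our repo. (This fragment needs a running Flask app and the prometheus_flask_exporter package to do anything.)

```python
from flask import Flask
from prometheus_flask_exporter import PrometheusMetrics

app = Flask(__name__)
# Registers default metrics such as flask_http_request_total and
# exposes everything at /metrics for Prometheus to scrape.
metrics = PrometheusMetrics(app)

# Hypothetical custom counter: counts reads of the music service,
# labeled by HTTP status code of the response.
@app.route("/api/v1/music/<music_id>")
@metrics.counter("music_read_total", "Reads of the music service",
                 labels={"status": lambda r: r.status_code})
def read_music(music_id):
    return {"music_id": music_id}
```

Once the service is deployed, the new series would appear in Prometheus under the metric name given to metrics.counter.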

Prerequisites

  1. The AWS EKS cluster should be up and running; it will serve the four microservices: db (database service), s1 (user service), s2 (music service), and s3 (playlist service).
     a. Start a fresh EKS cluster: make -f eks.mak start

  2. Ensure AWS DynamoDB is initialized. The tables must be available for db (the database service) to serve the other three microservices (s1/s2/s3):
     aws dynamodb list-tables

  3. Provision the cluster. This includes:
     a. Create a namespace within the cluster in which the applications will be placed:
        kubectl create ns c756ns
        kubectl config set-context --current --namespace=c756ns
     b. Provision the Kubernetes cluster. This includes installing Istio, installing the Prometheus stack (by calling obs.mak recursively), and deploying and monitoring the four microservices:
        make -f k8s.mak provision

  4. Get the Prometheus URL, with which we can run queries on Prometheus directly: make -f k8s.mak prometheus-url

Note: The hostname is obtained from the istio-system namespace:
kubectl get -n istio-system svc/prom-ingress -o jsonpath="{.status.loadBalancer.ingress[0]['ip','hostname']}"
Parameters: 1: path to kubectl 2: namespace 3: the resource to query (typically an svc)

A query returning a single time series

The following query requests the current value of every time series whose service label is assigned the string cmpt756db: flask_http_request_total{service="cmpt756db"}

flask_http_request_total{container="cmpt756db",endpoint="http", instance="10.244.1.10:30002",job="cmpt756db",method="GET",namespace="c756ns", pod="cmpt756db-79ddc5446d-2566f",service="cmpt756db",status="200"}

Instant vector: A query returning multiple time series

This requests every time series for our sample metric, regardless of the values of its keys. Note that the returned values were not necessarily sampled at the same time; each is simply the most recent sample for its time series. flask_http_request_total
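Queries like the ones in this section can also be issued programmatically against Prometheus's HTTP API (GET /api/v1/query). A minimal stdlib-only sketch for building such a request URL, using a placeholder base URL in place of whatever `make -f k8s.mak prometheus-url` prints:

```python
from urllib.parse import urlencode

def instant_query_url(prom_base, promql):
    """Build the URL for a Prometheus instant query.
    prom_base is the Prometheus URL (here a placeholder);
    promql is the query string, percent-encoded for the query parameter."""
    return f"{prom_base}/api/v1/query?" + urlencode({"query": promql})

url = instant_query_url("http://example.com:9090",
                        'flask_http_request_total{service="cmpt756db"}')
# http://example.com:9090/api/v1/query?query=flask_http_request_total%7Bservice%3D%22cmpt756db%22%7D
```

Fetching that URL (e.g. with urllib.request or curl) returns the same instant vector as the Prometheus web UI, as JSON.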


Range vector: A query returning several values from a single series

Our next query returns to the single time series, but asks for all of its samples over a given time range, returning a range vector: flask_http_request_total{service="cmpt756db"}[5m] The [5m] suffix requests all samples from the most recent 5 minutes, ordered from oldest to most recent. The entries in the Value column now include both a count and a timestamp, separated by an @ symbol. The timestamp is in seconds since January 1, 1970, GMT (the Unix epoch). Copy one of the timestamps and paste it into a Unix epoch converter to decode the time into something more understandable.
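Instead of a web-based epoch converter, the timestamps after the @ symbol can be decoded with a couple of lines of Python:

```python
from datetime import datetime, timezone

def decode_prom_timestamp(ts):
    """Convert a Prometheus sample timestamp (seconds since the Unix
    epoch, UTC) into a readable ISO-8601 string."""
    return datetime.fromtimestamp(ts, tz=timezone.utc).isoformat()

decode_prom_timestamp(0)  # '1970-01-01T00:00:00+00:00'
```

Paste any timestamp copied from the Value column in place of the 0.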


Multiple range vectors

We can run a query requesting ranges for multiple time series: flask_http_request_total[5m]


Matching query types to vector type

The PromQL language enforces the distinction between instant and range vectors.
• Aggregation operators such as avg or min can only be applied to instant vectors.
• Functions that compute a value over time, such as increase or rate, can only be applied to range vectors.
The list of PromQL functions specifies, for each function, whether a vector argument must be instant or range.

Computing a rate across a range (feat. range vectors)

The rate of HTTP calls per second (the increase divided by the number of seconds):
rate(flask_http_request_total{service="cmpt756db"}) WRONG: rate() requires a range vector.
rate(flask_http_request_total{service="cmpt756db"}[5m]) CORRECT.
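What rate() computes over a range vector can be approximated in a few lines of Python. This is a simplification: the real PromQL rate() also extrapolates to the window boundaries and handles counter resets, which this sketch ignores, and the sample values below are made up.

```python
def prom_rate(samples):
    """Approximate PromQL's rate(): the per-second increase of a counter
    over a range window. samples is a list of (timestamp, value) pairs
    from a range vector, ordered oldest to newest."""
    (t0, v0), (tn, vn) = samples[0], samples[-1]
    return (vn - v0) / (tn - t0)

# Five counter samples over 4 seconds; the counter grows by 2 each second:
samples = [(0, 0), (1, 2), (2, 4), (3, 6), (4, 8)]
prom_rate(samples)  # 2.0 requests per second
```

This is why rate() needs a range vector: computing a per-second change requires at least two samples spread over time, which an instant vector cannot supply.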

Computing an average across an instant (feat. instant vectors)

The average number of HTTP requests per time series since each series began:
avg(flask_http_request_total{service="cmpt756db"}[5m]) WRONG: avg() requires an instant vector.
avg(flask_http_request_total{service="cmpt756db"}) CORRECT.
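Conversely, avg() only needs one current value per series, which is exactly what an instant vector provides. A small Python sketch, with hypothetical label sets and values standing in for real query results:

```python
def prom_avg(instant_vector):
    """Approximate PromQL's avg(): the mean of the current values of all
    series in an instant vector, given as {label_set: value}."""
    return sum(instant_vector.values()) / len(instant_vector)

# Hypothetical current values of flask_http_request_total for two series:
vec = {
    '{service="cmpt756db",status="200"}': 30.0,
    '{service="cmpt756db",status="404"}': 10.0,
}
prom_avg(vec)  # 20.0
```

One value per series is averaged; there is no time dimension involved, which is why handing avg() a range vector is a type error in PromQL.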

bingsoorim commented 2 years ago

This is considered done. Closing the issue.