Can't get example dashboards to fully function

sarg3nt commented 1 year ago

I've been struggling for a couple of days to get the provided Grafana dashboard dashboard-results.json to function fully. After reading through the closed issues I finally discovered I need to pass K6_PROMETHEUS_RW_TREND_AS_NATIVE_HISTOGRAM=true to k6, but that hasn't solved it for me. I do have the correct version of Prometheus and have --enable-feature=native-histograms set. I'm receiving no errors in the Grafana dashboard now, so I think everything is set up as it should be.

We are running Grafana 9.3.6 and Prometheus 2.42.0

Here are my thoughts / questions as a list:

1. The README.md should be updated with better instructions on what all needs to be set up, or at minimum a link to https://k6.io/docs/results-output/real-time/prometheus-remote-write/#about-metrics-mapping
2. What is testid? Nothing ever appears in that list and I can't find it anywhere in the k6 docs. All the queries seem to rely on it, but I don't know what it is, where it comes from, or whether there is something I should be setting.
3. P95 Response time is always 0.
4. Selecting a URL in the URL drop down does not seem to do anything, i.e. data for unselected URLs still shows on the graphs.
5. My main graph only has "Active VUs" and "Request Rate", whereas your screenshots also have "Failed request rate" and "Response Time". I could see failed request rate not showing as I have no failed requests, but response time is kind of important.
What am I missing?

See below for a screenshot of my dashboard

Allaman commented 1 year ago

Hi @sarg3nt, I can answer your second question as I stumbled over this one, too. testid is just an (arbitrary) tag. Have a look at the docker-script

Edit: just stumbled over the explanation in the README :)

codebien commented 1 year ago

Hey @sarg3nt, thanks for the feedback, we will use it for future improvements to the documentation.

@jwcastillo can you help with points 3 and 4, please?

jwcastillo commented 1 year ago

yes, I will check it

sarg3nt commented 1 year ago

@Allaman which README did you find that in? I'm not seeing it... Oh, the root README.md. My bad. I was combing over the dashboards README.md. Since this is built into k6 now, I barely glanced at the root README.md.

sarg3nt commented 1 year ago

Does anyone know if the testid needs to be globally unique for every run or just unique for the specific .js file?
I.e. is the testid + time range good enough?

codebien commented 1 year ago

> was combing over the dashboards README.md

@sarg3nt How did you find it? We may need to update a link.

> Does anyone know if the testid needs to be globally unique for every run or just unique for the specific .js file?

You can use the docker-run.sh for convenience as the README suggests.
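One way to sidestep the uniqueness question is to make every run's testid unique by construction, for example by appending a UTC timestamp to the script name. The sketch below is an illustration only, not the contents of docker-run.sh. `make_testid` is a hypothetical helper, and the format without spaces or slashes is a choice made here so that, for script names built from letters, digits, `_` and `-`, the value needs no shell quoting or URL-encoding.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Build a testid from a script name plus a UTC timestamp.
# The second argument is optional and exists only to make the helper testable.
make_testid() {
  local name="$1"
  local stamp="${2:-$(date -u '+%Y%m%dT%H%M%SZ')}"
  printf '%s-%s\n' "${name}" "${stamp}"
}

testid="$(make_testid load_bulletins)"

# Only print the command here; running it requires k6 and a Prometheus endpoint.
echo "k6 run --tag testid=${testid} -o experimental-prometheus-rw ./load_bulletins.js"
```

Two runs started in the same second would still collide, so a counter or random suffix may be needed for parallel runs.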

sarg3nt commented 1 year ago

@codebien I searched around until I found it in the root https://github.com/grafana/xk6-output-prometheus-remote/blob/main/README.md for the project. I ignored it earlier because we don't use Docker compose. Our prom / grafana are already deployed to internal k8s clusters.

Checked out the script, I basically just finished writing the same thing. :)

I got the testid working but still no P95 response times or Response time in the main graph

codebien commented 1 year ago

Hey @sarg3nt, @jwcastillo submitted a fix in #113. Can you check out the branch and see if it resolves your issues, please?

manubell commented 1 year ago

Trying these dashboards as well (also tested the dashboards committed in #113) and the P95 stat doesn't seem to work for me either. I don't think it's related to the dashboards either; it seems Grafana is not showing any data for the native histogram metrics.

In Prometheus I have the following features enabled: remote-write-receiver, native-histograms. When I look in Prometheus, the time series seem to be correctly saved in the newer format.

But then in Grafana when I go to discover and search for histogram_quantile(0.95, rate(k6_http_req_duration_seconds[1m])) I don't get any results.

Is there perhaps a setting for Grafana that sarg3nt and I overlooked and need for this to be functional?

manubell commented 1 year ago

Update:

As I play with it a bit more, it seems that having rate in the query doesn't work with histograms. This is my first encounter with the new histogram format, so I don't know much about them yet.

Below were 2x 1minute k6 runs. (didn't have a testid specified either)

Without rate I get some stats:

With rate I get nothing at all.

codebien commented 1 year ago

Hi @manubell, which versions are you using? Grafana 9.3.6 and Prometheus 2.42.0?

The sum function should be part of the second argument:

histogram_quantile(0.9, sum by (testid) (rate(k6_http_req_duration_seconds[1m])))
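The nesting order is easy to get wrong, so here is a small sketch that assembles the query in the shape given above: rate() innermost, then sum by (testid), with histogram_quantile() outermost. `quantile_query` is a hypothetical helper; the metric name and the 1m window are the ones used in this thread.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Assemble a quantile query for a native histogram metric.
# Order matters: rate() first, then the aggregation, then histogram_quantile().
quantile_query() {
  local quantile="$1" metric="$2" window="$3"
  printf 'histogram_quantile(%s, sum by (testid) (rate(%s[%s])))\n' \
    "${quantile}" "${metric}" "${window}"
}

# Print the P95 variant so it can be pasted into Grafana Explore or the Prometheus UI.
quantile_query 0.95 k6_http_req_duration_seconds 1m
```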

sarg3nt commented 1 year ago

I can see the same result in Prometheus for k6_http_req_duration_seconds that @manubell shows, but when I try to replicate the graph in Grafana with sum by(testid) (histogram_quantile(0.95, k6_http_req_duration_seconds)) I'm still not seeing anything.

codebien commented 1 year ago

Hey @sarg3nt, can you report an anonymized k6 script and the exact commands you're running, please? In this way, we should be able to reproduce the issue.

Do you open the Test Result dashboard following the link from the Test List dashboard?

sarg3nt commented 1 year ago

@codebien please see files attached. I'm not including the login.js file as I doubt it is useful to you, let me know if you think you need it.

run_loadtest.sh: This script takes input and generates a k6 command that looks something like this:
K6_PROMETHEUS_RW_TREND_AS_NATIVE_HISTOGRAM=true K6_PROMETHEUS_RW_SERVER_URL=https://<MY-PROMETHEUS-URL>/api/v1/write k6 run -e LOADTEST_REALM=<REALM> -e LOADTEST_BASE_URL=<BASE-URL-HERE> -e TESTID=load_bulletins 03/10/23 12:06:23 --tag testid=load_bulletins 03/10/23 12:06:23 -o experimental-prometheus-rw ./load_bulletins.js

#!/usr/bin/bash
set -euo pipefail
IFS=$'\n\t'

#<Colors>
red="\033[1;31m"
yellow="\033[1;33m"
green="\033[1;32m"
blue="\033[1;34m"
nc="\033[0m"
#</Colors>

show_help () {
    echo -e "${green}== k6 LoadTester ==${nc}"
    echo -e "The run_loadtest.sh script is a wrapper around the k6 loadtesting utility."
    echo -e "  Note: This tool defaults to using the local test account and does not support SSO."
    echo
    echo -e "${blue}List available scripts: ${nc}"
    echo -e "  ./run_loadtest.sh ls"
    echo
    echo -e "${blue}Show k6 run help: ${nc}"
    echo -e "  ./run_loadtest.sh <script> -h"
    echo
    echo -e "${blue}Run loadtest with script on local build: ${nc}"
    echo -e "  ./run_loadtest.sh [script] [optional=realm]"
    echo
    echo -e "${blue}Run loadtest with script on remote deployment: ${nc}"
    echo -e "  ./run_loadtest.sh [script] [optional=realm] [optional=baseUrl]"
    echo ""
    echo "The following flags can be passed to k6"
    cat << EOF
 Flags:
  -o  --out cloud           send test results to k6 cloud, you must be logged into a cloud account
  -o, --out uri             uri for an external metrics database
  -a, --address string      address for the REST API server (default "localhost:6565")
  -c, --config string       JSON config file (default "/home/vscode/.config/loadimpact/k6/config.json")
  -h, --help                help for k6
      --log-format string   log output format
      --log-output string   change the output for k6 logs, possible values are stderr,stdout,none,loki[=host:port],file[=./path.fileformat] (default "stderr")
      --no-color            disable colored output
  -q, --quiet               disable progress updates
  -v, --verbose             enable verbose logging
EOF
}

validateTools() {
  toolsInstalled=true

  # verify k6 is installed; "|| true" keeps "set -e" from aborting before the error message is printed
  k6Tool=$(command -v k6 || true)
  if [ -z "${k6Tool}" ]; then
      echo -e "${red}Error: You do not currently have k6 installed.${nc}"
      toolsInstalled=false
  fi

  if [ "${toolsInstalled}" == false ]; then
    echo ""
    echo -e "${red}You are missing the required tools to proceed${nc}"
    echo ""
    exit 1
  fi
}

main () {
    validateTools

    # Set up local vars
    local script_dir=""
    local root_path=""
    script_dir=$(cd "$(dirname "${BASH_SOURCE[0]}")" &>/dev/null && pwd -P)
    root_path=$(realpath "${script_dir}/../..")
    local loadtest_path="${root_path}/deployments/validators/loadtest"
    local test=""
    local realm="saas"
    local baseUrl="${realm}.localhost"
    local options=()
    local testid=""

    # Display help menu
    if [[ "${#}" == 0 || "${1}" == "--help" || "${1}" == "-h" ]]; then
        show_help
        exit 0
    fi

    # Load the test script from position 1
    # We know there is a first element, otherwise the above if would have triggered and we would have shown the help
    test="$1"
    # Loop through remaining parameters and process them looking for switches as needed.
    for (( i=2; i <= "$#"; i++ )); do
        # Check if passed arg has a - or -- in it, if so, add it directly to the options array and continue.
        if [[ "${!i}" == -* || "${!i}" == --* ]]; then
            options+=("${!i}")
            # Look ahead on item in the array if there is one and check if it starts with a - or --
            # If it does not, assume it is an argument to the current switch.
            # Example: '-o cloud' where above if added the -o and below if adds 'cloud'
            ((i=i+1))
            if (( i <= "$#")) && [[ "${!i}" != -* && "${!i}" != --*  ]]; then
                options+=("${!i}")
                continue
            fi

            # The inner if was not true so we need to decrement the counter and continue.
            ((i=i-1))
            continue
        fi

        # If this is the second item in the param list then it is the realm
        if (( i == 2 )); then
            realm="${!i}" && continue
        fi

        # If this is the third item in the param list then it is the base URL
        if (( i == 3 )); then
            baseUrl="${!i}" && continue
        fi
    done

    # Change into working directory
    cd "${loadtest_path}" || exit 1

    # Provide a list of scripts to run
    if [[ "${test}" == "list" || "${test}" == "ls"  ]]; then
        echo -e "${green}Current Load test scripts:${nc}"

        # List files in the loadtest directory
        for file in "./"*; do
            # skip login script
            if [[ "${file:2:-3}" == 'login' || "${file:2:-3}" == 'init' || "${file:2:-3}" == *'template' ]]; then
                continue
            fi

            echo "  ${file:2:-3}"
        done
        exit 0
    fi

    # Unable to load path
    if [[ ! -f "${loadtest_path}/${test}.js" ]]; then
        echo -e "${yellow}Unable to load${nc}" \
                "${green} ${test} ${nc}"
        echo -e "${yellow}Please validate you selected the right script.${nc}"
        exit 1
    fi

    # Build the testid 
    testid="${test} $(date '+%D %T')"

    # Save for later as I might use this to add some values to the grafana graph later
    #K6_PROMETHEUS_RW_TREND_STATS='avg,p(90),p(99),min,max'\

    # Output the k6 command we are going to run
    echo K6_PROMETHEUS_RW_TREND_AS_NATIVE_HISTOGRAM=true \
    K6_PROMETHEUS_RW_SERVER_URL="https://<MY-PROMETHEUS-URL>/api/v1/write" \
    k6 run\
    -e LOADTEST_REALM="${realm}"\
    -e LOADTEST_BASE_URL="${baseUrl}"\
    -e TESTID="${testid}"\
    --tag testid="${testid}"\
    -o experimental-prometheus-rw\
    "./${test}.js" "${options[@]}"

    # Run k6
    # See: https://k6.io/docs/results-output/real-time/grafana-cloud/
    # and  https://k6.io/docs/results-output/real-time/prometheus-remote-write/

    K6_PROMETHEUS_RW_TREND_AS_NATIVE_HISTOGRAM=true \
    K6_PROMETHEUS_RW_SERVER_URL="https://<MY-PROMETHEUS-URL>/api/v1/write" \
    k6 run\
    -e LOADTEST_REALM="${realm}"\
    -e LOADTEST_BASE_URL="${baseUrl}"\
    -e TESTID="${testid}"\
    --tag testid="${testid}"\
    -o experimental-prometheus-rw\
    "./${test}.js" "${options[@]}"
}

# Lets begin!
if ! (return 0 2> /dev/null); then
  (main "$@")
fi
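A side note on the "Output the k6 command" step: plain echo drops the quotes around the testid, so the printed command cannot be pasted back into a shell when the testid contains spaces. Below is a minimal sketch of one alternative, using bash's `printf '%q'` to print each argument shell-quoted. `print_command` is a hypothetical helper, not part of k6 or of the script above, and the environment-variable prefix would have to be printed separately.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Print a command line with every argument shell-quoted,
# so the output can be pasted back into bash unchanged.
print_command() {
  printf '%q ' "$@"
  printf '\n'
}

testid="load_bulletins 03/10/23 12:06:23"

# Keep the command in an array so the same words can be printed and later executed.
cmd=(k6 run --tag "testid=${testid}" -o experimental-prometheus-rw ./load_bulletins.js)

print_command "${cmd[@]}"
```

Holding the command in an array also removes the need to write the long k6 invocation twice, once for display and once for execution.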

init.js contains default values for number of VUs, checks, http_req_failed, etc. along with values needed for authentication and auth caching. The main purpose of this file is an attempt to keep the test scripts as DRY as we can and to codify a standard of expectations for our APIs.

// This file holds common initialization functionality that all tests will use
// including logging in

export function init() {
  let lt_settings = {
    users: {
      normal: 50,
      max: 100,
      //TODO: Determine what breaking needs to be
      breaking: 250,
    },
    // Will fail the test if the checks defined in the test do not succeed greater than the given percentage of time
    checks: ["rate>0.9"],
    http_req_failed: [{ threshold: "rate < 0.01", abortOnFail: true }],
    http_req_blocked: [{ threshold: "max < 2000", abortOnFail: false }],
    http_req_duration: ["p(95) <= 100", "p(99.9) < 1000"],
    realm: __ENV.LOADTEST_REALM,
    base_url: __ENV.LOADTEST_BASE_URL,
    // Should we cache the login token or not.  Default is true, false will cause a new login for each test.
    cache_token: __ENV.LOADTEST_CACHE_LOGIN_TOKEN,
    access_token: "",
    expires_on: 0,
  };

  // Add local login base URL if one was not passed.
  // Note: this is here as it's the best place for init code like this at the moment
  if (
    lt_settings.base_url === "isundefined" ||
    lt_settings.base_url === undefined ||
    lt_settings.base_url === ""
  ) {
    lt_settings.base_url = `${lt_settings.realm}.localhost`;
  }

  if (
    lt_settings.cache_token === "isundefined" ||
    lt_settings.cache_token === undefined ||
    lt_settings.cache_token === ""
  ) {
    lt_settings.cache_token = true;
  }

  return lt_settings;
}

load_bulletins.js: The k6 test script itself. Note that we are importing login.js but I have not included that file.

import http from "k6/http";
import { login } from "./login.js";
import { init } from "./init.js";
import { sleep, check } from "k6";

let lt_settings = init();

export const options = {
  insecureSkipTLSVerify: true,
  stages: [
    { duration: "30s", target: lt_settings.users.normal }, // Ramp up from 1 to normal load over 30 seconds
    { duration: "1m", target: lt_settings.users.normal }, // Stay at the normal load for 1 minute
    { duration: "30s", target: 0 }, // Ramp down to 0 users over 30 seconds
  ],
  thresholds: {
    checks: lt_settings.checks,
    http_req_failed: lt_settings.http_req_failed,
    http_req_blocked: lt_settings.http_req_blocked,
    http_req_duration: lt_settings.http_req_duration,
  },
};

export function setup() {
  // Set the start time in milliseconds to 15 seconds before actual start time to give some padding left on the graph
  let start_time = new Date().getTime() - 15000;
  return start_time;
}

export default function () {
  lt_settings = login({ lt_settings });
  const result = http.get(`https://<SOME-URL-TO-TEST>`, {
    headers: {
      "Content-Type": "application/json",
      Authorization: `Bearer ${lt_settings.access_token}`,
      tenant: lt_settings.realm,
    },
  });

  check(result, {
    "status is 200": (r) => r.status === 200,
    "has data": (r) => JSON.parse(r.body).length > 0,
    "api response text contains <SOME VALUE>": (r) => r.body.includes("<SOME VALUE>"),
  });

  sleep(1); // Pause 1s per iteration, so each VU runs at most about one iteration per second; scale by adding users
}

export function teardown(start_time) {
  // Set the end_time to actual end time + 90 seconds to give some padding right on the graph and
  // account for in flight requests to finish
  let end_time = Date.now() + 90000;
  let testid = encodeURIComponent(__ENV.TESTID);

  // Create a link for the user to click on to make viewing test results easy.
  console.log(
    `Test results can be viewed in Grafana here: https://<URL TO GRAFANA>/dashboard/d/01npcT44k/test-result?from=${start_time}&to=${end_time}&var-DS_PROMETHEUS=Prometheus-thanos&var-testid=${testid}&var-scenario=All&var-url=All&var-metrics=k6_http_req_waiting_seconds`
  );
}
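The teardown link relies on encodeURIComponent to make the testid safe for the var-testid query parameter. If the link were built in the bash wrapper instead, a rough equivalent could look like the sketch below. `urlencode` is a hypothetical helper that handles ASCII input only, and it encodes a few characters (such as `!` and `*`) that encodeURIComponent leaves alone.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Percent-encode every character except the RFC 3986 unreserved set.
# ASCII input only: multi-byte characters are not handled by this sketch.
urlencode() {
  local s="$1" out="" c i
  for (( i = 0; i < ${#s}; i++ )); do
    c="${s:i:1}"
    case "${c}" in
      [a-zA-Z0-9.~_-]) out+="${c}" ;;
      *) printf -v c '%%%02X' "'${c}"; out+="${c}" ;;
    esac
  done
  printf '%s\n' "${out}"
}

testid="load_bulletins 03/10/23 12:06:23"

# Print only the query parameter; the Grafana host and dashboard path stay as in teardown().
echo "var-testid=$(urlencode "${testid}")"
```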

codebien commented 1 year ago

Hey @sarg3nt, I don't see any related issue skimming your code, and running it I got the expected value for the end-of-test summary.

Versions used:

k6 v0.43.1 (2023-02-27T10:53:03+0000/v0.43.1-0-gaf3a0b89, go1.19.6, linux/amd64)
grafana 9.3.6
prometheus v2.42.0

Do you open the Test Result dashboard following the link from the Test List dashboard?

You haven't answered my question, this still sounds like the reason. If you open the Test Result dashboard directly then you will probably get the wrong time frame and the query will return a wrong number.

Two more checks to do:

Do you see the metric k6_http_req_duration_seconds from Grafana's Metric explorer?
Did you try to graph the metric using the Grafana Explorer or the Prometheus web app?

sarg3nt commented 1 year ago

@codebien sorry for missing the result dashboard question. Yes, I have tried from there as well with no luck. I can see k6_http_req_duration_seconds data in Prometheus and can see k6_http_req_duration_seconds as a value in the Grafana explorer, but it never returns data. Please keep in mind I'm very new to the Prometheus / Grafana stack.

Here's output in Prometheus:

But no output in Grafana:

Are there some logs or setup I can get you to help figure this out?

sarg3nt commented 1 year ago

@codebien We got it fixed (mostly). Our tech that manages our Prometheus / Grafana deployment has found there is a disconnect happening between Grafana and Thanos where Thanos was not returning data for k6_http_req_duration_seconds. He's looking into that now but in the meantime gave us a datasource directly to Prometheus. I can now see the P95 response times as expected using https://github.com/grafana/xk6-output-prometheus-remote/pull/113
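Because the cause turned out to be the path between Grafana and Thanos rather than the dashboard, it can help to ask each backend for the metric directly and compare. This is a rough sketch, assuming both endpoints expose the Prometheus-compatible /api/v1/query HTTP API. PROM_URL and THANOS_URL are placeholder variables, and `count_series` is a hypothetical helper that counts "metric" keys as a crude stand-in for a real JSON parser.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Count series in a Prometheus-style instant-query JSON response read from stdin.
# Heuristic: one "metric" key per series; "|| true" covers the zero-match case.
count_series() {
  { grep -o '"metric"' || true; } | wc -l | tr -d '[:space:]'
}

# Placeholder endpoints (assumptions); nothing is sent unless both are set.
if [[ -n "${PROM_URL:-}" && -n "${THANOS_URL:-}" ]]; then
  for base in "${PROM_URL}" "${THANOS_URL}"; do
    n="$(curl -sG "${base}/api/v1/query" \
      --data-urlencode 'query=k6_http_req_duration_seconds' | count_series)"
    echo "${base}: ${n} series"
  done
fi
```

A non-zero count from Prometheus next to zero from the other endpoint would point at the query layer rather than at k6 or the dashboard.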

A few oddities to mention:

1. We are seeing two response time graphs now and I think it might be because we have two URLs being called. One for login and the other the actual API. I don't know how to distinguish between the two in the line graph / make them different colors, show the URL, etc, if that is even the issue.
2. Changing the DS_PROMETHEUS drop down does nothing as the datasource seems to get baked into the graphs. This really confused me and made me think it was still broken for a while there.
3. The "Active VUs" Graph line starts at 7 for some reason. The whole left side of the graph looks chopped off.
4. Changing the metrics drop down doesn't seem to do anything. IDK, maybe it does, I'm not seeing any difference.
5. Changing the URL does not seem to do anything.
6. Looks like the Thresholds section is still work in progress, any idea when this will get done or should I try and figure it out on my own?

The above screen cap was from the link I create in the test, when I use the Test List it does basically the same thing but with a wider time span. Graphs are still cut off to the left

jwcastillo commented 1 year ago

I understand the issue with the two response time graphs. I will work on fixing it so that you can distinguish between the two URLs in the line chart.
Regarding the DS_PROMETHEUS, you can hide this variable in the configuration section.
Changing the metrics dropdown affects how percentiles and similar metrics are calculated. It's possible that the results of the test you ran are so similar that you don't notice the difference.
The option to change the URL affects the corresponding URL section in the dashboard.

As for points 3 and 6, as well as the behavior of the graph appearing cut off on the left, we will wait for help from @codebien for a more precise answer. In the meantime, please feel free to communicate any other concerns or issues.

sarg3nt commented 1 year ago

@jwcastillo and @codebien I've got some more oddities to report. The Requests by URL table seems to be reporting incorrect data. Here's the output from k6 for the test in question.

Command line to run this test using 'k6':
K6_PROMETHEUS_RW_TREND_AS_NATIVE_HISTOGRAM=true K6_PROMETHEUS_RW_SERVER_URL=https://<redacted>/api/v1/write k6 run -e LOADTEST_REALM=<redacted> -e LOADTEST_BASE_URL="<redacted>" -e LOADTEST_TYPE=load -e LOADTEST_SUB_URL="api/bulletins/bulletins?serialNumber=" -e LOADTEST_CONTENT_TYPE="application/json; charset=utf-8" -e LOADTEST_EXPECTED_TEXT="2000.13" -e TESTID="Bulletins List 03/29/23 14:15:45" --tag testid="Bulletins List 03/29/23 14:15:45" -o experimental-prometheus-rw ./api.js

          /\      |‾‾| /‾‾/   /‾‾/
     /\  /  \     |  |/  /   /  /
    /  \/    \    |     (   /   ‾‾\
   /          \   |  |\  \ |  (‾)  |
  / __________ \  |__| \__\ \_____/ .io

  execution: local
     script: ./api.js
     output: Prometheus remote write (<redacted>)

  scenarios: (100.00%) 1 scenario, 50 max VUs, 20m30s max duration (incl. graceful stop):
           * default: Up to 50 looping VUs for 20m0s over 3 stages (gracefulRampDown: 30s, gracefulStop: 30s)

INFO[0000] Performaing a load test                       source=console
INFO[1202] Grafana: https://<redacted>/dashboard/d/01npcT44k/test-result?from=1680124530752&to=1680125837607&var-testid=Bulletins%20List%2003%2F29%2F23%2014%3A15%3A45&var-scenario=All&var-url=All&var-metrics=k6_http_req_duration_seconds  source=console

running (20m01.9s), 00/50 VUs, 3189 complete and 0 interrupted iterations
default ✓ [======================================] 00/50 VUs  20m0s

     ✓ Status is 200
     ✓ Has correct content type
     ✓ Has expected value

     █ setup

     █ teardown

   ✓ checks.........................: 100.00% ✓ 9567     ✗ 0
     data_received..................: 1.3 GB  1.0 MB/s
     data_sent......................: 4.6 MB  3.8 kB/s
   ✓ http_req_blocked...............: avg=3.42ms   min=170ns   med=210ns    max=1.2s     p(90)=300ns    p(95)=370ns
     http_req_connecting............: avg=876.45µs min=0s      med=0s       max=1.05s    p(90)=0s       p(95)=0s
   ✗ http_req_duration..............: avg=12.55s   min=99.62ms med=13.33s   max=29.8s    p(90)=22.02s   p(95)=23.6s
       { expected_response:true }...: avg=12.55s   min=99.62ms med=13.33s   max=29.8s    p(90)=22.02s   p(95)=23.6s
   ✓ http_req_failed................: 0.00%   ✓ 0        ✗ 3360
     http_req_receiving.............: avg=371.11ms min=39.57µs med=297.42ms max=4.81s    p(90)=525.64ms p(95)=718.89ms
     http_req_sending...............: avg=59.5µs   min=24.37µs med=58.43µs  max=184.42µs p(90)=75.08µs  p(95)=81.05µs
     http_req_tls_handshaking.......: avg=1.92ms   min=0s      med=0s       max=1.12s    p(90)=0s       p(95)=0s
     http_req_waiting...............: avg=12.18s   min=96.49ms med=12.95s   max=29.46s   p(90)=21.66s   p(95)=23.11s
     http_reqs......................: 3360    2.795675/s
     iteration_duration.............: avg=14.22s   min=67.65µs med=15.15s   max=30.8s    p(90)=23.15s   p(95)=24.69s
     iterations.....................: 3189    2.653395/s
     vus............................: 1       min=1      max=50
     vus_max........................: 50      min=50     max=50

ERRO[1202] some thresholds have failed

However, the "Requests by URL" table reports the following:

The top URL is the primary, with the second being auth; sorry for the redacted info. None of the other times match on the main graph and P95 output either. The k6 output says P95 was 23.6 whereas Grafana says 26.3.

The above screenshots are from the link I generate in the k6 output; if I use the link in the "Test List" I get the same results, just zoomed out somewhat on the graph.

notxcain commented 1 year ago

Hello! Glad I've found this thread. Do you know why the Active VU is shorter than request rate and response time?

notxcain commented 1 year ago

Hey @jwcastillo, sorry for nudging you. Please tell me if there's a better place to ask questions.

notxcain commented 1 year ago

Okay. I think I got it. It corresponds to the range I use in the rate function.

ppcano commented 1 year ago

Hey there! 🌟 Good news! We have updated the previous dashboard and created a new dashboard that supports the option without histogram metrics. You can check them out in the dashboard directory or with the docker-compose example of this repo:

Both dashboards share the same design. The only differences are in the PromQL queries and Trend Metric Query variable.

I encourage you to start using them. If you stumble on any issues or have any suggestions—let us know! Gonna close this issue now since it’s about the previous version.

grafana / xk6-output-prometheus-remote

Can't get example dashboards to fully function #109