Thank you for this really detailed issue @cpitstick-argo !! We shall try this setup & get back.
Start: changes(heptio_eventrouter_normal_total{involved_object_name="chaos-engine-pod-delete",reason="ExperimentJobCreate", involved_object_namespace="hello-world-alpha"}[1m])
End: changes(heptio_eventrouter_normal_total{involved_object_name="chaos-engine-pod-delete",reason="ExperimentJobCleanUp", involved_object_namespace="hello-world-alpha"}[1m])
One other panel I neglected to show was this:
{
  "datasource": "[datasource]",
  "aliasColors": {},
  "bars": false,
  "dashLength": 10,
  "dashes": false,
  "fill": 1,
  "fillGradient": 0,
  "gridPos": {
    "h": 6,
    "w": 24,
    "x": 0,
    "y": 6
  },
  "hiddenSeries": false,
  "id": 5,
  "legend": {
    "avg": false,
    "current": false,
    "max": false,
    "min": false,
    "show": true,
    "total": false,
    "values": false
  },
  "lines": true,
  "linewidth": 1,
  "nullPointMode": "null",
  "options": {
    "dataLinks": []
  },
  "percentage": false,
  "pointradius": 2,
  "points": false,
  "renderer": "flot",
  "seriesOverrides": [],
  "spaceLength": 10,
  "stack": false,
  "steppedLine": false,
  "targets": [
    {
      "expr": "changes(heptio_eventrouter_normal_total{involved_object_name=\"chaos-engine-pod-delete\",reason=\"ExperimentJobCreate\", involved_object_namespace=\"hello-world-alpha\"}[1m])",
      "instant": false,
      "legendFormat": "Start",
      "refId": "A"
    },
    {
      "expr": "changes(heptio_eventrouter_normal_total{involved_object_name=\"chaos-engine-pod-delete\",reason=\"PostChaosCheck\", involved_object_namespace=\"hello-world-alpha\"}[1m])",
      "instant": false,
      "legendFormat": "Stop",
      "refId": "B"
    }
  ],
  "thresholds": [],
  "timeFrom": null,
  "timeRegions": [],
  "timeShift": null,
  "title": "Hello Alpha Chaos Start/Stop (Pod Delete)",
  "tooltip": {
    "shared": true,
    "sort": 0,
    "value_type": "individual"
  },
  "type": "graph",
  "xaxis": {
    "buckets": null,
    "mode": "time",
    "name": null,
    "show": true,
    "values": []
  },
  "yaxes": [
    {
      "format": "short",
      "label": null,
      "logBase": 1,
      "max": null,
      "min": null,
      "show": true
    },
    {
      "format": "short",
      "label": null,
      "logBase": 1,
      "max": null,
      "min": null,
      "show": true
    }
  ],
  "yaxis": {
    "align": false,
    "alignLevel": null
  }
}
Thanks for the info @cpitstick-argo ! Just to summarize the discussion we had:
1. Updating the query to use the reasons ExperimentJobCreate (start) & PostChaosCheck (stop), with the engine as the event source, helps stabilize the visualization. This is done to remove the variables of experiment-pod creation time, app status checks, and app recovery checks from the picture.
2. We will set spec.concurrencyPolicy to Forbid in the CronJob (this will also be the default in the upcoming schedule CR); see the sketch below.
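As a minimal sketch of point 2 (the name, schedule, and container here are hypothetical placeholders, not taken from this issue), the Forbid policy sits directly in the CronJob spec:

# Hypothetical sketch: a chaos-triggering CronJob with concurrencyPolicy set
# to Forbid, so a new run is skipped while a previous one is still active.
apiVersion: batch/v1beta1
kind: CronJob
metadata:
  name: chaos-engine-pod-delete          # placeholder name
  namespace: hello-world-alpha
spec:
  schedule: "*/10 * * * *"               # placeholder cadence
  concurrencyPolicy: Forbid              # the setting discussed above
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: Never
          containers:
            - name: trigger-chaos
              image: bitnami/kubectl:latest   # placeholder; anything that applies the ChaosEngine CR
              command: ["kubectl", "apply", "-f", "https://example.com/chaosengine.yaml"]   # placeholder URL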
Having said that, there are a few things that need a deeper look:
1. The pod deletion events from the kubelet for the runner & experiment pods are not consistent. We need to see if there is something we can do in our deletion API call (note that the pod-kill performed via kubectl delete by the chaoslib always adds it) or if there is some Kubernetes quirk behind this.
2. There is a rare case where the dashboard shows two events with the same reason (ChaosInject) for the same timestamp. We need to check whether an Ansible task is responsible for this or it is system/infra induced.
3. We will work towards obtaining similar graphs via exporter metrics rather than deriving them from events. A first cut will be available in 1.4.
@rajdas98 - it would be useful to have your inputs recorded here, based on the recent chaos Grafana dashboard you wrote.
I'm now getting an almost perfect cadence, except for weird blips that appear every ~3 runs of the chaos experiment. These blips aren't just anomalous ExperimentJobCreate and PostChaosCheck events. Instead, the experiment seems to be running long after everything was terminated, as these blips occur in the "dead time" between runs of the experiment.
Here's the image:
I'm almost certain I have now figured out the "ghost metrics." It's not a problem with Litmus.
It's an issue with the Prometheus integration of eventrouter. I dug into the logs from Litmus as well as eventrouter, and eventrouter is (re)processing event updates that happen on a regular cadence, but the event it exports on each "update" is identical to the original event. It's a ghost because it is literally an identical repeat of a metric that came before.
Containership maintained a fork of eventrouter once upon a time, and they actually fixed this issue in their fork:
https://github.com/containership/eventrouter/commit/f62fe77a43bf06fd846acb73ee4d473186a8ef5b
I haven't been able to test this yet; I'll get to that next week. Still, I'm very confident that this is one of the main reasons I saw these issues in the first place.
Several possible actions stem from this:
Great! Thanks for the update @cpitstick-argo ! Option (3) is what we are working towards in the near term. I suppose the next step (we should share this shortly) is an identical set of dashboards built on chaos exporter metrics. Having said that, we are finding the event routers to be pretty nifty with respect to how we treat & persist events, so we will circle around on that at some point.
I just finished testing this and can confirm it worked. I applied the patch to a local fork of the eventrouter and used it. No more ghost metrics!
I was also able to go back to using ChaosEngineInitialized and ChaosEngineCompleted as the start/stop markers:
{
  "datasource": "[Source]",
  "aliasColors": {},
  "bars": false,
  "dashLength": 10,
  "dashes": false,
  "fill": 1,
  "fillGradient": 0,
  "gridPos": {
    "h": 6,
    "w": 24,
    "x": 0,
    "y": 0
  },
  "hiddenSeries": false,
  "id": 8,
  "legend": {
    "avg": false,
    "current": false,
    "max": false,
    "min": false,
    "show": true,
    "total": false,
    "values": false
  },
  "lines": true,
  "linewidth": 1,
  "nullPointMode": "null",
  "options": {
    "dataLinks": []
  },
  "percentage": false,
  "pointradius": 2,
  "points": false,
  "renderer": "flot",
  "seriesOverrides": [
    {
      "alias": "Service Availability",
      "color": "#F2CC0C"
    },
    {
      "alias": "Chaos Start",
      "bars": true,
      "color": "#56A64B"
    },
    {
      "alias": "Chaos End",
      "bars": true,
      "color": "#E02F44"
    },
    {
      "alias": "Chaos Inject (Pod Delete)",
      "color": "#3274D9"
    }
  ],
  "spaceLength": 10,
  "stack": false,
  "steppedLine": false,
  "targets": [
    {
      "expr": "{__name__=\"probe_success\",job=\"hello-world-alpha-prometheus-blackbox-exporter\"}",
      "legendFormat": "Service Availability",
      "refId": "A"
    },
    {
      "expr": "changes(heptio_eventrouter_normal_total{reason=\"ChaosInject\", involved_object_name=\"chaos-engine-pod-delete\", involved_object_namespace=\"hello-world-alpha\", involved_object_kind=\"ChaosEngine\"}[1m])",
      "format": "time_series",
      "legendFormat": "Chaos Inject (Pod Delete)",
      "refId": "B"
    },
    {
      "expr": "(changes(heptio_eventrouter_normal_total{involved_object_name=\"chaos-engine-pod-delete\",reason=\"ChaosEngineInitialized\", involved_object_namespace=\"hello-world-alpha\"}[30s]) > 0) * 4",
      "legendFormat": "Chaos Start",
      "refId": "C"
    },
    {
      "expr": "(changes(heptio_eventrouter_normal_total{involved_object_name=\"chaos-engine-pod-delete\",reason=\"ChaosEngineCompleted\", involved_object_namespace=\"hello-world-alpha\"}[30s]) > 0) * 4",
      "legendFormat": "Chaos End",
      "refId": "D"
    }
  ],
  "thresholds": [],
  "timeFrom": null,
  "timeRegions": [],
  "timeShift": null,
  "title": "Hello World Alpha Chaos (Pod Delete)",
  "tooltip": {
    "shared": true,
    "sort": 0,
    "value_type": "individual"
  },
  "type": "graph",
  "xaxis": {
    "buckets": null,
    "mode": "time",
    "name": null,
    "show": true,
    "values": []
  },
  "yaxes": [
    {
      "format": "short",
      "label": "Frequency",
      "logBase": 1,
      "max": 4,
      "min": 0,
      "show": true
    },
    {
      "format": "short",
      "label": null,
      "logBase": 1,
      "max": 4,
      "min": 0,
      "show": true
    }
  ],
  "yaxis": {
    "align": false,
    "alignLevel": null
  }
}
I've thought a little more about this, and I think that Litmus's strategy of exporting notifications as events is a sound one. Just take a look at all the places this exporter (https://github.com/opsgenie/kubernetes-event-exporter) can send events to, with many more planned. Litmus cannot and should not try to solve that problem; it's huge and never-ending.
Instead, I think the way things are is better: you export events (with tunable verbosity and maybe some other knobs and dials), and then provide links to supported companion software like opsgenie's exporter or eventrouter that will send the events to the places where they're needed for solid monitoring.
Maybe a default Prometheus exporter is fine, or a limited set of default exporters, as a way to show that monitoring is possible. Beyond that, the Litmus framework shouldn't bite off more than it can chew here.
@cpitstick-argo hi! I'm trying to do the same thing as you. Could you clarify how you managed to get Kubernetes events showing in Grafana?
What else do you need?
I've achieved approximately the same result as you.
Going to take a look!
This issue was originally created to track inconsistencies in the events indicating the start/stop of chaos, which were eventually fixed via a stable fork of the original integration (the heptio eventrouter). Refer to https://github.com/litmuschaos/litmus/issues/1472#issuecomment-619462834.
The current thought process is to continue building in more events and experiment metrics, enabling existing tools in the ecosystem to consume them for custom visualization. Litmus also plans to provide more dashboard examples for standard/demo applications, which the community can modify to suit their own needs while serving as examples.
Closing this issue based on the above.
What happened: I'm trying to create a Grafana dashboard such that Litmus Kubernetes events are exported to Prometheus using heptio eventrouter (manifest & service monitor below), monitoring an example hello-world service (manifest below). The idea here is to get a very clear dashboard that shows when chaos runs and the impact on the availability of the service.
This generally works by watching ChaosInject, but I have not been able to see a stable "start/stop" indicator of the chaos experiment. Right now I'm using these promqls as annotations on the Grafana panel:
Start:
changes(heptio_eventrouter_normal_total{involved_object_name="chaos-engine-pod-delete-runner",reason="Started", involved_object_namespace="hello-world-alpha"}[1m])
End:
changes(heptio_eventrouter_normal_total{involved_object_name="chaos-engine-pod-delete-runner",reason="Killing", involved_object_namespace="hello-world-alpha"}[1m])
The Started event seems to spawn pretty consistently, but the end event (Killing) appears inconsistently and intermittently.
What you expected to happen:
Ideally, here is what I would want to see (this is an example of when it was working):
Green lines (annotations) are chaos start, red lines (annotations) are chaos stop, green time-series metric is the blackbox monitor ping, and yellow is the ChaosInject. You can see that while chaos is running, the hello world service is down, but as soon as chaos stops, it comes back until such time as the CronJob invokes it again.
Of note: ChaosEngineInitialized, ChaosEngineCompleted, and ChaosEngineStopped seem to be far less reliable in this regard; the correlation with the ability to create a graph such as the above is much weaker. I do not understand what these metrics mean or how they are supposed to be used.
How to reproduce it (as minimally and precisely as possible):
I set this up on an EKS Kubernetes cluster running 1.15.10 that has the Prometheus Operator (5.2.0) exporting to Grafana (6.5.2).
Commands used to set this up. Install the blackbox exporter:
helm install prometheus-blackbox-exporter stable/prometheus-blackbox-exporter
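To produce the probe_success series used in the panels above, the blackbox exporter then needs a scrape configuration. A minimal sketch using the Prometheus Operator's ServiceMonitor pattern follows; the names, labels, and target URL are illustrative assumptions, not the exact "Blackbox Monitor Manifest" attached below:

# Hypothetical sketch: have Prometheus scrape the blackbox exporter's /probe
# endpoint with a fixed target, yielding the probe_success series.
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: hello-world-alpha-prometheus-blackbox-exporter   # illustrative name
  labels:
    release: prometheus          # must match your Prometheus serviceMonitorSelector
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: prometheus-blackbox-exporter
  endpoints:
    - port: http                 # the chart's service port name; verify for your chart version
      path: /probe
      interval: 15s
      params:
        module: [http_2xx]
        target: ["http://hello-world.hello-world-alpha.svc.cluster.local"]   # hypothetical target service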
Grafana Panel JSON:
Relevant Kubernetes manifests:
Service under test:
Pod Delete CronJob manifest:
Blackbox Monitor Manifest:
Eventrouter manifest:
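For anyone reproducing this, a rough sketch of the eventrouter piece follows; the image tag, namespace, and RBAC are assumptions to verify against the eventrouter repo, and you would swap in the patched fork discussed above to avoid the ghost metrics:

# Hypothetical sketch: run eventrouter to watch cluster events and expose
# them as Prometheus counters (heptio_eventrouter_normal_total, etc.).
apiVersion: v1
kind: ConfigMap
metadata:
  name: eventrouter-cm
  namespace: monitoring          # placeholder namespace
data:
  config.json: |
    {"sink": "glog"}
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: eventrouter
  namespace: monitoring
spec:
  replicas: 1
  selector:
    matchLabels:
      app: eventrouter
  template:
    metadata:
      labels:
        app: eventrouter
    spec:
      serviceAccountName: eventrouter    # needs RBAC to get/watch/list events cluster-wide
      containers:
        - name: eventrouter
          image: gcr.io/heptio-images/eventrouter:v0.3   # assumption; use the patched fork instead
          imagePullPolicy: IfNotPresent
          volumeMounts:
            - name: config-volume
              mountPath: /etc/eventrouter
      volumes:
        - name: config-volume
          configMap:
            name: eventrouter-cm

Once eventrouter is scraped by Prometheus (via a Service plus a ServiceMonitor analogous to the blackbox one above), the heptio_eventrouter_normal_total counters queried throughout this thread become available.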