litmuschaos / chaos-exporter

Prometheus Exporter for Litmus Chaos Metrics

Litmus exporter not exporting metrics #141

Open anuraagrijal3138 opened 1 year ago

anuraagrijal3138 commented 1 year ago

BUG REPORT

What happened: I am using the litmus-exporter image litmuschaos/chaos-exporter:3.0.0-beta5. When running a chaos experiment in GKE, there are gaps in the metrics sent by the exporter: sometimes the metrics are exported, while other times the exporter does not pick up the chaos run at all.

Logs not showing any metrics after a run:
```
% kl litmus chaos-monitor-788f87f99-4vqdl
W0726 11:39:16.209768 1 client_config.go:615] Neither --kubeconfig nor --master was specified. Using the inClusterConfig. This might not work.
time="2023-07-26T11:39:16Z" level=info msg="Started creating Metrics"
time="2023-07-26T11:39:16Z" level=info msg="Registering Fixed Metrics"
time="2023-07-26T11:39:16Z" level=info msg="Beginning to serve on port :8080"
time="2023-07-26T11:41:26Z" level=info msg="[Wait]: Hold on, no active chaosengine found ... "
```
Logs showing metrics after a run:
```
time="2023-07-26T11:39:20Z" level=info msg="Beginning to serve on port :8080"
time="2023-07-26T11:39:20Z" level=info msg="Started creating Metrics"
time="2023-07-26T11:39:20Z" level=info msg="Registering Fixed Metrics"
time="2023-07-26T11:41:51Z" level=info msg="[Wait]: Hold on, no active chaosengine found ... "
time="2023-07-26T12:20:39Z" level=info msg="The chaos metrics are as follows" ProbeSuccessPercentage=0 EndTime=0 FaultName=pod-delete PassedExperiments=0 FailedExperiments=0 AwaitedExperiments=1 StartTime=1.69037394e+09 ChaosInjectTime=1690374019 TotalDuration=0 ResultName=ambassador-pod-delete-1690373933-pod-delete ResultNamespace=litmus ResultVerdict=Awaited
```
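As a side note, a minimal spot check to see whether the exporter is currently exposing any chaos metrics (a sketch: the pod name and `:8080` port come from the logs above, while the `litmuschaos` grep prefix is an assumption about the metric naming):

```sh
# Port-forward the exporter pod and scrape its /metrics endpoint directly.
# Pod name and namespace are taken from the logs above.
kubectl -n litmus port-forward pod/chaos-monitor-788f87f99-4vqdl 8080:8080 &

# Filter for the chaos metrics; an empty result while an experiment is
# running matches the "no metrics" case described here.
# (The "litmuschaos" prefix is an assumption about the metric names.)
curl -s http://localhost:8080/metrics | grep -i litmuschaos

# Stop the port-forward when done.
kill %1
```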
The run is successful in both cases:
```
% kl litmus pod-delete-xxq0f6-fpvqg
...
time="2023-07-26T11:52:07Z" level=info msg="[Status]: The status of Pods are as follows" Pod=pay-dummy-dev-676577958d-jjlnh Status=Running
time="2023-07-26T11:52:11Z" level=info msg="[Probe]: check-http-probe-success probe has been Passed 😄 " ProbeName=check-http-probe-success ProbeType=httpProbe ProbeInstance=PostChaos ProbeStatus=Passed
time="2023-07-26T11:52:11Z" level=info msg="[The End]: Updating the chaos result of pod-delete experiment (EOT)"
```
The chaosResult is present as well:
```
% kubectl describe -n litmus chaosresult {service-name}-pod-delete-1690372234-pod-delete
...
Events:
  Type    Reason   Age  From                     Message
  ----    ------   ---  ----                     -------
  Normal  Awaited  47m  pod-delete-xxq0f6-fpvqg  experiment: pod-delete, Result: Awaited
  Normal  Pass     46m  pod-delete-xxq0f6-fpvqg  experiment: pod-delete, Result: Pass
```
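To compare against what the exporter reports, a minimal sketch for enumerating the ChaosResults and their verdicts (the `.status.experimentStatus.verdict` field path is an assumption based on our ChaosResult objects):

```sh
# List every ChaosResult in the litmus namespace with its verdict, to
# compare against the results the exporter actually exposes.
# (The verdict field path is an assumption about the ChaosResult schema.)
kubectl -n litmus get chaosresults \
  -o custom-columns=NAME:.metadata.name,VERDICT:.status.experimentStatus.verdict
```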

What you expected to happen: Metrics from all the chaos engines are exported.

How to reproduce it (as minimally and precisely as possible): It happens often. We have 4 GKE clusters where we run the experiments as schedules.

YAML manifest used to create the experiment:

```yaml
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosSchedule
metadata:
  namespace: litmus
  name: "{service name}-pod-delete"
  labels:
    app: {service name}-chaos
spec:
  schedule:
    now: true
  engineTemplateSpec:
    appinfo:
      appns: '{namespace}'
      applabel: 'app.kubernetes.io/instance={serviceName}'
      appkind: 'deployment'
    engineState: 'active'
    chaosServiceAccount: litmus-runner
    jobCleanUpPolicy: "delete"
    experiments:
      - name: pod-delete
        spec:
          components:
            env:
              - name: TOTAL_CHAOS_DURATION
                value: '60'
              - name: CHAOS_INTERVAL
                value: '10'
              - name: FORCE
                value: 'true'
              - name: PODS_AFFECTED_PERC
                value: '70'
          probe:
            - name: 'check-http-probe-success'
              type: 'httpProbe'
              httpProbe/inputs:
                url: "http://{servicename.namespace}.svc.cluster.local/"
                insecureSkipVerify: true
                responseTimeout: 1000
                method:
                  get:
                    criteria: "=="
                    responseCode: '200'
              mode: "Continuous"
              runProperties:
                probeTimeout: 10000
                interval: 5000
                retry: 2
                probePollingInterval: 5000
```

Anything else we need to know?:
- We usually schedule on repeat. To debug the issue, we are running the schedule with `now: true`.
- We have installed litmus-core and kubernetes-chaos version 2.14.0.
- For the chaos-exporter deployment, we are using image version 3.0.0-beta5, mainly to get the `fault_name` label in the metrics (a quick check of the running image is sketched below).
- The resources (memory/CPU) for the litmus-exporter pod are adequate; less than 50% of the requests are being used.
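For completeness, a minimal way to confirm which exporter image is actually running (the `chaos-monitor` deployment name is an assumption inferred from the pod name above):

```sh
# Print the image of the chaos-exporter deployment.
# The deployment name "chaos-monitor" is inferred from the pod name
# chaos-monitor-788f87f99-4vqdl above; adjust if it differs in your setup.
kubectl -n litmus get deploy chaos-monitor \
  -o jsonpath='{.spec.template.spec.containers[*].image}{"\n"}'
```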

Thank you!