runwhen-contrib / rw-public-codecollection

RunWhen Public Codecollection Repository - Open Source troubleshooting runbook library for Kubernetes and cloud infrastructure components.
Apache License 2.0

RunWhen Public Codecollection

This repository is the primary public codecollection for use within the RunWhen platform. It contains codebundles that can be used in SLIs, SLOs, and TaskSets.

Please see the contributing guide and the code of conduct for details on adding your contributions to this project.

Documentation for each codebundle is maintained in the README.md alongside the robot code and is published at https://docs.runwhen.com/public/v/codebundles/. Please see the readme howto for details on crafting a codebundle readme that can be indexed.

Codebundle Index

Name Supported Integrations Tasks Documentation
Kubernetes Namespace Healthcheck Kubernetes, AKS, EKS, GKE, OpenShift Get Event Count and Score, Get Container Restarts and Score, Get NotReady Pods, Generate Namespace Score This SLI uses kubectl to score namespace health. Produces a value between 0 (completely failing the test) and 1 (fully passing the test). Looks for container restarts, events, and pods not ready. Docs
Kubernetes Namespace Troubleshoot Kubernetes, AKS, EKS, GKE, OpenShift Trace Namespace Errors, Fetch Unready Pods, Triage Namespace, Object Condition Check, Namespace Get All This taskset runs general troubleshooting checks against all applicable objects in a namespace, checks error events, and searches pod logs for error entries. Docs
Kubernetes Run Shell Command Kubernetes, AKS, EKS, GKE, OpenShift Running Kubectl And Adding Stdout To Report This codebundle runs an arbitrary kubectl command and writes the stdout to a report. Typically used in conjunction with other codebundles. Docs
Kubernetes Synthetic PVC Test Kubernetes, AKS, EKS, GKE, OpenShift Run Canary Job Creates an adhoc one-shot job which mounts a PVC as a canary test, which is polled for success before being torn down. Docs
Kubernetes Workload Metric Kubernetes, AKS, EKS, GKE, OpenShift Running Kubectl get and push the metric This codebundle runs a kubectl get command that produces a value and pushes the metric. Uses jmespath for filtering and allows calculations such as count, sum, avg on specified fields. Docs
argocd-healthcheck-sli argocd ArgoCD Health Check Check the health of the ArgoCD platform by checking the availability of its underlying Deployments and StatefulSets. Docs
artifactory-ok-sli artifactory Check If Artifactory Endpoint Is Healthy Checks an Artifactory instance health endpoint to determine its operational status. The response is parsed to determine if the service is healthy, resulting in a metric of 1 if it is, or 0 if not. Docs
aws-account-limit-sli aws Get Count Of AWS Accounts In Organization Retrieve the count of all AWS accounts in an organization. Docs
aws-account-limit-taskset aws, iam Get The Recently Created AWS Accounts Retrieve all recently created AWS accounts. Docs
aws-billing-costsacrosstags-taskset aws, billing, costexplorer Get All Billing Sliced By Tags Creates a report of AWS line item costs filtered to a list of tagged resources Docs
aws-billing-tagcosts-sli aws, billing, costexplorer Get All Billing Sliced By Tags Monitors AWS cost and usage data for the latest billing period. Accepts one tag for continuous monitoring. Docs
aws-cloudformation-stackevents-count-sli aws, cloudformation Fetch CloudFormation Stack Events Retrieve the number of detected AWS CloudFormation stack events over a given history Docs
aws-cloudformation-triage-taskset aws, cloudformation Get All Recent Stack Events Triage and troubleshoot various issues with AWS CloudFormation Docs
aws-cloudwatch-logquery-rowcount-zeroerror-sli aws, cloudwatch Running CloudWatch Log Query And Pushing 1 If No Results Found Retrieve binary result from an AWS CloudWatch Insights query. Pushes 0 (success) if logs are found (activity) or 1 if no logs were found in the time window. Docs
aws-cloudwatch-logquery-sli aws, cloudwatch Running CloudWatch Log Query And Pushing The Count Of Results Retrieve number of results from an AWS CloudWatch Insights query. Docs
aws-cloudwatch-metricquery-dashboard-taskset aws, cloudwatch Get CloudWatch MetricQuery Insights URL Creates a URL to an AWS CloudWatch metrics dashboard with a running query. Docs
aws-cloudwatch-metricquery-sli aws, cloudwatch Running CloudWatch Metric Query And Pushing The Result Retrieve the result of an AWS CloudWatch Metrics Insights query. Docs
aws-cloudwatch-tagmetricquery-sli aws, cloudwatch Run CloudWatch Metric Query Across Set Of IDs And Push Metric Retrieve aggregate results from multiple AWS CloudWatch Metrics Insights queries run against tagged resources. This codebundle fetches a list of instance IDs filtered by tags, uses them to run a set of AWS metric queries against the CloudWatch Metrics Insights API, and pushes an aggregated/transformed value provided by the API as a metric. Docs
aws-ec2-securitycheck-taskset aws, ec2, cloudwatch Check For Untagged instances, Check For Dangling Volumes, Check For Open Routes, Check For Overused Instances, Check For Underused Instances, Check For Underused Volumes, Check For Overused Volumes Performs a suite of security checks against a set of AWS EC2 instances. Checks include untagged instances, dangling volumes, open routes. Docs
aws-s3-stalecheck-taskset aws, s3, bucket Create Report For Stale Buckets Identify stale AWS S3 buckets, based on last modified object timestamp. Docs
aws-vm-triage-taskset aws, ec2, cloudwatch Get Max VM CPU Utilization In Last 3 Hours, Get Lowest VM CPU Credits In Last 3 Hours, Get Max VM CPU Credit Usage In Last 3 hours, Get Max VM Memory Utilization In Last 3 Hours, Get Max VM Volume Usage In Last 3 Hours Triage and troubleshoot performance and usage of an AWS EC2 instance Docs
cert-manager-expirations-sli cert Inspect Certification Expiration Dates Retrieve the number of TLS certificates managed by cert-manager that expire within a given window. The metric pushed is the number of certs within the configured expiration window. Docs
cert-manager-healthcheck-sli cert Health Check cert-manager Pods Check the health of pods deployed by cert-manager. Docs
curl-generic-sli curl Run Curl Command and Push Metric A curl SLI for querying and extracting data from a generic curl call. Supports jq. Should produce a single metric. Docs
curl-generic-taskset curl Run Curl Command and Add to Report A curl TaskSet for querying and extracting data from a generic curl call. Supports jq. Adds results to the report. Docs
datadog-metricquery-sli datadog Query Datadog Metrics Fetch the results of a datadog metric timeseries and push the extracted value as an SLI metric. Docs
datadog-system-load-sli datadog Check Datadog System Load Retrieve a Datadog instance's "System Load" metric. Docs
discord-sendmessage-taskset discord Send Chat Message Sends a static Discord message via webhook. Contains optional configuration for including runsession info. Docs
dns-latency-sli dns Check DNS latency for Google Resolver Check DNS latency for Google Resolver. Docs
elasticsearch-health-sli elasticsearch Check Elasticsearch Cluster Health Check Elasticsearch cluster health Docs
gcp-gcloudcli-generic-sli gcp Run Gcloud CLI Command and Push metric Run arbitrary gcloud commands and parse their output for arbitrary values such as json to be submitted as a metric. Docs
gcp-gcloudcli-generic-taskset gcp Run Gcloud CLI Command and Push metric Run arbitrary gcloud commands and capture the stdout in a report. Docs
gcp-opssuite-logquery-dashboard-taskset gcp Get GCP Log Dashboard URL For Given Log Query Generate a link to the GCP Log Explorer. Docs
gcp-opssuite-logquery-sli gcp Running GCE Logging Query And Pushing Result Count Metric Retrieve the number of results of a GCP Log Explorer query. Docs
gcp-opssuite-metricquery-sli gcp Running GCP OpsSuite Metric Query Performs a metric query using a Google MQL statement on the Ops Suite API and pushes the result as an SLI metric. Docs
gcp-opssuite-promql-sli gcp Run Prometheus Instant Query Against Google Prom API Endpoint Performs a metric query using a PromQL statement on the Ops Suite API and pushes the result as an SLI metric. Docs
gcp-serviceshealth-sli gcp Get Number of GCP Incidents Affecting My Workspace This codebundle sets up a monitor for a specific region and GCP Product, which is then periodically checked for ongoing incidents based on the history available at https://status.cloud.google.com/incidents.json filtered based on severity level. Docs
github-actions-workflowtiming-sli github Get Average Run Time For Workflow Monitors the average timing of a GitHub Actions workflow file within a repo and returns the average runtime in minutes. Docs
github-get-repos-latency-sli github Check GitHub Latency With Get Repos Check GitHub latency by getting a list of repo names. Docs
github-get-repos-latency-taskset github Check Latency When Creating a New GitHub Issue Create a new issue in GitHub Issues. Docs
github-status-components-sli github Get Availability of GitHub or Individual GitHub Components Check status of the GitHub platform (https://www.githubstatus.com/) for a specified set of GitHub service components. The metric supplied is an aggregated percentage indicating the availability of the components with 1 = 100% available. Docs
github-status-incidents-sli github Get Number of Incidents Affecting GitHub Checks for unresolved incidents related to GitHub services and provides a count of ongoing incidents as a metric. Docs
github-status-maintenances-sli github Get Scheduled and Active GitHub Maintenance Windows Retrieve the number of upcoming GitHub platform maintenances over a given window. Docs
gitlab-availability-sli gitlab Check GitLab Server Status Check availability of a GitLab server. Docs
gitlab-availability-taskset gitlab Check GitLab Server Status Troubleshoot issues with GitLab server availability. Docs
gitlab-get-repos-latency-sli gitlab Check GitLab Latency With Get Repos Check GitLab latency by getting a list of repo names. Docs
googlechat-sendmessage-taskset googlechat Send Chat Message Sends a static Google Chat message via webhook. Contains optional configuration for including runsession info. Docs
grafana-health-sli grafana Check Grafana Server Health Check Grafana server health. Docs
grpc-grpcurl-unary-sli grpc Run gRPCurl Command and Push Metric A gRPC curl SLI for querying and extracting data from a generic grpcurl call. Docs
grpc-grpcurl-unary-taskset grpc Run gRPCurl Command and Show Output A gRPC curl taskset for querying data from a generic grpcurl call and presenting the output. Docs
hello-world-taskset hello Hello World, Add One String To Report, Add Form Values To Report Basic Hello-World TaskSet Docs
http-latency-sli http Check HTTP Latency to Well Known URL Measure HTTP latency against a given URL. The returned metric is the number of seconds the request took as a float value. Docs
http-ok-sli http Checking HTTP URL Is Available And Timely Check if an HTTP request against a URL fails or exceeds a given latency window. A return of 1 is considered a success, while a 0 is a failure. Docs
jira-search-issues-latency-sli jira Search Jira Issues By Current User Check Jira latency when searching issues by current user. Docs
jira-search-issues-latency-taskset jira Create a new Jira Issue Create an issue in Jira. Docs
k8s-cortexmetrics-ingestor-health-sli k8s Determine Cortex Ingester Ring Health Uses kubectl to query the state of an ingester ring and determine if it's healthy. Returns 1 if healthy, 0 if unhealthy. Docs
k8s-cortexmetrics-ingestor-health-taskset k8s Fetch Ingestor Ring Member List and Status Uses kubectl to query the state of an ingester ring. Returns the JSON of ingester ID, status, and timestamp. Docs
k8s-daemonset-healthcheck-sli k8s Health Check Daemonset Checks that the current state of a daemonset is healthy and returns a score of either 1 (healthy) or 0 (unhealthy). Docs
k8s-decommission-workloads-taskset k8s Generate Decommission Commands Searches a namespace for matching objects and provides the commands to decommission them. Docs
k8s-kubectl-apiserverhealth-sli k8s Running Kubectl Check Against API Server Check the health of a Kubernetes API server using kubectl. Returns 1 when OK, or a 0 in the case of an unhealthy API server. Docs
k8s-kubectl-eventquery-sli k8s Get Number Of Matching Events Returns the number of events with matching messages as an SLI metric. Docs
k8s-kubectl-sanitycheck-taskset k8s Check Kubeconfig Secret Exists, Test Generic Shell Service Connectivity, Check Kubectl contexts, Test Command Chains, Test Kubectl Get Pods Used for troubleshooting the shellservice-based kubectl service Docs
k8s-kubectl-top-sli k8s Running Kubectl Top And Extracting Metric Data Retrieve aggregate data via the kubectl top command. Docs
k8s-patroni-healthcheck-sli k8s Determine Patroni Health Uses kubectl (or equivalent) to query the state of a patroni cluster and determine if it's healthy. Docs
k8s-patroni-lag-sli k8s Measure Patroni Member Lag Measures the maximum replica lag across a Patroni cluster. Docs
k8s-patroni-lag-taskset k8s Determine Patroni Health Detects and reinitializes laggy Patroni cluster members which are unable to catch up in replication using kubectl and patronictl. Docs
k8s-postgres-query-sli k8s Run Postgres Query And Return Result As Metric Runs a postgres SQL query and pushes the returned query result as an SLI metric. During execution, the SQL query should be passed to a Kubernetes workload that has access to the psql binary. The workload will run the query and return the result from stdout. Docs
k8s-postgres-query-taskset k8s Run Postgres Query And Results to Report Runs a postgres SQL query and pushes the returned result into a report. During execution, the SQL query should be passed to a Kubernetes workload that has access to the psql binary. The workload will run the query and return the results from stdout. Docs
k8s-postgres-triage-taskset k8s Get Standard Resources, Describe Custom Resources, Get Pod Logs & Events, Get Pod Resource Utilization, Get Running Configuration, Get Patroni Output, Run DB Queries Runs multiple Kubernetes and psql commands to report on the health of a postgres cluster. Docs
k8s-triage-deploymentreplicas-taskset k8s Fetch Logs, Get Related Events, Check Deployment Replicas Triages issues related to a deployment's replicas. Docs
k8s-triage-patroni-taskset k8s Get Patroni Status, Get Pods Status, Fetch Logs Taskset to triage issues related to patroni. Docs
k8s-triage-statefulset-taskset k8s Check StatefulSets Replicas Ready, Get Events For The StatefulSet, Get StatefulSet Logs, Get StatefulSet Manifests Dump A taskset for troubleshooting issues for StatefulSets and their related resources. Docs
k8s-troubleshoot-deployment-taskset k8s Troubleshoot Resourcing, Troubleshoot Events, Troubleshoot PVC, Troubleshoot Pods A taskset for troubleshooting general issues associated with typical kubernetes deployment resources. Supports API interactions via both the API client and Kubectl binary through RunWhen Shell Services. Docs
kong-ingress-health-gcp-promql-sli kong Get Access Token, Get HTTP Error Rate, Get Upstream Health, Get Request Latency Rate, Generate Kong Ingress Score Uses promql on the Ops Suite API to determine the health of a Kong managed ingress resource and pushes the result as an SLI metric. Produces a 1 for a healthy resource, or 0 for an unhealthy resource. Docs
mongodb-health-gcp-promql-sli mongodb Get Access Token, Get Instance Status, Get Connection Utilization Rate, Get MongoDB Member State Health, Get MongoDB Replication Lag, Get MongoDB Queue Size, Get Assertion Rate, Generate MongoDB Score Uses promql on the Ops Suite API to determine the health of a MongoDB database instance and pushes the result as an SLI metric. Produces a 1 for a healthy resource, or 0 for an unhealthy resource. Docs
msteams-send-message-taskset msteams Send a Message to an MS Teams Channel Send a message to an MS Teams channel. Docs
opsgenie-alert-taskset opsgenie Get Opsgenie System Info, Create An Alert Create an alert in Opsgenie. Docs
ping-host-availability-sli ping Ping host and collect packet loss percentage Ping a host and retrieve packet loss percentage. Docs
pingdom-health-sli pingdom Check Pingdom Health Check health of Pingdom platform. Docs
prometheus-queryinstant-transform-sli prometheus Querying Prometheus Instance And Pushing Aggregated Data Run a PromQL query against Prometheus instant query API, perform a provided transform, and return the result. Docs
prometheus-queryrange-transform-sli prometheus Querying Prometheus Instance And Pushing Aggregated Data Run a PromQL query against Prometheus range query API, perform a provided transform, and return the result. Docs
remote-http-ok-sli remote Checking HTTP URL Is Available And Timely Check that an HTTP endpoint is healthy and responds within a target latency. Docs
rest-basicauth-sli rest Request Data From Rest Endpoint A general purpose REST SLI for querying and extracting data from a REST endpoint that uses a basic auth flow. Docs
rest-explicitoauth2-basicauth-sli rest Request Data From Rest Endpoint A REST SLI for querying and extracting data from a REST endpoint that needs an explicit oauth2 flow, where the token acquisition is handled using basic auth. Docs
rest-explicitoauth2-tokenheader-sli rest Request Data From Rest Endpoint A REST SLI for querying and extracting data from a REST endpoint that needs an explicit oauth2 flow, where an access token must be acquired with a bearer token. Docs
rest-generic-sli rest Request Data From Rest Endpoint A general purpose REST SLI for querying and extracting data from a REST endpoint that uses an implicit oauth2 flow. Docs
rocketchat-sendmessage-taskset rocketchat Send Chat Message Sends a static Rocketchat message via webhook. Contains optional configuration for including runsession info. Docs
slack-sendmessage-taskset slack Send Chat Message Sends a static Slack message via webhook. Contains optional configuration for including runsession info. Docs
sli-alert-threshold-sli sli Check If SLI Within Incident Threshold An SLI that monitors another SLI submitting a 0-1 health score and immediately triggers a taskset when that score falls below a threshold. When this SLI detects a score below the threshold, it submits a 1 to denote that a signal was sent, and returns to 0 once the monitored SLI is healthy. Docs
sysdig-monitor-metric-sli sysdig Query Sysdig Metric Data And Pushing Metric Queries the Sysdig data API to fetch metric data. Docs
sysdig-monitor-promqlmetric-sli sysdig Querying PromQL Endpoint And Pushing Metric Data Queries the Sysdig data API with a PromQL query to fetch metric data. Docs
twitter-query-tweets-sli twitter Query Twitter Queries Twitter to count the number of tweets within a specified time range for a specific user handle. Docs
twitter-query-tweets-taskset twitter Query Twitter Queries Twitter to fetch tweets within a specified time range for a specific user handle and add them to a report. Docs
uptimecom-component-ok-sli uptimecom Check If Vault Endpoint Is Healthy Check the status of an Uptime.com component for a given site. It compares the operational state of the component with the list of allowed states, resulting in a 1 when acceptable, and 0 when not. Docs
vault-ok-sli vault Check If Vault Endpoint Is Healthy Check the health of a Vault server. The response code is used to determine if the service is healthy, resulting in a metric of 1 if it is, or 0 if not. Docs
web-triage-taskset web Validate Platform Egress, Perform Inspection On URL Troubleshoot and triage a URL to inspect it for common issues such as an expired certificate, missing DNS records, etc. Docs
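
Several SLIs in the index above reduce a check to a binary metric (1 for healthy, 0 for unhealthy), and sli-alert-threshold-sli emits a 1 when a monitored 0-1 health score drops below a threshold. The Python sketch below illustrates only that scoring convention. It is not the codebundles' Robot Framework implementation: the names `ProbeResult`, `http_ok_metric`, and `threshold_signal` are hypothetical, and treating a latency exactly equal to the limit as a pass is an assumption.

```python
from dataclasses import dataclass


@dataclass
class ProbeResult:
    """Outcome of one hypothetical HTTP probe (illustrative type, not from the repo)."""
    succeeded: bool          # True if the request completed without error
    elapsed_seconds: float   # wall-clock duration of the request


def http_ok_metric(result: ProbeResult, max_latency_seconds: float) -> int:
    """Return 1 if the probe succeeded within the latency window, else 0.

    Assumption: a latency exactly equal to the limit still counts as a pass.
    """
    if result.succeeded and result.elapsed_seconds <= max_latency_seconds:
        return 1
    return 0


def threshold_signal(health_score: float, threshold: float) -> int:
    """Return 1 when a 0-1 health score is below the threshold, else 0."""
    return 1 if health_score < threshold else 0


if __name__ == "__main__":
    # A fast, successful probe passes; a slow or failed probe does not.
    print(http_ok_metric(ProbeResult(True, 0.2), max_latency_seconds=1.0))
    print(http_ok_metric(ProbeResult(True, 1.5), max_latency_seconds=1.0))
    # A health score under the threshold raises the signal.
    print(threshold_signal(0.4, threshold=0.5))
```

Keeping the metric functions free of network calls makes the 1/0 convention easy to check in isolation; a real probe would populate `ProbeResult` from an actual request.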