runwhen-contrib / rw-public-codecollection

RunWhen Public Codecollection Repository - Open Source troubleshooting runbook library for Kubernetes and cloud infrastructure components.
Apache License 2.0

GCP Incident Status is crashlooping #89

Open stewartshea opened 1 year ago

stewartshea commented 1 year ago

Expected Behavior

The SLI should run correctly without error.

Current Behavior

There's not much in the logs, but the GCP status codebundle is crashlooping. I'll try to dig into other robot logs that hopefully exist in GCP for this (though the UI isn't showing any logs).

{"level": "DEBUG", "unixtime": 1678460757.4621816, "thread": 139645435500352, "location": "/usr/local/lib/python3.9/site-packages/urllib3/connectionpool.py:_new_conn:228", "app": {"name": "runrobot", "releaseId": "v1", "message": "Starting new HTTP connection (1): pushgateway.prometheus.svc:9091"}}
{"level": "DEBUG", "unixtime": 1678460757.5899405, "thread": 139645435500352, "location": "/usr/local/lib/python3.9/site-packages/urllib3/connectionpool.py:_make_request:456", "app": {"name": "runrobot", "releaseId": "v1", "message": "http://pushgateway.prometheus.svc:9091 \"POST /metrics/job/pushgateway/ HTTP/1.1\" 200 0"}}
{"level": "DEBUG", "unixtime": 1678460757.5930176, "thread": 139645435500352, "location": "/robot-runtime/runrobot.py:main:285", "app": {"name": "runrobot", "releaseId": "v1", "message": "starting .robot file at /collection/codebundles/gcp-serviceshealth/sli.robot, sending logs to ./robot_logs"}}
{"level": "DEBUG", "unixtime": 1678460757.6573029, "thread": 139645435500352, "location": "/usr/local/lib/python3.9/site-packages/urllib3/connectionpool.py:_new_conn:1003", "app": {"name": "runrobot", "releaseId": "v1", "message": "Starting new HTTPS connection (1): vault.beta.runwhen.com:443"}}
{"level": "DEBUG", "unixtime": 1678460758.1057024, "thread": 139645435500352, "location": "/usr/local/lib/python3.9/site-packages/urllib3/connectionpool.py:_make_request:456", "app": {"name": "runrobot", "releaseId": "v1", "message": "https://vault.beta.runwhen.com:443 \"POST /v1/auth/kubernetes-beta-location-us-west2-01/login HTTP/1.1\" 200 827"}}
{"level": "DEBUG", "unixtime": 1678460758.1085622, "thread": 139645435500352, "location": "/usr/local/lib/python3.9/site-packages/urllib3/connectionpool.py:_new_conn:1003", "app": {"name": "runrobot", "releaseId": "v1", "message": "Starting new HTTPS connection (1): vault.beta.runwhen.com:443"}}
{"level": "DEBUG", "unixtime": 1678460758.2624612, "thread": 139645435500352, "location": "/usr/local/lib/python3.9/site-packages/urllib3/connectionpool.py:_make_request:456", "app": {"name": "runrobot", "releaseId": "v1", "message": "https://vault.beta.runwhen.com:443 \"GET /v1/workspaces/data/getting-started/rw-service-account-username HTTP/1.1\" 200 328"}}
{"level": "DEBUG", "unixtime": 1678460758.3586147, "thread": 139645435500352, "location": "/usr/local/lib/python3.9/site-packages/urllib3/connectionpool.py:_new_conn:1003", "app": {"name": "runrobot", "releaseId": "v1", "message": "Starting new HTTPS connection (1): vault.beta.runwhen.com:443"}}
{"level": "DEBUG", "unixtime": 1678460758.7966275, "thread": 139645435500352, "location": "/usr/local/lib/python3.9/site-packages/urllib3/connectionpool.py:_make_request:456", "app": {"name": "runrobot", "releaseId": "v1", "message": "https://vault.beta.runwhen.com:443 \"POST /v1/auth/kubernetes-beta-location-us-west2-01/login HTTP/1.1\" 200 827"}}
{"level": "DEBUG", "unixtime": 1678460758.7997262, "thread": 139645435500352, "location": "/usr/local/lib/python3.9/site-packages/urllib3/connectionpool.py:_new_conn:1003", "app": {"name": "runrobot", "releaseId": "v1", "message": "Starting new HTTPS connection (1): vault.beta.runwhen.com:443"}}
{"level": "DEBUG", "unixtime": 1678460758.9615796, "thread": 139645435500352, "location": "/usr/local/lib/python3.9/site-packages/urllib3/connectionpool.py:_make_request:456", "app": {"name": "runrobot", "releaseId": "v1", "message": "https://vault.beta.runwhen.com:443 \"GET /v1/workspaces/data/getting-started/rw-service-account-pw HTTP/1.1\" 200 320"}}
{"level": "DEBUG", "unixtime": 1678460759.056962, "thread": 139645435500352, "location": "/usr/local/lib/python3.9/site-packages/urllib3/connectionpool.py:_new_conn:1003", "app": {"name": "runrobot", "releaseId": "v1", "message": "Starting new HTTPS connection (1): papi.beta.runwhen.com:443"}}
{"level": "DEBUG", "unixtime": 1678460759.405444, "thread": 139645435500352, "location": "/usr/local/lib/python3.9/site-packages/urllib3/connectionpool.py:_make_request:456", "app": {"name": "runrobot", "releaseId": "v1", "message": "https://papi.beta.runwhen.com:443 \"POST /api/v3/token/ HTTP/1.1\" 200 443"}}
{"level": "INFO", "unixtime": 1678460759.4084346, "thread": 139645435500352, "location": "/robot-runtime/runrobot.py:main:311", "app": {"name": "runrobot", "releaseId": "v1", "message": "running task titles from RW_TASK_TITLES *"}}
{"level": "DEBUG", "unixtime": 1678460759.4106758, "thread": 139645435500352, "location": "/usr/local/lib/python3.9/site-packages/urllib3/connectionpool.py:_new_conn:228", "app": {"name": "runrobot", "releaseId": "v1", "message": "Starting new HTTP connection (1): pushgateway.prometheus.svc:9091"}}
{"level": "DEBUG", "unixtime": 1678460759.429455, "thread": 139645435500352, "location": "/usr/local/lib/python3.9/site-packages/urllib3/connectionpool.py:_make_request:456", "app": {"name": "runrobot", "releaseId": "v1", "message": "http://pushgateway.prometheus.svc:9091 \"POST /metrics/job/pushgateway/ HTTP/1.1\" 200 0"}}
{"level": "DEBUG", "unixtime": 1678460759.4593892, "thread": 139645435500352, "location": "/usr/local/lib/python3.9/site-packages/urllib3/connectionpool.py:_new_conn:228", "app": {"name": "runrobot", "releaseId": "v1", "message": "Starting new HTTP connection (1): pushgateway.prometheus.svc:9091"}}
{"level": "DEBUG", "unixtime": 1678460759.495057, "thread": 139645435500352, "location": "/usr/local/lib/python3.9/site-packages/urllib3/connectionpool.py:_make_request:456", "app": {"name": "runrobot", "releaseId": "v1", "message": "http://pushgateway.prometheus.svc:9091 \"POST /metrics/job/pushgateway/ HTTP/1.1\" 200 0"}}
{"level": "DEBUG", "unixtime": 1678460782.761559, "thread": 139645435500352, "location": "/usr/local/lib/python3.9/site-packages/urllib3/connectionpool.py:_new_conn:1003", "app": {"name": "runrobot", "releaseId": "v1", "message": "Starting new HTTPS connection (1): status.cloud.google.com:443"}}
{"level": "DEBUG", "unixtime": 1678460783.8673108, "thread": 139645435500352, "location": "/usr/local/lib/python3.9/site-packages/urllib3/connectionpool.py:_make_request:456", "app": {"name": "runrobot", "releaseId": "v1", "message": "https://status.cloud.google.com:443 \"GET /incidents.json HTTP/1.1\" 200 3032651"}}

Possible Solution

Steps to Reproduce

  1. Configure gcp-serviceshealth

stewartshea commented 1 year ago

Oddly, this runs fine in the codecollection devtools, so it's something platform- or environment-related.

stewartshea commented 1 year ago

Based on the fact that the log dies right after the incidents.json request and I saw an OOM on the pod, I think the incidents.json file is just very large and is crashing the pod. It's surprising, given that it doesn't take much to run in something like Gitpod (which isn't a super large instance), but it feels like the incidents file is too big and we need to optimize the query.

stewartshea commented 1 year ago

The incidents.json file is about 2.9 MB (currently). That isn't massive, so it feels like the problem is in the way we parse through it. I'm going to attempt a rewrite with jmespath to see if we can lighten the load on this codebundle, and then see where we stand.
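
To make the filtering step concrete, here is a minimal sketch using only the Python standard library (not jmespath, and not the codebundle's actual code). The field names `end`, `affected_products`, and `title` are assumptions about the shape of incidents.json, and `open_incidents` is a hypothetical helper. Note that `json.loads` still parses the whole document into memory before any filtering happens.

```python
import json

def open_incidents(raw_json, product_title):
    """Return incidents that have no "end" value and list the given product.

    Assumed (not confirmed) schema: a JSON array of incident objects, each with
    an optional "end" timestamp and an "affected_products" list of {"title": ...}.
    """
    incidents = json.loads(raw_json)  # the whole document is parsed into memory here
    return [
        incident
        for incident in incidents
        if not incident.get("end")
        and any(
            product.get("title") == product_title
            for product in incident.get("affected_products", [])
        )
    ]

# Made-up sample in the assumed shape; not real incident data.
sample = json.dumps([
    {"id": "a1", "end": "2023-03-01T00:00:00+00:00",
     "affected_products": [{"title": "Google Compute Engine"}]},
    {"id": "b2",
     "affected_products": [{"title": "Google Compute Engine"}]},
    {"id": "c3",
     "affected_products": [{"title": "Cloud Storage"}]},
])

matches = open_incidents(sample, "Google Compute Engine")
print([incident["id"] for incident in matches])  # ['b2']
```

A query library can shorten the filter expression, but it operates on the same fully parsed structure, so it would not by itself reduce the memory needed to load the file.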

stewartshea commented 1 year ago

I wrote up a new version of this with jmespath and still noticed the OOM issues. I raised the resource limits in my environment and it now runs successfully, but we can tell this is a memory-heavy process with a long run time:

kubectl top pod sandbox--gcp-status--sli--9fba3-6c7586595c-l7xsb -n sandbox
NAME                                               CPU(cores)   MEMORY(bytes)   
sandbox--gcp-status--sli--9fba3-6c7586595c-l7xsb   68m          213Mi         

From the robot logs, the call to download the JSON file alone takes a very long time. We need to think about how to optimize this, since it doesn't make a whole lot of sense to run multiple copies of this fetch on a frequent basis. I'd suspect it's something we split into a single service that we host (e.g. post the JSON to our own public GCP bucket or a shared volume), and then the codebundle just filters that response.

Otherwise we end up with a LOT of wasted cycles at roughly 3 MB per call, every 30s, for every SLI.
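
As a rough worked example of that cost, the arithmetic below uses the response size from the `GET /incidents.json` log line (3032651 bytes) and the 30s interval mentioned above. `daily_download_bytes` and the `sli_count` parameter are hypothetical names introduced for illustration only.

```python
PAYLOAD_BYTES = 3_032_651   # response size reported in the log for GET /incidents.json
INTERVAL_SECONDS = 30       # SLI run interval mentioned in the thread
SECONDS_PER_DAY = 86_400

def daily_download_bytes(payload_bytes, interval_seconds, sli_count=1):
    """Bytes downloaded per day when every SLI fetches the full file on each run."""
    calls_per_day = SECONDS_PER_DAY // interval_seconds
    return payload_bytes * calls_per_day * sli_count

per_sli = daily_download_bytes(PAYLOAD_BYTES, INTERVAL_SECONDS)
print(f"{per_sli / 1e9:.2f} GB per SLI per day")  # 8.73 GB per SLI per day
```

That is 2,880 fetches per day per SLI, and the total scales linearly with the number of SLIs running the codebundle.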


stewartshea commented 1 year ago

Based on the test with the old codebundle, we can see that the new version is still more efficient, but that doesn't negate the previous comments about needing to build a cache for this.

Stats from the current codebundle version (lower memory utilization but longer overall codebundle runtime):

kubectl top pod runwhen-beta--gcp-status--sli--9fba3-6cb66cc48d-wb67x -n sandbox
NAME                                               CPU(cores)   MEMORY(bytes)
sandbox--gcp-status--sli--9fba3-6cb66cc48d-wb67x   101m         137Mi
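
The cache idea raised in this thread could look something like the following sketch: one component downloads the file at most once per interval, and every reader gets the stored copy. This is only an in-process illustration under stated assumptions, not a design for the hosted service. `CachedFetcher`, `fake_fetch`, and the 30-second TTL are hypothetical, and the real download call is replaced by a stand-in so the example runs without network access.

```python
import time

class CachedFetcher:
    """Call `fetch` at most once per `ttl_seconds`; serve the stored result otherwise."""

    def __init__(self, fetch, ttl_seconds=30.0, clock=time.monotonic):
        self._fetch = fetch
        self._ttl = ttl_seconds
        self._clock = clock
        self._value = None
        self._expires_at = None  # None means nothing has been fetched yet

    def get(self):
        now = self._clock()
        if self._expires_at is None or now >= self._expires_at:
            self._value = self._fetch()
            self._expires_at = now + self._ttl
        return self._value

# Stand-in for the real download; counts how often it is invoked.
calls = []
def fake_fetch():
    calls.append(1)
    return "[]"

fake_now = [0.0]  # controllable clock so the example does not need to sleep
cache = CachedFetcher(fake_fetch, ttl_seconds=30.0, clock=lambda: fake_now[0])

cache.get()
cache.get()          # served from the cache, no second download
fake_now[0] = 31.0   # move past the TTL
cache.get()          # triggers a refresh
print(len(calls))    # 2
```

Sharing the result across pods would still need an external store such as the bucket or shared volume suggested earlier; the class above only shows the refresh logic.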


stewartshea commented 1 year ago

Tested hosting the file in a storage bucket, which didn't seem to improve the fetch performance.

jon-funk commented 1 year ago

@j-pye