stewartshea opened 1 year ago
Oddly, this runs fine in the codecollection devtools... so it's something platform- or environment-related.
Based on the fact that the log dies and I saw an OOM on the pod, I think the incidents.json file is just very large and is OOM-killing the pod. It's surprising, given that it runs without issue in something like Gitpod (which isn't a particularly large instance), but it feels like the incidents file is simply too big for the pod's limits and we need to optimize the query.
The incidents.json file is about 2.9 MB (currently). While that isn't massive, it feels like the way we parse through it is where the problems arise. I'm going to attempt a rewrite with jmespath to see if we can lighten the load on this codebundle and then see where we stand.
I wrote a new version of this with jmespath and still saw the OOM issues. After raising the resource limits in my environment it runs successfully, but we can tell this is a high-memory process with a long runtime:
```
kubectl top pod sandbox--gcp-status--sli--9fba3-6c7586595c-l7xsb -n sandbox
NAME                                               CPU(cores)   MEMORY(bytes)
sandbox--gcp-status--sli--9fba3-6c7586595c-l7xsb   68m          213Mi
```
From the robot logs, the call to download the JSON file is by itself a very long one... We need to think about how to optimize this, since it doesn't make much sense to run multiple copies of this fetch on a frequent basis. I suspect we split it out into a single service that we host (e.g. post the JSON to our own public GCP bucket or a shared volume), and the codebundle just filters that response.
Otherwise we end up with a LOT of wasted cycles at ~3 MB per call, every 30 s, for every SLI.
Based on the test against the old codebundle, the new version is indeed more efficient, but that doesn't remove the need, noted above, to build a cache for this.
Stats from the current codebundle version: lower memory utilization, but longer overall codebundle runtime:
```
kubectl top pod runwhen-beta--gcp-status--sli--9fba3-6cb66cc48d-wb67x -n sandbox
NAME                                               CPU(cores)   MEMORY(bytes)
sandbox--gcp-status--sli--9fba3-6cb66cc48d-wb67x   101m         137Mi
```
I also tested hosting the file in a storage bucket, which didn't noticeably improve fetch performance.
@j-pye
Expected Behavior
The SLI should run correctly without error.
Current Behavior
Not much in the logs, but the GCP status codebundle is crashlooping. I'll try to dig into other robot logs that hopefully exist in GCP for this (though the UI isn't showing any logs).
Possible Solution
Steps to Reproduce