FYI, we still have not had a successful run in over a week. The job just ends for some reason, and there are no logs/events that specify why.
Those logs are aggregated, so it's possible they're from two separate pods, i.e. the first one exited and then a second one started? I can't see the info on past pods anymore, though. Was this cron job recently recreated?
Searching for some of these previously run pod IDs in the logs, I see:
```json
{
  "jsonPayload": {
    "reportingComponent": "",
    "involvedObject": {
      "uid": "443fbe7e-34c5-4a09-8147-f9a4728922fd",
      "apiVersion": "v1",
      "resourceVersion": "106807510",
      "fieldPath": "spec.containers{run-score}",
      "name": "daily-score-1621908000-6qwtc",
      "kind": "Pod",
      "namespace": "default"
    },
    "reportingInstance": "",
    "message": "Stopping container run-score",
    "type": "Normal",
    "source": {
      "component": "kubelet",
      "host": "gke-openssf-default-pool-aeaa7a9c-wcz1"
    },
    "apiVersion": "v1",
    "metadata": {
      "uid": "4ad5f83e-33e9-4467-b07b-c59da7fa5e49",
      "name": "daily-score-1621908000-6qwtc.16822f7b9e0be112",
      "managedFields": [
        ...OMITTED....
      ],
      "resourceVersion": "180503",
      "creationTimestamp": "2021-05-25T02:50:23Z",
      "namespace": "default"
    },
    "eventTime": null,
    "reason": "Killing",
    "kind": "Event"
  },
  "resource": {
    "type": "k8s_pod",
    "labels": {
      "namespace_name": "default",
      "location": "us-central1-c",
      "cluster_name": "openssf",
      "pod_name": "daily-score-1621908000-6qwtc",
      "project_id": "openssf"
    }
  },
  "timestamp": "2021-05-25T02:50:23Z",
  "severity": "INFO",
  "logName": "projects/openssf/logs/events",
  "receiveTimestamp": "2021-05-25T02:50:28.893360951Z"
}
```
In particular, the `"reason": "Killing"` bit indicates that it got evicted by Kubernetes somehow. Perhaps due to resource constraints (like OOM), now that you're running some batch workers as well?
> Those logs are aggregated, so it's possible they're from two separate pods, i.e. the first one exited and then a second one started? I can't see the info on past pods anymore, though. Was this cron job recently recreated?
Same, I can't see the info on past pods either, which makes this difficult to debug. I would be surprised if there were 2 pods running, though; we have a policy to schedule only one at a time.
> Perhaps due to resource constraints (like OOM), now that you're running some batch workers as well?
The PubSub worker is using ~30MB of memory, so it shouldn't affect the regular cron job. If this is an OOM, my guess would be the ArchiveURL calls that were added recently; I think we end up storing the archives in memory. Should we disable the FrozenDeps and SecurityPolicy checks to see if that helps? @laurentsimon FYI.
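To make the OOM theory concrete, here is a minimal sketch of the suspected pattern (hypothetical names and URL, not the actual scorecard code path): every check that needs the archive downloads and buffers its own full copy in memory, so a few large repos can push the pod past its limit.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
)

// fetchArchive buffers the entire tarball in memory. For a large repo this
// can be hundreds of MB; if several checks each call it, the copies add up.
func fetchArchive(url string) ([]byte, error) {
	resp, err := http.Get(url)
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()
	return io.ReadAll(resp.Body) // whole archive held in memory at once
}

func main() {
	// Placeholder URL, for illustration only.
	data, err := fetchArchive("https://example.com/repo/archive.tar.gz")
	if err != nil {
		fmt.Println("fetch failed:", err)
		return
	}
	fmt.Printf("held %d bytes in memory\n", len(data))
}
```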
Try disabling FrozenDeps, yes. It's on my TODO list to share the archive between sub-checks, either by restructuring the code or by using the BLOB_CACHE option if possible. If this is the culprit, I'll bump the priority to do it.
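A rough sketch of what sharing the archive between sub-checks could look like (illustrative only; the real fix may go through BLOB_CACHE or a code restructure instead): fetch once per run and hand every check the same copy.

```go
package main

import (
	"fmt"
	"sync"
)

// archiveCache fetches the archive at most once per run and shares the
// result across sub-checks. Names here are hypothetical, not scorecard APIs.
type archiveCache struct {
	once sync.Once
	data []byte
	err  error
}

// Get invokes fetch exactly once, no matter how many checks call it.
func (c *archiveCache) Get(fetch func() ([]byte, error)) ([]byte, error) {
	c.once.Do(func() {
		c.data, c.err = fetch()
	})
	return c.data, c.err
}

func main() {
	var c archiveCache
	fetch := func() ([]byte, error) { return []byte("tarball bytes"), nil }
	a, _ := c.Get(fetch) // first caller pays for the download
	b, _ := c.Get(fetch) // later callers reuse the same bytes
	fmt.Println(len(a), len(b))
}
```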
Do the pods have some disk/storage access? Is there a reason we're storing the .tar in memory and not on disk? I suppose disk will be slower, and we'd need to clean up after each test. There are also (very small) directory-traversal risks to be aware of when unzipping/untarring.
Other reasons?
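On the traversal point, here is a hedged sketch of the standard guard when extracting a tar to disk (not scorecard code): reject any entry whose cleaned path escapes the destination directory.

```go
package untar

import (
	"archive/tar"
	"fmt"
	"io"
	"os"
	"path/filepath"
	"strings"
)

// Extract writes archive entries under dest, rejecting paths that would
// escape it (e.g. "../../etc/passwd").
func Extract(r io.Reader, dest string) error {
	tr := tar.NewReader(r)
	for {
		hdr, err := tr.Next()
		if err == io.EOF {
			return nil // end of archive
		}
		if err != nil {
			return err
		}
		// Join cleans the path, so "../" sequences collapse; verify the
		// result still lives inside dest before touching the filesystem.
		target := filepath.Join(dest, hdr.Name)
		if !strings.HasPrefix(target, filepath.Clean(dest)+string(os.PathSeparator)) {
			return fmt.Errorf("illegal path in archive: %q", hdr.Name)
		}
		switch hdr.Typeflag {
		case tar.TypeDir:
			if err := os.MkdirAll(target, 0o755); err != nil {
				return err
			}
		case tar.TypeReg:
			if err := os.MkdirAll(filepath.Dir(target), 0o755); err != nil {
				return err
			}
			f, err := os.OpenFile(target, os.O_CREATE|os.O_WRONLY|os.O_TRUNC, os.FileMode(hdr.Mode))
			if err != nil {
				return err
			}
			if _, err := io.Copy(f, tr); err != nil {
				f.Close()
				return err
			}
			f.Close()
		}
	}
}
```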
> Try disabling FrozenDeps, yes. It's on my TODO list to share the archive between sub-checks, either by restructuring the code or by using the BLOB_CACHE option if possible. If this is the culprit, I'll bump the priority to do it.
It is not a good idea to maintain state.
> Do the pods have some disk/storage access? Is there a reason we're storing the .tar in memory and not on disk? I suppose disk will be slower, and we'd need to clean up after each test. There are also (very small) directory-traversal risks to be aware of when unzipping/untarring.
> Other reasons?
Does anyone know how to get the stack trace for these killed pods? I think we are only guessing about the cause so far.
In the meantime, I sent out https://github.com/ossf/scorecard/pull/518 to see if we can unblock ourselves here. I think it's important to get the cron job running again.
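On the stack-trace question above, one option (a sketch, assuming we can add a signal handler to the cron binary) is to dump goroutine stacks when the kubelet sends SIGTERM during a graceful eviction. Note it won't fire for OOMKilled, which delivers an uncatchable SIGKILL.

```go
package main

import (
	"os"
	"os/signal"
	"runtime/pprof"
	"syscall"
)

func main() {
	sigs := make(chan os.Signal, 1)
	signal.Notify(sigs, syscall.SIGTERM)
	go func() {
		<-sigs
		// Dump every goroutine's stack to stderr before the pod dies,
		// so the kill site shows up in the aggregated logs.
		pprof.Lookup("goroutine").WriteTo(os.Stderr, 2)
		os.Exit(1)
	}()

	// ... the cron job's real work would run here ...
	select {}
}
```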
The pod kill reason would have been logged within k8s. Usually it is OOM:
```
k describe po daily-score-manual-x8qw9-s6vpv
Name:         daily-score-manual-x8qw9-s6vpv
Namespace:    default
Priority:     0
Node:         gke-openssf-default-pool-aeaa7a9c-dktp/10.128.0.4
Start Time:   Fri, 28 May 2021 03:35:38 +0000
Labels:       controller-uid=66839239-b7b8-4acf-b409-e668dff3cb18
              job-name=daily-score-manual-x8qw9
Annotations:  <none>
Status:       Running
IP:           10.0.8.19
IPs:
  IP:  10.0.8.19
Controlled By:  Job/daily-score-manual-x8qw9
Containers:
  run-score:
    Container ID:   docker://27fcb26c61f8bfe05b9cdb24f9fdc983c88b57d3cb9b6bf46fb018ec1f19cd1f
    Image:          gcr.io/openssf/cron:latest
    Image ID:       docker-pullable://gcr.io/openssf/cron@sha256:22e8b7d0cb74dfb33f81200143caf7c132a6d9f688242447be92aef8d6c1a92b
    Port:           <none>
    Host Port:      <none>
    State:          Running
      Started:      Fri, 28 May 2021 12:43:38 +0000
    Last State:     Terminated
      Reason:       OOMKilled
      Exit Code:    137
      Started:      Fri, 28 May 2021 09:41:17 +0000
      Finished:     Fri, 28 May 2021 12:43:37 +0000
    Ready:          True
    Restart Count:  3
    Limits:
      memory:  1Gi
    Requests:
      memory:  1Gi
    Environment:
      GITHUB_AUTH_TOKEN:  <set to the key 'token' in secret 'github'>  Optional: false
      GCS_BUCKET:         ossf-scorecards
      BLOB_URL:           gs://ossf-scorecards-cache
      USE_BLOB_CACHE:     true
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-jhqm8 (ro)
Conditions:
  Type              Status
  Initialized       True
  Ready             True
  ContainersReady   True
  PodScheduled      True
Volumes:
  default-token-jhqm8:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  default-token-jhqm8
    Optional:    false
QoS Class:       Burstable
Node-Selectors:  <none>
Tolerations:     node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                 node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:          <none>
```
OOM Killed
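Given the 1Gi limit and the OOMKilled last state above, a cheap diagnostic (a sketch, assuming we're free to add a goroutine to the cron binary) would be to log heap usage periodically, so the climb toward the limit is visible in the pod logs before the kill:

```go
package main

import (
	"log"
	"runtime"
	"time"
)

// logMemUsage periodically records heap and total memory so a climb toward
// the pod's limit shows up in the logs before the OOM kill.
func logMemUsage(interval time.Duration) {
	var m runtime.MemStats
	for range time.Tick(interval) {
		runtime.ReadMemStats(&m)
		log.Printf("mem: heap=%dMiB sys=%dMiB", m.HeapAlloc>>20, m.Sys>>20)
	}
}

func main() {
	go logMemUsage(30 * time.Second)
	// ... the cron job's real work would run here ...
	select {}
}
```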
The cron job still has not finished a successful run. I don't want to spend any more cycles on debugging/fixing it. Instead, I propose that we work on productionizing the PubSub cron. I'm closing this issue; if someone feels otherwise, please feel free to re-open.
**Describe the bug**
The cron job seems to be looping. After completing a few hundred repos, it starts again from the top.