FYI, we still have not had a successful run in over a week. The job just ends for some reason, and there are no logs/events that specify why.
Those logs are aggregated, so it's possible they're from two separate pods, i.e. the first one exited and then a second one started? I can't see the info on past pods anymore, though. Was this cron job recently recreated?
Searching for some of these previously run pod IDs in the logs, I see:
```json
{
  "jsonPayload": {
    "reportingComponent": "",
    "involvedObject": {
      "uid": "443fbe7e-34c5-4a09-8147-f9a4728922fd",
      "apiVersion": "v1",
      "resourceVersion": "106807510",
      "fieldPath": "spec.containers{run-score}",
      "name": "daily-score-1621908000-6qwtc",
      "kind": "Pod",
      "namespace": "default"
    },
    "reportingInstance": "",
    "message": "Stopping container run-score",
    "type": "Normal",
    "source": {
      "component": "kubelet",
      "host": "gke-openssf-default-pool-aeaa7a9c-wcz1"
    },
    "apiVersion": "v1",
    "metadata": {
      "uid": "4ad5f83e-33e9-4467-b07b-c59da7fa5e49",
      "name": "daily-score-1621908000-6qwtc.16822f7b9e0be112",
      "managedFields": [
        ...OMITTED....
      ],
      "resourceVersion": "180503",
      "creationTimestamp": "2021-05-25T02:50:23Z",
      "namespace": "default"
    },
    "eventTime": null,
    "reason": "Killing",
    "kind": "Event"
  },
  "resource": {
    "type": "k8s_pod",
    "labels": {
      "namespace_name": "default",
      "location": "us-central1-c",
      "cluster_name": "openssf",
      "pod_name": "daily-score-1621908000-6qwtc",
      "project_id": "openssf"
    }
  },
  "timestamp": "2021-05-25T02:50:23Z",
  "severity": "INFO",
  "logName": "projects/openssf/logs/events",
  "receiveTimestamp": "2021-05-25T02:50:28.893360951Z"
}
```
In particular, the `"reason": "Killing"` bit indicates that it got evicted by Kubernetes somehow. Perhaps due to resource constraints (like OOM), now that you're running some batch workers as well?
> Those logs are aggregated, so it's possible they're from two separate pods, i.e. the first one exited and then a second one started? I can't see the info on past pods anymore, though. Was this cron job recently recreated?
Same, I can't see the info on past pods either, which makes this difficult to debug. I would be surprised if there were 2 pods running, though; we have a policy to schedule only one at a time.
> Perhaps due to resource constraints (like OOM), now that you're running some batch workers as well?
The PubSub worker is using ~30MB of memory, so it shouldn't affect the regular cron job. If this is an OOM, my guess would be the ArchiveURL calls that were added recently; I think we end up storing the archives in memory. Should we disable the FrozenDeps and SecurityPolicy checks to see if that helps? @laurentsimon FYI.
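To make the OOM theory concrete, here is a minimal sketch of the suspected pattern (hypothetical names and URL, not the actual scorecard code path): every check that needs the archive downloads and buffers its own full copy in memory, so a few large repos can push the pod past its limit.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
)

// fetchArchive buffers the entire tarball in memory. For a large repo this
// can be hundreds of MB; if several checks each call it, the copies add up.
func fetchArchive(url string) ([]byte, error) {
	resp, err := http.Get(url)
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()
	return io.ReadAll(resp.Body) // whole archive held in memory at once
}

func main() {
	// Placeholder URL, for illustration only.
	data, err := fetchArchive("https://example.com/repo/archive.tar.gz")
	if err != nil {
		fmt.Println("fetch failed:", err)
		return
	}
	fmt.Printf("held %d bytes in memory\n", len(data))
}
```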
Try disabling FrozenDeps, yes. It's on my TODO list to share the archive between sub-checks, either by restructuring the code or by using the BLOB_CACHE option if possible. If this is the culprit, I'll bump the priority to do it.
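A rough sketch of what sharing the archive between sub-checks could look like (illustrative only; the real fix may go through BLOB_CACHE or a code restructure instead): fetch once per run and hand every check the same copy.

```go
package main

import (
	"fmt"
	"sync"
)

// archiveCache fetches the archive at most once per run and shares the
// result across sub-checks. Names here are hypothetical, not scorecard APIs.
type archiveCache struct {
	once sync.Once
	data []byte
	err  error
}

// Get invokes fetch exactly once, no matter how many checks call it.
func (c *archiveCache) Get(fetch func() ([]byte, error)) ([]byte, error) {
	c.once.Do(func() {
		c.data, c.err = fetch()
	})
	return c.data, c.err
}

func main() {
	var c archiveCache
	fetch := func() ([]byte, error) { return []byte("tarball bytes"), nil }
	a, _ := c.Get(fetch) // first caller pays for the download
	b, _ := c.Get(fetch) // later callers reuse the same bytes
	fmt.Println(len(a), len(b))
}
```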
Do the pods have some disk/storage access? Is there a reason we're storing the .tar in memory and not on disk? I suppose disk will be slower, and we'd need to clean up after each test. There are also (very small) directory-traversal risks to be aware of when unzipping/untarring.
Other reasons?
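On the traversal point, here is a hedged sketch of the standard guard when extracting a tar to disk (not scorecard code): reject any entry whose cleaned path escapes the destination directory.

```go
package untar

import (
	"archive/tar"
	"fmt"
	"io"
	"os"
	"path/filepath"
	"strings"
)

// Extract writes archive entries under dest, rejecting paths that would
// escape it (e.g. "../../etc/passwd").
func Extract(r io.Reader, dest string) error {
	tr := tar.NewReader(r)
	for {
		hdr, err := tr.Next()
		if err == io.EOF {
			return nil // end of archive
		}
		if err != nil {
			return err
		}
		// Join cleans the path, so "../" sequences collapse; verify the
		// result still lives inside dest before touching the filesystem.
		target := filepath.Join(dest, hdr.Name)
		if !strings.HasPrefix(target, filepath.Clean(dest)+string(os.PathSeparator)) {
			return fmt.Errorf("illegal path in archive: %q", hdr.Name)
		}
		switch hdr.Typeflag {
		case tar.TypeDir:
			if err := os.MkdirAll(target, 0o755); err != nil {
				return err
			}
		case tar.TypeReg:
			if err := os.MkdirAll(filepath.Dir(target), 0o755); err != nil {
				return err
			}
			f, err := os.OpenFile(target, os.O_CREATE|os.O_WRONLY|os.O_TRUNC, os.FileMode(hdr.Mode))
			if err != nil {
				return err
			}
			if _, err := io.Copy(f, tr); err != nil {
				f.Close()
				return err
			}
			f.Close()
		}
	}
}
```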
> Try disabling FrozenDeps, yes. It's on my TODO list to share the archive between sub-checks, either by restructuring the code or by using the BLOB_CACHE option if possible. If this is the culprit, I'll bump the priority to do it.
It is not a good idea to maintain state.
> Do the pods have some disk/storage access? Is there a reason we're storing the .tar in memory and not on disk? I suppose disk will be slower, and we'd need to clean up after each test. There are also (very small) directory-traversal risks to be aware of when unzipping/untarring.
> Other reasons?
Does anyone know how to get the stack trace for these killed pods? I think we are only guessing about the cause so far.
In the meantime, I sent out https://github.com/ossf/scorecard/pull/518 to see if we can unblock ourselves here. I think it's important to get the cron job running again.
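On the stack-trace question above, one option (a sketch, assuming we can add a signal handler to the cron binary) is to dump goroutine stacks when the kubelet sends SIGTERM during a graceful eviction. Note it won't fire for OOMKilled, which delivers an uncatchable SIGKILL.

```go
package main

import (
	"os"
	"os/signal"
	"runtime/pprof"
	"syscall"
)

func main() {
	sigs := make(chan os.Signal, 1)
	signal.Notify(sigs, syscall.SIGTERM)
	go func() {
		<-sigs
		// Dump every goroutine's stack to stderr before the pod dies,
		// so the kill site shows up in the aggregated logs.
		pprof.Lookup("goroutine").WriteTo(os.Stderr, 2)
		os.Exit(1)
	}()

	// ... the cron job's real work would run here ...
	select {}
}
```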
The pod kill reason would have been logged within k8s. Usually it is OOM:
```
k describe po daily-score-manual-x8qw9-s6vpv
Name:         daily-score-manual-x8qw9-s6vpv
Namespace:    default
Priority:     0
Node:         gke-openssf-default-pool-aeaa7a9c-dktp/10.128.0.4
Start Time:   Fri, 28 May 2021 03:35:38 +0000
Labels:       controller-uid=66839239-b7b8-4acf-b409-e668dff3cb18
              job-name=daily-score-manual-x8qw9
Annotations:  <none>
Status:       Running
IP:           10.0.8.19
IPs:
  IP:  10.0.8.19
Controlled By:  Job/daily-score-manual-x8qw9
Containers:
  run-score:
    Container ID:   docker://27fcb26c61f8bfe05b9cdb24f9fdc983c88b57d3cb9b6bf46fb018ec1f19cd1f
    Image:          gcr.io/openssf/cron:latest
    Image ID:       docker-pullable://gcr.io/openssf/cron@sha256:22e8b7d0cb74dfb33f81200143caf7c132a6d9f688242447be92aef8d6c1a92b
    Port:           <none>
    Host Port:      <none>
    State:          Running
      Started:      Fri, 28 May 2021 12:43:38 +0000
    Last State:     Terminated
      Reason:       OOMKilled
      Exit Code:    137
      Started:      Fri, 28 May 2021 09:41:17 +0000
      Finished:     Fri, 28 May 2021 12:43:37 +0000
    Ready:          True
    Restart Count:  3
    Limits:
      memory:  1Gi
    Requests:
      memory:  1Gi
    Environment:
      GITHUB_AUTH_TOKEN:  <set to the key 'token' in secret 'github'>  Optional: false
      GCS_BUCKET:         ossf-scorecards
      BLOB_URL:           gs://ossf-scorecards-cache
      USE_BLOB_CACHE:     true
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-jhqm8 (ro)
Conditions:
  Type              Status
  Initialized       True
  Ready             True
  ContainersReady   True
  PodScheduled      True
Volumes:
  default-token-jhqm8:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  default-token-jhqm8
    Optional:    false
QoS Class:       Burstable
Node-Selectors:  <none>
Tolerations:     node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                 node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:          <none>
```
OOM Killed
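Given the 1Gi limit and the OOMKilled last state above, a cheap diagnostic (a sketch, assuming we're free to add a goroutine to the cron binary) would be to log heap usage periodically, so the climb toward the limit is visible in the pod logs before the kill:

```go
package main

import (
	"log"
	"runtime"
	"time"
)

// logMemUsage periodically records heap and total memory so a climb toward
// the pod's limit shows up in the logs before the OOM kill.
func logMemUsage(interval time.Duration) {
	var m runtime.MemStats
	for range time.Tick(interval) {
		runtime.ReadMemStats(&m)
		log.Printf("mem: heap=%dMiB sys=%dMiB", m.HeapAlloc>>20, m.Sys>>20)
	}
}

func main() {
	go logMemUsage(30 * time.Second)
	// ... the cron job's real work would run here ...
	select {}
}
```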
The cron job still has not finished a successful run. I don't want to spend any more cycles on debugging/fixing it. Instead, I propose that we work on productionizing the PubSub cron. I'm closing this issue; if someone feels otherwise, please feel free to re-open.
**Describe the bug**
The cron job seems to be looping. After completing a few hundred repos, it starts again from the top.