ossf / scorecard

OpenSSF Scorecard - Security health metrics for Open Source
https://scorecard.dev
Apache License 2.0

BUG - scorecard cron evicted low on memory. #265

Closed - naveensrinivasan closed this issue 3 years ago

naveensrinivasan commented 3 years ago

Describe the bug
The scorecard cron job was evicted because the node was low on memory. The existing cron job didn't use the blob cache; it was using an in-memory cache.

Name:           daily-score-1615334400-jp8tq
Namespace:      default
Priority:       0
Node:           gke-openssf-default-pool-aeaa7a9c-7sdk/
Start Time:     Wed, 10 Mar 2021 22:11:15 +0000
Labels:         controller-uid=a0894cd4-ae96-45df-9f4f-bf4072ee035c
                job-name=daily-score-1615334400
Annotations:    <none>
Status:         Failed
Reason:         Evicted
Message:        The node was low on resource: memory. Container run-score was using 1807204Ki, which exceeds its request of 0.
IP:

Expected behavior

naveensrinivasan commented 3 years ago

@inferno-chromium FYI...

naveensrinivasan commented 3 years ago

Updated the cron job to use the blob cache.

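For reference, a minimal sketch of what the run-score container in cron.yaml might look like with the blob cache enabled. The image, command, and environment variables are taken from the pod descriptions later in this thread; the CronJob skeleton, name, and schedule are assumptions, not the actual manifest from the repo.

apiVersion: batch/v1beta1          # assumed API version for this 2021-era cluster
kind: CronJob
metadata:
  name: daily-score
spec:
  schedule: "0 0 * * *"            # assumed daily schedule
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: Never
          containers:
            - name: run-score
              image: gcr.io/openssf/cron:latest
              command: ["/bin/sh", "-c", "./cron/cron.sh"]
              env:
                - name: GITHUB_AUTH_TOKEN
                  valueFrom:
                    secretKeyRef:        # matches "<set to the key 'token' in secret 'github'>"
                      name: github
                      key: token
                - name: GCS_BUCKET
                  value: ossf-scorecards
                - name: BLOB_URL
                  value: gs://ossf-scorecards-cache
                - name: USE_BLOB_CACHE
                  value: "true"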

naveensrinivasan commented 3 years ago

The pods are still getting terminated even with USE_BLOB_CACHE. Scorecard is using about 1.7GB of memory, which looks like a memory leak.

k describe po daily-score-manual-f8rqw-dqbgx
Name:           daily-score-manual-f8rqw-dqbgx
Namespace:      default
Priority:       0
Node:           gke-openssf-default-pool-aeaa7a9c-bnbd/
Start Time:     Thu, 11 Mar 2021 14:40:58 +0000
Labels:         controller-uid=723d5a7e-ef40-4e94-920f-df1362a64664
                job-name=daily-score-manual-f8rqw
Annotations:    <none>
Status:         Failed
Reason:         Evicted
Message:        The node was low on resource: memory. Container run-score was using 1785720Ki, which exceeds its request of 0.
IP:
IPs:            <none>
Controlled By:  Job/daily-score-manual-f8rqw
Containers:
  run-score:
    Image:      gcr.io/openssf/cron:latest
    Port:       <none>
    Host Port:  <none>
    Command:
      /bin/sh
      -c
      ./cron/cron.sh
    Environment:
      GITHUB_AUTH_TOKEN:  <set to the key 'token' in secret 'github'>  Optional: false
      GCS_BUCKET:         ossf-scorecards
      BLOB_URL:           gs://ossf-scorecards-cache
      USE_BLOB_CACHE:     true
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-jhqm8 (ro)

naveensrinivasan commented 3 years ago

@inferno-chromium We have a couple of options for this:

  1. Short-term fix - Change the node type to one with more memory and run scorecard so that we can get cron results again. The last result we have is from Feb 16, 2021. This would give us results our consumers can use, and it doesn't involve much effort beyond changing the k8s node type.

  2. Long-term fix - Track down the memory leak: run pprof on scorecard, fix the leak, test it, then deploy it to prod to run the cron. This involves a lot more effort and time, and during that period scorecard results won't be available to the rest of the consumers.

My suggestion would be to implement option 1 and keep working on option 2, the long-term fix.

What are your thoughts on this?

inferno-chromium commented 3 years ago

> 1. Short-term fix - change the node type to one with more memory so the cron can get results again.
> 2. Long-term fix - find and fix the memory leak with pprof.
>
> My suggestion would be to implement option 1 and keep working on option 2, the long-term fix. What are your thoughts on this?

Yes, just use option 1 forever; we can have any memory-size instance you want. Can you fix the instance type in cron.yaml or wherever?

naveensrinivasan commented 3 years ago

> My suggestion would be to implement option 1 and keep working on option 2, the long-term fix. What are your thoughts on this?
>
> Yes, just use option 1 forever; we can have any memory-size instance you want. Can you fix the instance type in cron.yaml or wherever?

I don't have permission to create a different node type. I would have to get help from @dlorenc with this.

dlorenc commented 3 years ago

This might be easier - it doesn't look like it's using much memory:

Message:        The node was low on resource: memory. Container run-score was using 1785720Ki, which exceeds its request of 0.

1785720Ki is about 1.7GiB.

You might have to set a limit/request on the pod in the cron.
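A minimal sketch of the resources block that could be added to the run-score container in cron.yaml; the numbers are placeholders for discussion, not values taken from the repo.

containers:
  - name: run-score
    resources:
      requests:
        memory: "2Gi"   # placeholder; observed usage was around 1.7-1.8GB
      limits:
        memory: "4Gi"   # placeholder; must fit within a node's allocatable memory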

naveensrinivasan commented 3 years ago

I agree, I could set a higher resource limit on the pod. But the pod was already using 1.8GB of memory, and I don't know whether it will go way over that, or what the peak would be.

The node's default is 2.8GB.

If I set the pod memory max to 2.0GB, the pod is going to be OOM killed, which we don't want right now in the short term.

If I set the pod memory max to 4.0GB, it would not be scheduled because there aren't nodes that can accommodate that memory requirement.

Thoughts?

naveensrinivasan commented 3 years ago

The pods are getting OOM killed even with pod presets of 1Gi memory.

➜  badge git:(feat/scorecard-badge) ✗ k get events
LAST SEEN   TYPE      REASON             OBJECT                                        MESSAGE
3m47s       Warning   UnexpectedJob      cronjob/daily-score                           Saw a job that the controller did not create or forgot: daily-score-manual-5pdwg
19m         Warning   NodeSysctlChange   node/gke-openssf-default-pool-aeaa7a9c-bnbd
14m         Warning   SystemOOM          node/gke-openssf-default-pool-aeaa7a9c-bnbd   System OOM encountered, victim process: scorecard, pid: 1053112
14m         Warning   OOMKilling         node/gke-openssf-default-pool-aeaa7a9c-bnbd   Memory cgroup out of memory: Killed process 1053112 (scorecard) total-vm:2140268kB, anon-rss:1042320kB, file-rss:12820kB, shmem-rss:0kB, UID:0 pgtables:2180kB oom_score_adj:741
9m22s       Warning   OOMKilling         node/gke-openssf-default-pool-aeaa7a9c-bnbd   Memory cgroup out of memory: Killed process 1055912 (scorecard) total-vm:1802200kB, anon-rss:1042520kB, file-rss:12884kB, shmem-rss:0kB, UID:0 pgtables:2168kB oom_score_adj:741
9m22s       Warning   SystemOOM          node/gke-openssf-default-pool-aeaa7a9c-bnbd   System OOM encountered, victim process: scorecard, pid: 1055912
25m         Warning   NodeSysctlChange   node/gke-openssf-default-pool-aeaa7a9c-bx9r
16m         Warning   NodeSysctlChange   node/gke-openssf-default-pool-aeaa7a9c-wg5f
➜  badge git:(feat/scorecard-badge) ✗ k describe po daily-score-manual-5pdwg-9tzsp
Name:         daily-score-manual-5pdwg-9tzsp
Namespace:    default
Priority:     0
Node:         gke-openssf-default-pool-aeaa7a9c-bnbd/10.128.0.30
Start Time:   Thu, 11 Mar 2021 16:50:59 +0000
Labels:       controller-uid=93481296-a298-40f6-9f5d-859899052764
              job-name=daily-score-manual-5pdwg
Annotations:  <none>
Status:       Running
IP:           10.0.3.130
IPs:
  IP:           10.0.3.130
Controlled By:  Job/daily-score-manual-5pdwg
Containers:
  run-score:
    Container ID:  docker://cade9e5af4d4b0c12ca6d8d93f582e4e64ad24820d9f862906a8be110d9fd910
    Image:         gcr.io/openssf/cron:latest
    Image ID:      docker-pullable://gcr.io/openssf/cron@sha256:7373afbae5c6df42faefac54bd7f4a717312325fa755f30211eb0b324da411e1
    Port:          <none>
    Host Port:     <none>
    Command:
      /bin/sh
      -c
      ./cron/cron.sh
    State:          Running
      Started:      Thu, 11 Mar 2021 16:51:02 +0000
    Ready:          True
    Restart Count:  0
    Limits:
      memory:  1Gi
    Requests:
      memory:  1Gi
    Environment:
      GITHUB_AUTH_TOKEN:  <set to the key 'token' in secret 'github'>  Optional: false
      GCS_BUCKET:         ossf-scorecards
      BLOB_URL:           gs://ossf-scorecards-cache
      USE_BLOB_CACHE:     true
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-jhqm8 (ro)
Conditions:
  Type              Status
  Initialized       True
  Ready             True
  ContainersReady   True
  PodScheduled      True
Volumes:
  default-token-jhqm8:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  default-token-jhqm8
    Optional:    false
QoS Class:       Burstable
Node-Selectors:  <none>
Tolerations:     node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                 node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:          <none>

Looks like we will have to increase the node size to get more memory and remove the pod memory presets.
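If a larger node pool were added, the Job's pod template could also be pinned to it and the 1Gi memory preset dropped. A sketch, assuming a hypothetical high-memory pool; the label key is the standard GKE node-pool label, but the pool name is made up.

spec:
  nodeSelector:
    cloud.google.com/gke-nodepool: highmem-pool   # hypothetical pool name
  containers:
    - name: run-score
      resources: {}   # drop the 1Gi preset, or raise the limit well above the observed ~1.8GB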

naveensrinivasan commented 3 years ago


It fails with this repo: github.com/ApolloAuto/apollo. Thanks @dlorenc for the tip.