Closed naveensrinivasan closed 3 years ago
@inferno-chromium FYI...
Updated the cron job to use the blobcache. The pods are still getting terminated even with USE_BLOB_CACHE. The scorecard is using about 1.7 GB of memory, which points to a memory leak.
k describe po daily-score-manual-f8rqw-dqbgx
Name: daily-score-manual-f8rqw-dqbgx
Namespace: default
Priority: 0
Node: gke-openssf-default-pool-aeaa7a9c-bnbd/
Start Time: Thu, 11 Mar 2021 14:40:58 +0000
Labels: controller-uid=723d5a7e-ef40-4e94-920f-df1362a64664
job-name=daily-score-manual-f8rqw
Annotations: <none>
Status: Failed
Reason: Evicted
Message: The node was low on resource: memory. Container run-score was using 1785720Ki, which exceeds its request of 0.
IP:
IPs: <none>
Controlled By: Job/daily-score-manual-f8rqw
Containers:
run-score:
Image: gcr.io/openssf/cron:latest
Port: <none>
Host Port: <none>
Command:
/bin/sh
-c
./cron/cron.sh
Environment:
GITHUB_AUTH_TOKEN: <set to the key 'token' in secret 'github'> Optional: false
GCS_BUCKET: ossf-scorecards
BLOB_URL: gs://ossf-scorecards-cache
USE_BLOB_CACHE: true
Mounts:
/var/run/secrets/kubernetes.io/serviceaccount from default-token-jhqm8 (ro)
@inferno-chromium We have a couple of options for this:
- Short-term fix: Change the node type to one with higher memory and run the scorecard cron so that we can get results again. The last result we have is from Feb 16, 2021. This gets results that our consumers can use, and it does not involve much effort beyond changing the k8s node type.
- Long-term fix: Figure out the memory leak: run pprof on scorecard, address the leak, test it, and then deploy it to prod to run the cron. This involves a lot more effort and time, and during that period scorecard results won't be available to the rest of the consumers.

My suggestion would be to implement the first option now and keep working on the second, the long-term fix.
What are your thoughts on this?
Yes, just use option 1 forever; we can have any memory-size instance you want. Can you fix the instance type in cron.yaml or wherever?
I don't have permission to create a different node type; I would need help from @dlorenc with this.
This might be easier - it doesn't look like it's using much memory:
Message: The node was low on resource: memory. Container run-score was using 1785720Ki, which exceeds its request of 0.
1785720Ki is only about 1.7 GB.
You might have to set a limit/request on the pod in the cron.
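For reference, a minimal sketch of what that could look like in the cron job manifest; the file name (cron.yaml), the CronJob layout, and the schedule are assumptions, while the container name, image, and command are taken from the pod description above:

```yaml
# Sketch only -- assumes the cron job is defined as a CronJob in cron.yaml.
apiVersion: batch/v1beta1          # CronJob API group at the time; adjust to the cluster version
kind: CronJob
metadata:
  name: daily-score
spec:
  schedule: "0 0 * * *"            # assumed daily schedule
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: OnFailure
          containers:
            - name: run-score
              image: gcr.io/openssf/cron:latest
              command: ["/bin/sh", "-c", "./cron/cron.sh"]
              resources:
                requests:
                  memory: "1Gi"    # what the scheduler reserves on the node
                limits:
                  memory: "1Gi"    # hard cgroup cap; exceeding it OOM-kills the container
```

With no request set, the eviction message above reads "exceeds its request of 0", which makes the container one of the first eviction candidates when the node runs low on memory.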
I agree, I could set a higher resource limit on the pod, but the pod was already using 1.8 GB of memory, and I don't know how much higher it will go.
The node's default memory is 2.8 GB.
If I set the pod's memory limit to 2.0 GB, the pod is going to be OOM-killed, which we don't want in the short term.
If I set the pod's memory limit to 4.0 GB, it would not be scheduled because there aren't nodes that can accommodate that memory requirement.
Thoughts?
The pods are getting OOM-killed with pod presets of 1Gi memory.
➜ badge git:(feat/scorecard-badge) ✗ k get events
LAST SEEN TYPE REASON OBJECT MESSAGE
3m47s Warning UnexpectedJob cronjob/daily-score Saw a job that the controller did not create or forgot: daily-score-manual-5pdwg
19m Warning NodeSysctlChange node/gke-openssf-default-pool-aeaa7a9c-bnbd
14m Warning SystemOOM node/gke-openssf-default-pool-aeaa7a9c-bnbd System OOM encountered, victim process: scorecard, pid: 1053112
14m Warning OOMKilling node/gke-openssf-default-pool-aeaa7a9c-bnbd Memory cgroup out of memory: Killed process 1053112 (scorecard) total-vm:2140268kB, anon-rss:1042320kB, file-rss:12820kB, shmem-rss:0kB, UID:0 pgtables:2180kB oom_score_adj:741
9m22s Warning OOMKilling node/gke-openssf-default-pool-aeaa7a9c-bnbd Memory cgroup out of memory: Killed process 1055912 (scorecard) total-vm:1802200kB, anon-rss:1042520kB, file-rss:12884kB, shmem-rss:0kB, UID:0 pgtables:2168kB oom_score_adj:741
9m22s Warning SystemOOM node/gke-openssf-default-pool-aeaa7a9c-bnbd System OOM encountered, victim process: scorecard, pid: 1055912
25m Warning NodeSysctlChange node/gke-openssf-default-pool-aeaa7a9c-bx9r
16m Warning NodeSysctlChange node/gke-openssf-default-pool-aeaa7a9c-wg5f
➜ badge git:(feat/scorecard-badge) ✗ k describe po daily-score-manual-5pdwg-9tzsp
Name: daily-score-manual-5pdwg-9tzsp
Namespace: default
Priority: 0
Node: gke-openssf-default-pool-aeaa7a9c-bnbd/10.128.0.30
Start Time: Thu, 11 Mar 2021 16:50:59 +0000
Labels: controller-uid=93481296-a298-40f6-9f5d-859899052764
job-name=daily-score-manual-5pdwg
Annotations: <none>
Status: Running
IP: 10.0.3.130
IPs:
IP: 10.0.3.130
Controlled By: Job/daily-score-manual-5pdwg
Containers:
run-score:
Container ID: docker://cade9e5af4d4b0c12ca6d8d93f582e4e64ad24820d9f862906a8be110d9fd910
Image: gcr.io/openssf/cron:latest
Image ID: docker-pullable://gcr.io/openssf/cron@sha256:7373afbae5c6df42faefac54bd7f4a717312325fa755f30211eb0b324da411e1
Port: <none>
Host Port: <none>
Command:
/bin/sh
-c
./cron/cron.sh
State: Running
Started: Thu, 11 Mar 2021 16:51:02 +0000
Ready: True
Restart Count: 0
Limits:
memory: 1Gi
Requests:
memory: 1Gi
Environment:
GITHUB_AUTH_TOKEN: <set to the key 'token' in secret 'github'> Optional: false
GCS_BUCKET: ossf-scorecards
BLOB_URL: gs://ossf-scorecards-cache
USE_BLOB_CACHE: true
Mounts:
/var/run/secrets/kubernetes.io/serviceaccount from default-token-jhqm8 (ro)
Conditions:
Type Status
Initialized True
Ready True
ContainersReady True
PodScheduled True
Volumes:
default-token-jhqm8:
Type: Secret (a volume populated by a Secret)
SecretName: default-token-jhqm8
Optional: false
QoS Class: Burstable
Node-Selectors: <none>
Tolerations: node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events: <none>
Looks like we would have to increase the node size to have larger memory and remove the pod presets for memory.
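Building on the sketch above, one way to express that (a hedged option, not necessarily what was deployed) is to keep a memory request so scheduling still accounts for the job, but drop the hard limit that triggers the cgroup OOM kill, assuming the node pool is moved to a higher-memory machine type:

```yaml
# Sketch: same run-score container as in the earlier sketch, but on a larger node pool.
resources:
  requests:
    memory: "2Gi"   # reserve headroom on the (larger) node
  # no memory limit here, so the job can use what it needs until the leak is fixed
```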
Fails with this repo: github.com/ApolloAuto/apollo. Thanks @dlorenc for the tip.
Describe the bug
The scorecard cron job is evicted because the node was low on memory. The existing cron job didn't use the blobcache and was using an in-memory cache.
Expected behavior
Use the blobcache, which was merged in #261.
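For context, switching the cron job from the in-memory cache to the blob cache is driven by environment variables on the run-score container. The values below are taken from the pod descriptions above; the manifest layout itself is assumed:

```yaml
# Sketch of the env block on the run-score container (values from the pod description above).
env:
  - name: GITHUB_AUTH_TOKEN
    valueFrom:
      secretKeyRef:
        name: github
        key: token
  - name: GCS_BUCKET
    value: ossf-scorecards
  - name: BLOB_URL
    value: gs://ossf-scorecards-cache
  - name: USE_BLOB_CACHE
    value: "true"
```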