roobre opened this issue 3 years ago
We are seeing a similar behaviour while running Version Checker. Would be interested to know if there are recommended values for the limits?
Also seeing something similar with Version Checker getting OOM killed fairly frequently.
Hey @Trede1983 @trastle @roobre,
Sorry it's taken so long to get back to you on this issue... There has been some work on version-checker since these issues were raised, attempting to reduce its memory footprint.
Things like this are extremely challenging to debug and replicate, so it would be amazing to know how many nodes/pods you had in the cluster at the time of this issue, along with the memory/CPU limits/requests you had/have set.
I appreciate that this may have been some time ago, and that you may no longer be using version-checker; however, this information could be really helpful for us to further understand the memory footprint in larger installations.
In terms of tuning and changes, the main one that comes to mind is #160, along with the already-mentioned #69. Beyond that, not setting `--test-all-containers` and instead adding the `enable.version-checker.io/*my-container*` annotations to the pods you care about, together with the `--image-cache-timeout` CLI argument, should also help.
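For reference, the opt-in approach looks roughly like this; the pod name, container name and image below are placeholders, not values from this issue:

```yaml
# Sketch only: opting a single pod in, rather than scanning everything
# with --test-all-containers. All names and images here are placeholders.
apiVersion: v1
kind: Pod
metadata:
  name: example-app
  annotations:
    # The annotation key suffix must match the container name in spec.containers
    enable.version-checker.io/my-container: "true"
spec:
  containers:
    - name: my-container
      image: registry.example.com/example-app:1.2.3
```

The cache duration is then controlled by the `--image-cache-timeout` argument on the version-checker deployment itself (exposed as `versionChecker.imageCacheTimeout` in the Helm chart).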
Hello @davidcollom
I'm also encountering this issue. My test cluster is pretty small:
The `--test-all-containers` flag is set, and only two pods have the `enable.version-checker.io/*my-container*: false` annotation to disable verification (they come from a private registry I haven't configured yet).
I also defined `use-sha.version-checker.io`, `match-regex.version-checker.io` and `override-url.version-checker.io` on a bunch of pods, as some images come from a registry proxy or have "fake" versions (like grafana).
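For reference, those annotations end up looking roughly like this (shown together here for brevity, although in practice they're spread across different pods; the container name, regex and registry URL are made-up examples rather than my actual values):

```yaml
# Illustrative only: all names, URLs and the regex are placeholders.
metadata:
  annotations:
    # Skip checking a container that lives in a not-yet-configured private registry
    enable.version-checker.io/my-container: "false"
    # Compare image digests instead of tags
    use-sha.version-checker.io/my-container: "true"
    # Only treat semver-looking tags as candidate "latest" versions
    match-regex.version-checker.io/my-container: '^v?\d+\.\d+\.\d+$'
    # Look the image up upstream instead of in the registry proxy
    override-url.version-checker.io/my-container: docker.io/grafana/grafana
```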
Version checker is the latest (0.7.0) and installed using helm with the following values:
```yaml
replicaCount: 1

versionChecker:
  imageCacheTimeout: 30m
  testAllContainers: true

resources:
  # limits:
  #   memory: 128Mi
  requests:
    cpu: 10m
    memory: 128Mi

# This is a temporary fix until the following PR is merged:
# https://github.com/jetstack/version-checker/pull/227
ghcr:
  token: xxxx

securityContext:
  allowPrivilegeEscalation: false
  capabilities:
    drop:
      - ALL
  readOnlyRootFilesystem: true
  runAsNonRoot: true
  runAsUser: 65534
  seccompProfile:
    type: RuntimeDefault

serviceMonitor:
  enabled: true
```
If I set `resources.limits.memory`, version-checker is OOMKilled every ~6h. I haven't tried running it for more than a day without the limit, but I assume it will keep growing.
Here is a graph showing the memory usage over time:
Hello,
With version 0.8.2 the issue still persists. I tried adding the following to the values:
```yaml
env:
  - name: GOMEMLIMIT
    valueFrom:
      resourceFieldRef:
        divisor: "0"
        resource: limits.memory
```
It reduces the frequency of OOMKills to about one per day instead of one every ~6h, but doesn't solve the issue.
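One thing worth trying (I haven't yet) is setting `GOMEMLIMIT` explicitly a bit below the container limit: since it is only a soft limit for the Go runtime, pointing it at the exact container limit leaves no headroom before the kernel OOM killer fires. The value here is illustrative, assuming a 128Mi container limit like the one discussed above:

```yaml
# Illustrative value only: leave headroom below a 128Mi container limit so the
# Go GC becomes aggressive before the kernel OOM killer steps in.
env:
  - name: GOMEMLIMIT
    value: "100MiB"
```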
I am running `version-checker` on a single-node, quite small cluster with ~60 pods. So far it is working nicely, but I do not understand its memory behavior.

I'm basically running the sample deployment file, plus the `--test-all-containers` flag and some CPU limits (`kubectl get pod -o yaml`); a rough sketch of the relevant part of the pod spec is below.
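Roughly, the relevant part of the container spec is the following; the resource numbers are illustrative placeholders rather than my exact values:

```yaml
# Rough sketch of the version-checker container spec.
# Resource numbers are illustrative placeholders, not the exact values in use.
containers:
  - name: version-checker
    args:
      - --test-all-containers
    resources:
      limits:
        cpu: 100m
        memory: 64Mi
```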
Over time, I see that `version-checker` approaches the memory limit and then stays near ~99% of it for a while. After some time, the kernel kills the container due to OOM and Kubernetes restarts the pod.

However, I do not see anything alarming in the logs, other than some failures and expected permission errors.
This doesn't seem to have any functional impact, but does fire some alerts and doesn't look good on my dashboards :)
Is this behavior intended, and/or is there any way to prevent it?