mmariani opened 5 years ago
This is mostly an artifact of how you do monitoring. We use Prometheus, which is pull based and therefore suffers from exactly the issues you describe. Longer term, this is something the monitoring stack needs to address no matter what, both for its own purposes and for ours. We're aware of this issue and are working with the OpenShift monitoring team to discuss how Prometheus can handle it better.
In those discussions we've identified a few things that might help:
I've also tried to think of reasonable ways to use the Pushgateway, but haven't come up with anything workable yet.
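For context, the Pushgateway approach would have each short-lived job push its own metrics rather than wait to be scraped. A minimal sketch using only the standard library, where the gateway address and the `job_duration_seconds` metric are hypothetical, and the actual network call is left commented out:

```python
import urllib.request


def exposition(metric: str, value: float, help_text: str) -> str:
    # Prometheus text exposition format; the trailing newline is required.
    return (f"# HELP {metric} {help_text}\n"
            f"# TYPE {metric} gauge\n"
            f"{metric} {value}\n")


def push(gateway: str, job: str, body: str) -> None:
    # PUT to /metrics/job/<job> replaces all metrics for that job
    # grouping on the gateway.
    req = urllib.request.Request(
        f"{gateway}/metrics/job/{job}",
        data=body.encode("utf-8"),
        method="PUT",
    )
    urllib.request.urlopen(req)


body = exposition("job_duration_seconds", 12.5,
                  "Wall-clock runtime of the batch job.")
# push("http://pushgateway.example:9091", "short_batch", body)  # hypothetical address
print(body)
```

The catch, as noted above, is that pushed series outlive the job and must be deleted or aged out explicitly, which is part of why no approach here has felt workable yet.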
It is my understanding that a pod's metrics are not retained after it dies, so jobs that run for a short time (i.e., less than the scrape interval) are likely to slip under the radar of any current metering effort. By contrast, public cloud platforms can provide precise accounting and throttling of resources (CPU credits), though I suppose at the expense of running customized hypervisors.
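To put a rough number on that: with scrapes at a fixed interval and a pod's start time uniformly random relative to the scrape schedule, the chance that the pod is observed at all is at most its lifetime divided by the interval. A small illustrative sketch (the 30 s interval is an assumed configuration, not anything from this thread):

```python
def scrape_probability(lifetime_s: float, interval_s: float) -> float:
    # With one scrape every interval_s seconds and a uniformly random
    # phase, a pod alive for lifetime_s seconds is scraped at least
    # once with probability min(1, lifetime_s / interval_s).
    return min(1.0, lifetime_s / interval_s)


for lifetime in (5, 15, 30, 60):
    print(f"{lifetime:>3}s pod, 30s interval -> "
          f"P(seen) = {scrape_probability(lifetime, 30):.2f}")
```

So under these assumptions a 15-second job is missed entirely about half the time, and even when it is seen, only a single sample of its counters survives.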
Today, a customer of a container platform can run a workload as a large number of small jobs and have its resource usage underestimated, am I right? Do you know whether this is a limitation Kubernetes may overcome, and whether there is any documented effort in that direction? I'm aware this may depend on the container runtime as well.
Thanks