CPU usage graphs not useful at a glance

abrenneke commented 6 years ago

Environment

Dashboard version: gcr.io/google_containers/kubernetes-dashboard-amd64:v1.7.1
Kubernetes version: 1.8
Operating system: CentOS 7 (hosts)

I recently installed Heapster et al. alongside our Kubernetes dashboard, and our ops guy had some feedback on the usefulness of the CPU graphs: beyond showing whether a single pod's CPU usage fluctuates, they don't tell you anything useful when you compare pods with each other.

He provided these images to explain:

[Image: chart_for_pods]

[Image: chart_for_general_cpu_usage]

Node0 commented 6 years ago

The graph thumbnails all appear to be scaled on the y-axis relative to their own time window. Each pod's graph therefore reports its CPU history relative only to its own last N minutes, not relative to the total compute available on that node. When the pods are stacked in rows as shown, the graphs immediately look comparable (graph affordance), since they all share the same shape language, but that is completely misleading because each one is scaled to its own last N minutes. This makes comparing the pods on a machine to see which is busiest in terms of CPU essentially meaningless, which defeats the whole point. All that is left is a faint idea of how busy a given pod has been relative to itself, and even that is limited, because we can't see back before the thumbnail's time window. Not useful.

The memory metrics are somewhat useful, but again the graphs appear to be scaled relative to themselves, which makes comparing memory consumption between pods equally difficult.
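To make the scaling complaint concrete, here is a minimal sketch, assuming each sparkline is normalized to its own peak. The pod names and millicore values are made up for illustration, and this is not the dashboard's rendering code.

```python
# Hypothetical CPU samples in millicores for two pods on the same node.
usage = {
    "idle-pod": [1, 2, 1, 2],
    "busy-pod": [400, 800, 400, 800],
}


def scale_to_own_max(series):
    """Normalize one series to its own peak (per-graph y-axis)."""
    peak = max(series)
    return [v / peak if peak else 0.0 for v in series]


def scale_to_shared_max(all_series):
    """Normalize every series to the largest peak among them (shared y-axis)."""
    peak = max(max(s) for s in all_series.values())
    return {
        name: [v / peak if peak else 0.0 for v in s]
        for name, s in all_series.items()
    }


own = {name: scale_to_own_max(s) for name, s in usage.items()}
shared = scale_to_shared_max(usage)

# With per-graph scaling both pods draw the identical shape.
print(own["idle-pod"] == own["busy-pod"])
# With a shared axis the idle pod stays near the bottom of the chart.
print(max(shared["idle-pod"]), max(shared["busy-pod"]))
```

With per-graph scaling the two thumbnails are indistinguishable even though one pod uses 400 times more CPU than the other.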

The numerical memory displays at least have proper units, which allows a quick sizing-up of memory consumption on the node. The numerical display for processor utilization, however, at best has its decimal point a couple of places to the left of where it should be, and it is missing the % symbol, if these are in fact percentages of the total CPU resources available on the node. Can you clarify what those per-pod CPU metrics actually represent?

How can we configure heapster to display this information more meaningfully? Was this intentional? If so, please explain the thinking behind it, because at present (beyond the memory metrics) it does not meet the expectations of a production environment.

Is this functionality yet to be implemented, or is it not planned at all? If it's the latter, can you point us to where these calculations take place so we can fix this in the code?

Please advise. -- Joe Hacobian.

cheld commented 6 years ago

Wow, great feedback. We will have to think about how to deal with it. A quick comment:

The CPU metric is in millicores, as everywhere in Kubernetes. We normalize to cores, so instead of 50.000 we display 50. Larger numbers get prefixes like k and M.

https://github.com/kubernetes/dashboard/blob/master/src/app/frontend/common/filters/cores.js
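As a rough illustration of the conversion described above, here is a minimal Python sketch. It is not the code from the linked `cores.js` filter; the function name `format_cores` and the exact rounding and prefix thresholds are assumptions made for this example.

```python
def format_cores(millicores: int) -> str:
    """Format a millicore value as cores, adding k/M prefixes for large values.

    Hypothetical helper for illustration; the real filter may round differently.
    """
    cores = millicores / 1000
    for threshold, prefix in ((1e6, "M"), (1e3, "k")):
        if cores >= threshold:
            return f"{cores / threshold:g}{prefix}"
    return f"{cores:g}"


print(format_cores(113))        # 113 millicores shown in cores
print(format_cores(300))        # 300 millicores shown in cores
print(format_cores(2_500_000))  # 2500 cores shown with a k prefix
```

The value next to a sparkline is therefore a number of cores, not a percentage, which is why it has no % symbol.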

In terms of the chart y-axis, I can imagine that the thumbnail is not helpful without reading the numbers, so the thumbnails partly miss their purpose. It would probably make sense to scale the y-axis to the quota of the resource. For example, if a resource has a quota of 1.000 cores and is running at 500.000 millicores, then the graph could be scaled to 50%. This is just an idea and needs more consideration. (Also, using quotas is recommended but optional.)

A general UX problem might be the mental jump from infrastructure-centric to cluster-centric.

CC @raulhide, @danielromlein

abrenneke commented 6 years ago

Scaling the y-axis to the request or limit of the pods would definitely be an improvement. Each container has its own request and limit, though, so that might be misleading: for two containers with equal limits, 50% of the total limit might mean one container is at 100% of its limit and the other at 0%, which is still unhelpful. Larger aggregations like deployments would blur that even more.
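A small sketch of that pitfall, using made-up numbers: two containers with equal limits, one saturated and one idle. The field names `usage_m` and `limit_m` are hypothetical and only used for this example.

```python
# Hypothetical pod with two containers, values in millicores.
containers = [
    {"name": "a", "usage_m": 500, "limit_m": 500},
    {"name": "b", "usage_m": 0, "limit_m": 500},
]


def pod_percent(containers):
    """Usage as a percentage of the summed container limits."""
    total_usage = sum(c["usage_m"] for c in containers)
    total_limit = sum(c["limit_m"] for c in containers)
    return 100 * total_usage / total_limit


def container_percents(containers):
    """Usage of each container as a percentage of its own limit."""
    return {c["name"]: 100 * c["usage_m"] / c["limit_m"] for c in containers}


print(pod_percent(containers))         # aggregate view of the pod
print(container_percents(containers))  # per-container view
```

The aggregate reads as a comfortable 50% while container `a` is already at its limit.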

The relative CPU usage between two pods still wouldn't be something you could see - though I'm not sure if Kubernetes as a whole wants to make that concept moot.

Could you explain infrastructure-centric vs cluster-centric a bit more?

cheld commented 6 years ago

> Could you explain infrastructure-centric vs cluster-centric a bit more?

In other words, what could we do to help first-time users feel comfortable quickly?

> The relative CPU usage between two pods still wouldn't be something you could see

Of course, this only makes sense if the pods are located on the same node. I could imagine it only as extra information on the node detail page.

rahuldhide commented 6 years ago

I have created this design proposal to address the UX issues discussed in this thread. Let me know your feedback.

[Image: pods_ux proposal]

floreks commented 6 years ago

Really nice feedback and very detailed. I do agree with some points, mainly that it's not easy to drill down and find the cause of a problem. The short metrics timeframe does not really help with troubleshooting. On the other hand, we have to fit into the Kubernetes architecture and the information provided by metric providers (i.e. heapster). It might not be possible to show a uniform percentage value in some cases.

CPU Overview graph (top one)

The main difficulty here is that, to be consistent, we need a way to show usage across containers that may be deployed on different nodes with different numbers of CPU cores. Kubernetes has chosen to use cores as the unit here. In the screenshot from the first post, we can see that the graph scales up to 0.113 (cores). This means that peak CPU usage across nodes in the given time window was 113 millicores. That is roughly 11.3% of a single core, but spread across all the cores taken into consideration. It would be really hard to determine the actual percentage usage of cores spread across multiple nodes here. This is also the way kubectl top displays metrics.

Currently, the top graphs show the summed usage of the resources visible on the given page. That might be resources from a single namespace or from all namespaces. On the node list/detail page, there are graphs showing actual node utilization metrics taken from heapster.

CPU sparklines (pod-level)

Showing percentage usage here might also be difficult. There are a couple of options that quickly come to mind.

Usage scaled based on defined pod limits. Pod can use up to 300m (cores). Current usage is 30m. We could show 10%.

Pod limit is not defined. We have to scale usage based on the limits of the node (number of cores). This can be done in two ways. Consider a case where we have 4 available cores and a pod is using 300 millicores. Scaling based on all available cores would result in 7.5% usage. Scaling based on 1 core would result in 30% usage. The second option would show over 100% if a pod used more than 1 core. This is actually how the top command works.
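The arithmetic behind these options can be sketched as follows. The function names are hypothetical and only illustrate the three scalings discussed above; they are not part of Dashboard or heapster.

```python
def percent_of_limit(usage_m: int, limit_m: int) -> float:
    """Usage as a percentage of the pod's defined CPU limit (both in millicores)."""
    return 100 * usage_m / limit_m


def percent_of_node(usage_m: int, node_cores: int) -> float:
    """Usage as a percentage of all cores available on the node."""
    return 100 * usage_m / (node_cores * 1000)


def percent_of_one_core(usage_m: int) -> float:
    """Usage as a percentage of a single core; can exceed 100%."""
    return 100 * usage_m / 1000


print(percent_of_limit(30, 300))   # pod limit 300m, usage 30m
print(percent_of_node(300, 4))     # 300m on a 4-core node
print(percent_of_one_core(300))    # 300m relative to one core
print(percent_of_one_core(1500))   # a pod using 1.5 cores
```

The same 300 millicores reads as 7.5% or 30% depending on the denominator, so whichever scaling is chosen would need to be labeled clearly in the UI.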

Right now Dashboard is pretty much a raw reflection of what heapster offers. We can ask for cpu/mem usage of a given pod or of a node. Results are in either cores (CPU) or bytes (mem).

Sparklines and graphs were intended to show whether there are any big changes in cpu/mem usage over this short period of time (15 min). If a sparkline is fully colored, it means that usage is steady and did not change. Additionally, the usage in cores is displayed nearby, so the user can see that this pod uses, for example, 0.3 cores (30% of a single core on the node).

I agree that this needs some more work. However, some user expectations might be misleading and impossible to implement, which could cause even more confusion. The first step is to understand what information is available (through the API) and how it relates to what is shown in the UI (graphs, sparklines, metrics).

rahuldhide commented 6 years ago

@floreks I like your ideas about percentages and calculations.

We may not have all the data and logic implemented currently, but let's think about how we can make these graphs more useful so that users feel confident while using the system. We need to define the story that the UI should tell, or the questions that the UI needs to answer through the data visualization. Since the context in this case is Pods, we can focus on:

What is the status of my pods?
What is the status of my containers?
Do I have any containers that are consuming excess resources (anomalies)?
Do I have any containers that require additional resources?

We can expand this list further and choose meaningful information from the available dataset to answer these questions. I would prefer to add this issue to the UX improvements, as it will require further discussion and design iterations.

rahuldhide commented 6 years ago

@floreks using limits and requests for containers is a common practice. Users specify these resource specs to make sure that containers run properly. We can use the sum of the container limits together with current usage for the dataviz. I think we can show both a percentage and cores/bytes.

If the limit is not specified for a Container, then chart rendering may become complicated, because:

The Container could use all of the memory/CPU available on the Node where it is running
The Container could use the default limits, specified for the namespace.
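One way to picture the two cases is a fallback chain for choosing the denominator. This is only a sketch: the function name, the parameters, and the order of preference (container limit, then namespace default, then node capacity) are assumptions for illustration, not existing Dashboard behavior.

```python
from typing import Optional


def effective_cpu_limit_m(
    container_limit_m: Optional[int],
    namespace_default_m: Optional[int],
    node_capacity_m: int,
) -> int:
    """Pick the CPU value (in millicores) to scale a chart against.

    Assumed order: explicit container limit, then the namespace default
    limit, then the capacity of the node the container runs on.
    """
    if container_limit_m is not None:
        return container_limit_m
    if namespace_default_m is not None:
        return namespace_default_m
    return node_capacity_m


print(effective_cpu_limit_m(300, 500, 4000))    # explicit limit wins
print(effective_cpu_limit_m(None, 500, 4000))   # falls back to namespace default
print(effective_cpu_limit_m(None, None, 4000))  # falls back to node capacity
```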

Thoughts?
