newrelic / newrelic-quickstarts

New Relic One quickstarts help accelerate your New Relic journey by providing immediate value for your specific use cases.
https://newrelic.com/instant-observability/
Apache License 2.0

Nodes chart query in Kubernetes dashboard needs to also select from K8sNodeSample #2363

Open alaiavee opened 6 months ago

alaiavee commented 6 months ago

Issue Summary

Nodes chart query in Kubernetes dashboard needs to also select from K8sNodeSample in order to return the node metrics being selected.

Description

The Nodes chart in the Kubernetes dashboard queries node metrics in addition to pod metrics, but it currently selects only from K8sPodSample. To return the selected memory, CPU, and filesystem utilization percentages, the query also needs to select from K8sNodeSample. As written, the query reports only the running and pending pod values; the node-level metrics are not reported.

Code link: https://github.com/newrelic/newrelic-quickstarts/blob/6843ec92b598c4f1b65469b0fc8827c383292350/dashboards/kubernetes/kubernetes.json#L586

Steps to Reproduce

Expected Behavior

Because the query selects node-level metrics, it is expected to return them, which means it needs to select from K8sNodeSample.

This can be done in a couple of ways.

1) In the main query, select from both K8sPodSample and K8sNodeSample, then wrap each selected value in filter() and filter by eventType(). For example, the 'Node Capacity and Utilization' chart in the Kubernetes Cluster Overview uses this method with the following query:

FROM K8sNodeSample, K8sPodSample
SELECT filter(latest(allocatablePods), WHERE eventType() = 'K8sNodeSample') AS 'Allocatable Pods'
  , filter(uniqueCount(podName), WHERE eventType() = 'K8sPodSample' AND status = 'Running' AND createdKind != 'Job') AS 'Running Pods'
  , filter(uniqueCount(podName), WHERE eventType() = 'K8sPodSample' AND status = 'Pending' AND createdKind != 'Job') AS 'Pending Pods'
  , filter(uniqueCount(podName), WHERE eventType() = 'K8sPodSample' AND status = 'Running' AND createdKind != 'Job') / filter(latest(allocatablePods), WHERE eventType() = 'K8sNodeSample') * 100 AS 'Pod Capacity %'
  , filter(average(allocatableCpuCoresUtilization), WHERE eventType() = 'K8sNodeSample') AS 'Avg. CPU %'
  , filter(average(allocatableMemoryUtilization), WHERE eventType() = 'K8sNodeSample') AS 'Avg. Mem %'
  , filter(max(fsCapacityUtilization), WHERE eventType() = 'K8sNodeSample') AS 'Max. FS Util %'
FACET if(nodeName != '', nodeName, 'NoNodeAssigned') AS 'Node Name' LIMIT 2000
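
To illustrate why the FROM clause matters here, the following is a rough Python model of the behavior, not NRQL and not the dashboard's actual implementation. The sample events and the nodes_chart helper are made up for this sketch; only the attribute names (eventType, nodeName, podName, status, allocatableCpuCoresUtilization) come from the query above. When only K8sPodSample is selected, the node-level aggregate has nothing to work with; adding K8sNodeSample and splitting on event type fills it in.

```python
from collections import defaultdict
from statistics import mean

# Made-up sample events; attribute names follow the query above.
EVENTS = [
    {"eventType": "K8sPodSample", "nodeName": "node-a", "podName": "web-1", "status": "Running"},
    {"eventType": "K8sPodSample", "nodeName": "node-a", "podName": "web-2", "status": "Pending"},
    {"eventType": "K8sNodeSample", "nodeName": "node-a", "allocatableCpuCoresUtilization": 40.0},
    {"eventType": "K8sNodeSample", "nodeName": "node-a", "allocatableCpuCoresUtilization": 60.0},
]


def nodes_chart(events, from_types):
    """Aggregate per node, looking only at the event types named in FROM."""
    rows = defaultdict(lambda: {"running": set(), "pending": set(), "cpu": []})
    for event in events:
        if event["eventType"] not in from_types:
            continue  # events outside the FROM clause are never seen
        row = rows[event.get("nodeName") or "NoNodeAssigned"]
        if event["eventType"] == "K8sPodSample":
            # Plays the role of filter(..., WHERE eventType() = 'K8sPodSample' ...)
            if event["status"] == "Running":
                row["running"].add(event["podName"])
            elif event["status"] == "Pending":
                row["pending"].add(event["podName"])
        elif event["eventType"] == "K8sNodeSample":
            # Plays the role of filter(..., WHERE eventType() = 'K8sNodeSample')
            row["cpu"].append(event["allocatableCpuCoresUtilization"])
    return {
        node: {
            "Running Pods": len(row["running"]),
            "Pending Pods": len(row["pending"]),
            "Avg. CPU %": mean(row["cpu"]) if row["cpu"] else None,
        }
        for node, row in rows.items()
    }


# Pod-only selection: pod counts are present, the node metric is empty.
print(nodes_chart(EVENTS, {"K8sPodSample"}))
# Selecting both event types: the node metric is populated as well.
print(nodes_chart(EVENTS, {"K8sPodSample", "K8sNodeSample"}))
```
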

2) Use a subquery join.

Example:

SELECT filter(uniqueCount(podName), WHERE (status = 'Running')) AS `Running Pods`
  , filter(uniqueCount(podName), WHERE (status = 'Pending')) AS `Pending Pods`
  , ((average(`k8s`.`node.cpuUsedCores`) / average(`k8s`.`node.allocatableCpuCores`)) * 100) AS `CPU %`
  , ((average(`k8s`.`node.memoryWorkingSetBytes`) / average(`k8s`.`node.allocatableMemoryBytes`)) * 100) AS `Mem %`
  , ((average(`k8s`.`node.fsUsedBytes`) / average(`k8s`.`node.fsCapacityBytes`)) * 100) AS `Disk Util %` 
FROM K8sPodSample LEFT JOIN (
  FROM K8sNodeSample 
  SELECT latest(cpuUsedCores) AS `k8s.node.cpuUsedCores`
    , latest(allocatableCpuCores) AS `k8s.node.allocatableCpuCores`
    , latest(memoryWorkingSetBytes) AS `k8s.node.memoryWorkingSetBytes`
    , latest(allocatableMemoryBytes) AS `k8s.node.allocatableMemoryBytes`
    , latest(fsUsedBytes) AS `k8s.node.fsUsedBytes`
    , latest(fsCapacityBytes) AS `k8s.node.fsCapacityBytes`
  FACET nodeName, entityGuid LIMIT MAX
  ) ON nodeName
WHERE (NOT (createdAt IS NULL) AND NOT (nodeName IS NULL))
SINCE 30 MINUTES AGO 
FACET nodeName LIMIT 2000
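
The join approach can be pictured with a similar toy model in Python. This is a simplified sketch under stated assumptions, not how NRQL evaluates joins: the sample data and both helper functions are hypothetical, the timestamp attribute is assumed to order the node samples, and only the CPU percentage is computed. The inner function stands in for the subquery (latest node values per nodeName); the outer one keeps every pod row even when no node sample matches, as a LEFT JOIN would.

```python
# Made-up samples; attribute names follow the query above, "timestamp" is assumed.
PODS = [
    {"nodeName": "node-a", "podName": "web-1", "status": "Running"},
    {"nodeName": "node-a", "podName": "web-2", "status": "Pending"},
    {"nodeName": "node-b", "podName": "db-1", "status": "Running"},
    {"nodeName": None, "podName": "orphan", "status": "Pending"},
]
NODES = [
    {"nodeName": "node-a", "timestamp": 1, "cpuUsedCores": 1.0, "allocatableCpuCores": 4.0},
    {"nodeName": "node-a", "timestamp": 2, "cpuUsedCores": 2.0, "allocatableCpuCores": 4.0},
]


def latest_per_node(node_samples):
    """Stand-in for the subquery: keep the most recent sample per nodeName."""
    latest = {}
    for sample in sorted(node_samples, key=lambda s: s["timestamp"]):
        latest[sample["nodeName"]] = sample  # later samples overwrite earlier ones
    return latest


def pods_left_join_nodes(pod_samples, node_samples):
    """Stand-in for the outer query: pod counts joined to node metrics by nodeName."""
    nodes = latest_per_node(node_samples)
    rows = {}
    for pod in pod_samples:
        name = pod.get("nodeName")
        if name is None:
            continue  # mirrors the NOT (nodeName IS NULL) condition
        row = rows.setdefault(name, {"Running": set(), "Pending": set()})
        if pod["status"] in row:
            row[pod["status"]].add(pod["podName"])
    result = {}
    for name, row in rows.items():
        node = nodes.get(name)  # left join: a missing node sample leaves CPU % empty
        cpu = None
        if node is not None and node["allocatableCpuCores"]:
            cpu = node["cpuUsedCores"] / node["allocatableCpuCores"] * 100
        result[name] = {
            "Running Pods": len(row["Running"]),
            "Pending Pods": len(row["Pending"]),
            "CPU %": cpu,
        }
    return result


print(pods_left_join_nodes(PODS, NODES))
```

In this model, node-b still appears with its pod counts even though it has no node sample, which is the behavior that distinguishes a left join from an inner join.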

Relevant Logs / Console output

N/A

Your Environment

Kubernetes Quickstart dashboard: https://github.com/newrelic/newrelic-quickstarts/blob/release/dashboards/kubernetes/kubernetes.json#L586

Additional context

N/A

github-actions[bot] commented 3 months ago

Old issues will be closed after 105 days of inactivity. This issue has been quiet for 90 days and is being marked as stale. Reply here to keep this issue open.