newrelic / nri-kubernetes

New Relic integration for Kubernetes
https://docs.newrelic.com/docs/integrations/kubernetes-integration/get-started/introduction-kubernetes-integration
Apache License 2.0

V2 - NRI image 2-windows-ltsc2022-alpha fails to communicate with infrastructure-command + infra-sdk-cache.json is older than 1m0s #1145

Open david-garcia-garcia opened 9 hours ago

david-garcia-garcia commented 9 hours ago

NRI docker image for windows nodes does not seem to work at all.

Description

Using the V2 infrastructure integration, as per the docs: https://docs.newrelic.com/docs/kubernetes-pixie/kubernetes-integration/installation/kubernetes-windows/

With these settings for the helm chart:

global:
  licenseKey: "${NEWRELIC_LICENSE_KEY}"
  cluster: "${CLUSTER_NAME}"
enableLinux: false
enableWindows: true
windowsOsList:
  - version: 2022
    imageTag: 2-windows-ltsc2022-alpha
    buildNumber: 10.0.20348
    windowsNodeSelector:
      kubernetes.io/os: windows
resources:
  limits:
    memory: 300M
  requests:
    cpu: 50m
    memory: 90M
tolerations:
  - operator: "Exists"
    key: "windows"
    effect: "NoSchedule"

When starting the container it will always print this error:

Commands initial fetch failed.\" component=AgentService error=\"command request submission failed: Get \\"https://infrastructure-command-api.eu.newrelic.com/agent_commands/v1/commands\\": EOF\" service=newrelic-infra

I can confirm this is not a connectivity issue: the domain is allowed in our outbound firewall rules, and I have manually verified that a connection can be established from both the Windows node and a Windows pod. Maybe the image is so old that there is a TLS version or compatibility problem.

After some time, it starts spamming the following error:

Integration command failed\" error=\"exit status 1\" instance=nri-kubernetes integration=com.newrelic.kubernetes prefix=integration/com.newrelic.kubernetes stderr="time="2024-12-04T07:29:06Z" level=warning msg=\"Cache file (c:\var\cache\nr-kubernetes\infra-sdk-cache.json) is older than 1m0s, skipping loading from disk.\"

Although the Windows nodes show up in Infrastructure -> Hosts, no information is shown for them except the agent version (1.20.7) and the apps running on the node (which I presume is inferred from the APM data).

The image is running an agent version (1.20.7) from November 2, 2021:

https://docs.newrelic.com/docs/release-notes/infrastructure-release-notes/infrastructure-agent-release-notes/new-relic-infrastructure-agent-1207/

Also, hostProcess containers became stable in K8S 1.26 in 2022. Maybe that can be used to let the NRI integration capture system, network and disk metrics, which are currently not part of the integration.

https://kubernetes.io/docs/tasks/configure-pod-container/create-hostprocess-pod/
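For reference, a minimal sketch of a HostProcess pod spec, based on the fields described in the Kubernetes page linked above. The pod name, container name, image and command are hypothetical placeholders, not anything the nri-kubernetes chart currently ships. The tolerations simply mirror the ones in the Helm values above.

```yaml
# Sketch only: names, image and command are placeholders, not part of the chart.
apiVersion: v1
kind: Pod
metadata:
  name: hostprocess-agent-example        # hypothetical name
spec:
  securityContext:
    windowsOptions:
      hostProcess: true                   # run the containers as host processes
      runAsUserName: "NT AUTHORITY\\SYSTEM"
  hostNetwork: true                       # required for HostProcess pods
  containers:
    - name: agent                         # hypothetical container name
      image: registry.example.com/agent:placeholder   # placeholder image
      command: ["powershell.exe", "-Command", "Get-Service"]   # placeholder command
  nodeSelector:
    kubernetes.io/os: windows
  tolerations:
    - operator: "Exists"
      key: "windows"
      effect: "NoSchedule"
```

With a spec along these lines the container runs directly on the Windows host, which is what would give an agent access to host-level system, network and disk data. Whether the infrastructure agent can actually run this way is an assumption that would need to be verified.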

Expected Behavior

workato-integration[bot] commented 9 hours ago

https://new-relic.atlassian.net/browse/NR-345947