rancher / windows

Rancher Windows Team project repository.
Apache License 2.0

[Monitoring V2] no metrics from windows nodes available in Grafana when win_prefix_path is set on a windows cluster #79

Open sowmyav27 opened 3 years ago

sowmyav27 commented 3 years ago

What kind of request is this (question/bug/enhancement/feature request): bug

Steps to reproduce (least amount of steps as possible): on 2.5.8-rc18

Expected Result: Metrics from Windows nodes should be available in Grafana.

Other details that may be helpful:

Environment information

aiyengar2 commented 3 years ago

Reproduced in the latest Monitoring chart. Screenshots and logs attached below for investigation.

Seems like an issue with the wins cli proxy command on the wins server side. Filing a related issue with rancher/wins to track.

Redeploying the wins client together with windows-exporter does not seem to resolve the issue; it just produces two more of the exact same log lines:

Handling backend connection request [rancher-monitoring-windows-exporter-5q4nz]
error in remotedialer server [500]: connect not allowed

[Screenshot: Screen Shot 2021-05-04 at 6.41.03 PM]

[Screenshot: Screen Shot 2021-05-04 at 6.43.30 PM]

check-wins-version logs

Detected wins version on host is v0.1.0, which is >v0.1.0. Continuing with installation...

exporter-node logs

time="2021-05-05T01:39:04Z" level=warning msg="No where-clause specified for service collector. This will generate a very large number of metrics!" source="service.go:41"
time="2021-05-05T01:39:04Z" level=error msg="Failed to start service: The service process could not connect to the service controller." source="exporter.go:350"
time="2021-05-05T01:39:04Z" level=info msg="Enabled collectors: system, cpu, net, os, logical_disk, tcp, container, service, cs, memory" source="exporter.go:360"
time="2021-05-05T01:39:04Z" level=info msg="Starting windows_exporter (version=0.15.0, branch=master, revision=cdbb27d0b4ea9810dc35035fad281fe6468b7dd1)" source="exporter.go:412"
time="2021-05-05T01:39:04Z" level=info msg="Build context (go=go1.15.3, user=appveyor-vm\\appveyor@appveyor-vm, date=20201107-08:23:37)" source="exporter.go:413"
time="2021-05-05T01:39:04Z" level=info msg="Starting server on :9796" source="exporter.go:416"

exporter-node-proxy logs

INFO[2021-05-05T01:40:36Z] Connecting to proxy                           url="ws://rancher_wins_proxy"

wins service logs (host)

PS C:\Users\Administrator> Get-EventLog -LogName Application -Source rancher-wins -ErrorAction Ignore | Sort-Object Index | %{ $_.Message }
Stackdump - waiting signal at Global\stackdump-3592
Listening on \\.\pipe\rancher_wins_proxy
Listening on \\.\pipe\rancher_wins
currentVersion.Major > versionRange.MaxVersion.Major: 11, 9
currentVersion.Major > versionRange.MaxVersion.Major: 11, 9
currentVersion.Major < versionRange.MinVersion.Major: 11, 12
currentVersion.Major > versionRange.MaxVersion.Major: 11, 10
currentVersion.Major < versionRange.MinVersion.Major: 11, 12
currentVersion.Major < versionRange.MinVersion.Major: 11, 13
currentVersion.Major > versionRange.MaxVersion.Major: 11, 9
currentVersion.Major > versionRange.MaxVersion.Major: 11, 10
currentVersion.Minor < versionRange.MinVersion.Major: 10, 11
currentVersion.Major < versionRange.MinVersion.Major: 11, 12
currentVersion.Major < versionRange.MinVersion.Major: 11, 13
currentVersion.Major < versionRange.MinVersion.Major: 11, 13
currentVersion.Major > versionRange.MaxVersion.Major: 11, 9
currentVersion.Major > versionRange.MaxVersion.Major: 11, 9
currentVersion.Major < versionRange.MinVersion.Major: 11, 12
currentVersion.Major > versionRange.MaxVersion.Major: 11, 10
currentVersion.Major < versionRange.MinVersion.Major: 11, 12
currentVersion.Major < versionRange.MinVersion.Major: 11, 13
currentVersion.Major > versionRange.MaxVersion.Major: 11, 9
currentVersion.Major > versionRange.MaxVersion.Major: 11, 10
currentVersion.Minor < versionRange.MinVersion.Major: 10, 11
currentVersion.Major < versionRange.MinVersion.Major: 11, 12
currentVersion.Major < versionRange.MinVersion.Major: 11, 13
currentVersion.Major < versionRange.MinVersion.Major: 11, 13
could not get checksum for "c:\\etc\\rancher\\wins\\wins.exe": open c:\etc\rancher\wins\wins.exe: The process cannot access the file because it is being used by another process.
could not get checksum for "c:\\etc\\rancher\\wins\\wins.exe": open c:\etc\rancher\wins\wins.exe: The process cannot access the file because it is being used by another process.
Handling backend connection request [rancher-monitoring-windows-exporter-ldck9]
error in remotedialer server [500]: connect not allowed
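
For anyone triaging a longer dump of these wins service logs, here is a minimal Python sketch that pairs each "Handling backend connection request" line with a remotedialer error that follows it. The function and pattern names are hypothetical, the patterns are derived only from the two log lines quoted in this issue, and the assumption that an error belongs to the most recent request is mine, not something the wins source confirms.

```python
import re

# Patterns derived from the log lines quoted in this issue (not from the wins source).
REQUEST_RE = re.compile(r"Handling backend connection request \[(?P<client>[^\]]+)\]")
ERROR_RE = re.compile(r"error in remotedialer server \[(?P<code>\d+)\]: (?P<reason>.+)")


def find_rejected_clients(lines):
    """Return (client, status_code, reason) for each request followed by an error.

    Assumption: a remotedialer error belongs to the most recent connection
    request that has not already been matched to an error.
    """
    rejected = []
    pending = None  # client from the latest unmatched request line
    for line in lines:
        request = REQUEST_RE.search(line)
        if request:
            pending = request.group("client")
            continue
        error = ERROR_RE.search(line)
        if error and pending is not None:
            rejected.append(
                (pending, int(error.group("code")), error.group("reason").strip())
            )
            pending = None
    return rejected


if __name__ == "__main__":
    sample = [
        r"Listening on \\.\pipe\rancher_wins_proxy",
        "Handling backend connection request [rancher-monitoring-windows-exporter-ldck9]",
        "error in remotedialer server [500]: connect not allowed",
    ]
    for client, code, reason in find_rejected_clients(sample):
        print(f"{client}: {code} {reason}")
```

Feeding it the output of the Get-EventLog command above (one message per line) should list every exporter pod whose proxy connection was refused.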

named pipes (host)

PS C:\Users\Administrator> (get-childitem \\.\pipe\).FullName
... (omitted) ...
\\.\pipe\rancher_wins
\\.\pipe\rancher_wins_proxy
... (omitted) ...

aiyengar2 commented 3 years ago

Possible Workaround

Just deploying rancher-wins-upgrader (i.e. re-initializing the wins service) seems to be an effective workaround for this issue.

I'm not sure whether this works because the fix in wins v0.1.1 somehow resolves this bug (doubtful) or because re-initializing wins is what fixes the issue, since that would cause the named pipe, gRPC server, and network configuration of the host to be re-initialized.

@sowmyav27 once rc19 is cut with wins v0.1.1, can you retest this issue to see if that resolves it?


[Screenshot: Screen Shot 2021-05-04 at 7.12.40 PM]

Jono-SUSE-Rancher commented 3 years ago

@sowmyav27 & @aiyengar2 - We are doing some triage right now of issues in 2.6. Would you be able to give us more information about this? Is this fixed in the latest RC? And Arvind, how does the workaround look as a viable option? (You mentioned it was a possible workaround).

aiyengar2 commented 3 years ago

@Jono-SUSE-Rancher I don't believe this is fixed in the latest RC.

The core problem here seems to be that a Windows cluster that mounts resources on a prefixPath (e.g. c:\host\opt; this is specified as part of the RKE1 config) and does not have rancher-wins-upgrader deployed cannot accept proxy connections via the named pipe mounted at \\.\pipe\rancher_wins_proxy.

This issue appears to be resolved when the wins service is restarted and/or the wins config is refreshed, which is exactly what happens when you deploy rancher-wins-upgrader.

I'm not sure why this restart is required, so this needs to be investigated. The problem could be with the way we do bootstrapping on Windows nodes (e.g. how we set up the config + service), or the fix could require cutting a new wins release. Either way, this would be a Windows issue that is not particular to Monitoring (cc: @sirredbeard ).

Currently, only Monitoring is impacted since only Monitoring uses wins cli proxy, but I believe there are conversations about using that feature in other Windows components (cc: @rosskirkpat), so this does need to be prioritized eventually.

However, if we cannot prioritize this in 2.6, the workaround of expecting rancher-wins-upgrader to be deployed onto Windows clusters with prefixPath enabled sounds like a viable option to me. I think we should encourage customers to start using it anyway so that they can have declarative wins configs (i.e. an expectation that the upgrader chart exists would allow us to cut wins releases more easily in the future if we need to add security fixes, golang bumps, or new features). @luthermonson @sirredbeard any thoughts here?

Either way, if we prioritize the workaround, I think it should be tested rigorously so that we don't miss anything before suggesting it as the official solution to this issue.

deniseschannon commented 3 years ago

@sowmyav27 @aiyengar2 Could this be related to the fact that no metrics are available in Grafana for k8s 1.21?

https://github.com/rancher/rancher/issues/33465

aiyengar2 commented 3 years ago

@deniseschannon that should be unrelated. https://github.com/rancher/rancher/issues/33465 is Monitoring V1; this is Monitoring V2.