prometheus-community / windows_exporter

Prometheus exporter for Windows machines

windows_exporter service failed to start on reboot #551

Status: Closed (f1-outsourcing closed this issue 1 year ago)

f1-outsourcing commented 4 years ago

After updates and rebooting the server, the windows_exporter service was not running

The windows_exporter service failed to start due to the following error: The service did not respond to the start or control request in a timely fashion.

When I look at the recovery options of the windows_exporter service, they are not like those of other 'standard' Windows services. It looks like none of the others has "Reset fail count after" set to 0 and "Restart service after" set to 0.

exporter: [screenshot: exporter]

other examples: [screenshots: workstation, server, firewall, nla]

I am not really an expert on service recovery settings, but maybe someone should look at these. Maybe it would be better to set these values to 3 or 5 minutes?

https://docs.microsoft.com/en-us/archive/blogs/jcalev/some-tricks-with-service-restart-logic
https://social.microsoft.com/Forums/ro-RO/3db76753-4607-4a20-97a0-790c73e379cc/the-actions-after-system-service-failure?forum=winserver8gen

vvvvoid commented 4 years ago

After updates and rebooting the server, the windows_exporter service was not running

+1, we had to restart service after updates

I think startup type should be Automatic(Delayed start) instead of Automatic

carlpett commented 4 years ago

@f1-outsourcing Note that those services have "Subsequent failures" set to "Take no action", meaning the service manager will simply stop trying if the service fails to start twice. The first number, reset after, doesn't matter much when subsequent failures is set to restart. We could possibly set the restart interval to something higher to space restarts out, but before this report, we've never heard of this being a problem. As I've asked in the other issue, any logs that can be found about why it is failing are crucial to solving this, rather than attempting to work around it by changing settings. The exporter really shouldn't need anything else to be running to be able to start, so without any indication of what is going wrong, we can't really troubleshoot it.

daltonjprice commented 4 years ago

@carlpett I've also noticed this behavior. I'm able to reproduce this consistently by rebooting one of the servers I manage. I'd fully expect to see logs in the "Application" event queue from the source "windows_exporter" when the service fails to start, but I don't. All I see is the same thing reported by @f1-outsourcing. Events are created for the service failing to start due to a timeout.

It's also worth noting that I've seen this issue on pretty much all of the ~200 Windows machines we have.

See the following screenshots:

The service fails to start due to timeout: [screenshot]

The service manager fails the service: [screenshot]

The application event queue has no windows_exporter entries in this time period: [screenshot]

Should I circumvent event viewer? I know stdout is a logger option but I didn't see an option to log to a flat file. If you've got some ideas for troubleshooting this I'd be willing to run whatever is needed. This issue has been quite troublesome for us during patching.

babunatarajan commented 4 years ago

I have the same issue on Windows Server 2016.

Event Viewer warning: "Collection timed out, still waiting for [cs os service]"

windows_exporter (version=0.13.0, branch=master, revision=c62fe4477fb5072e569abb44144b77f1c6154016)

cb3inco commented 4 years ago

Same issue here.

bpickhardt commented 4 years ago

Same issue on Server 2019. Fresh installed machines running the windows_exporter agent do not start the agent on reboot. Playing with the automatic restart options did not resolve the issue.

JDA88 commented 4 years ago

I think startup type should be Automatic(Delayed start) instead of Automatic

I agree. And at the very least, you should have the First and Second failure set to "Restart the Service" with a delay of 1 minute.

advorsky73 commented 4 years ago

Same issue on Windows 8.1.

carlpett commented 4 years ago

I've still been unable to reproduce this, unfortunately, so anything you can find about why it is happening on your systems, but not all, would be useful. The only thing that can fail during startup in the exporter code is really where we bind to the network interface, so potentially if the network hasn't come up yet. That'd lead to the exporter exiting though, not a timeout...

@babunatarajan You seem to have a completely different issue, since your error is a timeout during metric collection from a running exporter.

advorsky73 commented 4 years ago

@carlpett If I see this correctly, it works with Delayed start, so my best guess is that the windows_exporter service starts and immediately exits again during its first try, probably because a dependency is not fulfilled at that early stage of boot. Maybe the network, I don't know. However, after installation the service shows no dependencies and no restart options are set, so one failure during start and it stays off, which is not good. My suggestion: change the installer to make the service Automatic (Delayed) and set the restart options to reset after 1 day and retry every 5 minutes. Then this will work.

babunatarajan commented 4 years ago

I set Delayed Start as soon as the service failed to start at boot, but I never really tested it because it is a prod environment. Has anyone set Delayed Start and then rebooted the server? If it works, we can keep this as a workaround.

Thanks

bpickhardt commented 4 years ago

I set my servers to delayed start and it seemed to at least start correctly when Windows started up. I'm unsure if it would restart on failure correctly or not though.

carlpett commented 4 years ago

There are a lot of different threads flying here, and a few misconceptions. First off, regarding restarts. We already configure the service to restart on failure, and delay the restarts by five seconds. This is visible via sc qfailure windows_exporter, but the Services UI appears to only work with minutes, so it shows zero (it would probably make sense to bump this to 60 seconds to reduce confusion).
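The gap between the configured delay and what the Services UI displays can be illustrated with a small calculation. This is only a sketch: it assumes the UI truncates the millisecond delay to whole minutes, which is how the behaviour appears from the description above rather than a documented rule, and `ui_minutes` is a hypothetical helper name.

```python
def ui_minutes(delay_ms: int) -> int:
    """Whole minutes in a restart delay given in milliseconds (assumed truncation)."""
    return delay_ms // 60_000


print(ui_minutes(5_000))   # the five-second delay mentioned above -> 0
print(ui_minutes(60_000))  # the suggested 60-second delay -> 1
```

Under that assumption, any delay below 60000 ms would show as 0 in the UI even though a restart action is configured.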

Then, on the topic of Delayed starts. I'm not in principle against it (it will mean you will not have metrics for ~2 minutes longer than otherwise after a reboot, but that is probably not a huge deal in most cases), but there seems to be a mixed bag of experiences reported on whether it helps or not. I've now tried booting completely without networking and related services enabled, and it does not appear to prevent the windows_exporter from starting. So there's something deeper going on. Are any of you overriding the service account for the service, so you could have a dependency on Active Directory being available?

bpickhardt commented 4 years ago

The 2019 machines I was seeing the problem on are AD joined and hardened with the CIS guidelines. I never had issues last year when I was still using Windows Server 2012 R2 and an older version of the exporter with the service starting correctly on reboot so maybe it's a 2019 Server issue?

SupraOva commented 4 years ago

Hi everyone,

I was able to get past this issue by running these commands:

Delayed start

sc config windows_exporter start= delayed-auto

Restart option

sc failure windows_exporter actions= restart/60000/restart/60000/""/60000 reset= 86400

Tested on Windows Server 2012 R2 / 2016 / 2019.

Hope it helps.
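To make the sc failure arguments above easier to read, here is a small sketch that splits the actions string into its action/delay pairs. It relies on the sc failure convention that action delays are given in milliseconds and the reset period in seconds, and it assumes that an empty action (written as "" on the command line) means "take no action". `parse_failure_actions` is a hypothetical helper, not part of any tool.

```python
def parse_failure_actions(actions: str) -> list:
    """Split an `sc failure` actions string into (action, delay in seconds) pairs."""
    parts = actions.split("/")
    if len(parts) % 2 != 0:
        raise ValueError("expected action/delay pairs")
    pairs = []
    for action, delay_ms in zip(parts[0::2], parts[1::2]):
        # Assumption: an empty or "" action means "take no action".
        name = action.strip('"') or "none"
        pairs.append((name, int(delay_ms) / 1000))
    return pairs


# First failure, second failure, subsequent failures.
actions = parse_failure_actions('restart/60000/restart/60000/""/60000')
# The reset period is given in seconds; 86400 seconds is one day.
reset_days = 86400 / (24 * 60 * 60)
print(actions, reset_days)
```

Read this way, the command restarts the service 60 seconds after the first and second failures, takes no action on later failures, and resets the failure count after one day.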

dry4ng commented 4 years ago

I have the same problem on freshly provisioned Azure Windows VMs: windows_exporter fails to start after VM reboot.

enabled_collectors: "cpu,cs,iis,logical_disk,memory,net,os,service,system"

josephB commented 4 years ago

Solved for me with a folder exclusion rule in Windows Defender, using windows_exporter v0.13. The problem appeared with the August Windows update on Windows 2016 servers.

chinhodado commented 4 years ago

Same issue here.

The windows_exporter service failed to start due to the following error: The service did not respond to the start or control request in a timely fashion.

A timeout was reached (180000 milliseconds) while waiting for the windows_exporter service to connect.

I can confirm setting the service to Delayed Start fixed the issue. Why can't this be set to Delayed Start by default?

majerus1223 commented 4 years ago

@josephB Good call on the exclusion. In our case, it looks like our AV tools needed an exception following the August updates.

carlpett commented 4 years ago

@chinhodado As I mention in my comment above, it doesn't seem to solve it very reliably. If we could figure out why it fixed it for you, that'd be a big step forward towards making a change. If it is related to antivirus starting up, as indicated by some other commenters lately, we'd be much better served by setting the correct service dependency.

majerus1223 commented 4 years ago

I'll see if I can get more detail.

dry4ng commented 4 years ago

Setting delayed start doesn't help. Until it's fixed, I'm using a scheduled task which starts windows_exporter if it's not running every 5 mins.
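A watchdog of the kind described above could look roughly like the following sketch. It is not the script used by the commenter. It assumes English-language `sc query` output, where the state line looks like `STATE : 4  RUNNING`, and the function names are hypothetical. The `run` parameter exists only so the logic can be exercised without a real service.

```python
import subprocess

SERVICE = "windows_exporter"


def parse_state(sc_output: str) -> str:
    """Return the state name (e.g. RUNNING) from `sc query` output, or UNKNOWN."""
    for line in sc_output.splitlines():
        key, _, value = line.partition(":")
        if key.strip() == "STATE":
            fields = value.split()
            # The value looks like "4  RUNNING": a numeric code, then the name.
            return fields[1] if len(fields) > 1 else "UNKNOWN"
    return "UNKNOWN"


def ensure_running(run=subprocess.run) -> bool:
    """Start the service if it is not running; return True if a start was issued."""
    result = run(["sc", "query", SERVICE], capture_output=True, text=True)
    if parse_state(result.stdout) == "RUNNING":
        return False
    run(["sc", "start", SERVICE], capture_output=True, text=True)
    return True
```

Calling `ensure_running()` from a scheduled task that fires every five minutes would approximate the workaround, at the cost of hiding the underlying startup failure rather than explaining it.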

carlpett commented 4 years ago

@dry4ng It'd be interesting to see if your case is solved with an exception in Windows Defender as mentioned above?

bpickhardt commented 4 years ago

In my case, almost all my Windows Server 2016/2019 machines will start the service with the automatic delayed startup after a reboot. I seem to always have a few that do not and I have to go manually start them once I get alerted. I can confirm that I've removed the Windows Defender feature from my Windows 2019 servers because I am using a third-party AV software. I was also thinking of having some kind of work around to start up the service when it is stopped but had been hesitant to put one in place so far.

chinhodado commented 4 years ago

Is there any log that we can look at to debug why the service doesn't start? AFAIK the service doesn't generate any log file.

bpickhardt commented 3 years ago

I installed 0.15 yesterday because I noticed it added a dependency for the Windows service on the WMI service. I experienced the same problem with 0.15: the service would not start when the startup type was set to Automatic. When I changed the startup type to Automatic (Delayed Start) after upgrading to 0.15, the service did start correctly after a reboot.

I noticed looking in the event viewer that the windows_exporter service did start but had problems collecting metrics, and I guess stopped itself, before the event that says the "Windows Management Instrumentation" service was started. Maybe this is the service that should be the dependency instead of or in addition to "WMI Performance Adapter"?

bpickhardt commented 3 years ago

I decided to test my theory about changing the service dependency to the "Windows Management Instrumentation" service. I changed the service startup type back to automatic from delayed start and then changed the dependency from the "WMI Performance Adapter" to the "Windows Management Instrumentation" service. I then restarted 5 times and verified that the windows_exporter service was started each time.

After that, as a sanity check, I changed the dependency back to the "WMI Performance Adapter" and rebooted. On that reboot, however, the windows_exporter service did start correctly. I then rebooted again to see if I would get the same result, and I did. I'm therefore not sure whether changing the dependency is going to solve this problem. I would think, though, that depending on the WMI service directly would probably be a better idea: the performance adapter service on my system is set to manual start, and I observed it was not starting up when I removed the windows_exporter dependency on it, so this dependency is starting an additional service that was not previously running on my system.

I was testing on a Windows 2019 machine. Here are the commands I ran to change the service back to auto and then change the dependency to the WMI service itself instead of the performance adapter. Maybe someone else could do further testing to see if they are able to reproduce the error. If I had to take a random guess here, I think the problem would be more likely to occur on systems where it takes longer to start up the services on boot. My system is pretty quick to reboot and it only sometimes fails to start the windows_exporter service, usually after a Windows update is installed.

sc.exe config windows_exporter start= auto
sc.exe config windows_exporter depend= Winmgmt

dcepulis commented 3 years ago

I can confirm that my company also experiences the same issue with windows_exporter 0.15.0 on Windows Server 2016. The last stable release which did not cause this was wmi_exporter 0.9.0. The trend I have noticed is that the exporter fails to start only after Windows updates; if you perform a normal reboot, it works just fine.

@bpickhardt I am going to test your proposal about the Windows Management Instrumentation dependency on our prod servers. We do not perform Windows updates on all machines at the same time, so I can provide my findings this week.

tbiles commented 3 years ago

We are starting to do more extensive testing of windows_exporter 0.15.0 and are noticing similar trends as mentioned above. We have a pool of 9 test servers ranging from 2008R2 - 2019, including 2019 Core. The problem is that there doesn't seem to be any indicator of the service stopping that I can find in the event logs which leads me to believe it isn't always starting after a reboot.

I'm investigating one 2016 server now. Here is the last time it shows in the application log as started:

69489 Jan 28 16:59 Information windows_exporter 100 Starting server on :9182

Uptime on the server is approximately 6 days, 9 hours, which means it rebooted on 2/10. Get-Hotfix shows a software update applied on 2/10, so that lines up with prior data indicating that a software update and reboot causes the service not to start. I'm trying to find further evidence of this on other OS versions, as about 2/3 of the exporters in our test group aren't running at this time. The only events showing on the system are in the Application log. I couldn't find anything in the System log.

I should also note that the service seems to start fine if I manually start it or reboot the server without any updates in progress or being applied to the system.

Is there any setting in exporter to turn on debugging so it logs more in the event log?

tbiles commented 3 years ago

Regarding the service not starting, I was able to correlate errors with starting with patch times, so there is clearly an issue with the service after a Windows update. Sorry, this is truncated because of powershell, but this looks similar to what others have reported.

308084 Feb 10 00:09 Error Service Control M... 3221232472 The windows_exporter service failed to start due to the following error: ...
308083 Feb 10 00:09 Error Service Control M... 3221232481 A timeout was reached (30000 milliseconds) while waiting for the windows_exporter service to connect.
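The long numbers in the truncated output above are full 32-bit instance IDs rather than the short event IDs shown in Event Viewer. As a sketch, masking off the top two (severity) bits recovers the familiar IDs; `decode_instance_id` is a hypothetical helper name.

```python
def decode_instance_id(instance_id: int) -> tuple:
    """Split a 32-bit event instance ID into (severity bits, event ID)."""
    severity = (instance_id >> 30) & 0x3   # top two bits; 3 means Error
    event_id = instance_id & 0x3FFFFFFF    # remaining bits
    return severity, event_id


for raw in (3221232472, 3221232481):
    print(raw, decode_instance_id(raw))
# 3221232472 (3, 7000)
# 3221232481 (3, 7009)
```

Events 7000 and 7009 are the Service Control Manager's "failed to start" and "timeout while waiting to connect" entries, which matches the message text in the output.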

JDA88 commented 3 years ago

I haven't looked at how the service starts, but usually the best practice for a Windows service is something like this:

Status: Starting (you have a limited time in that mode)

Status: Started

I think windows_exporter is actually doing too much work in the Starting state and should probably move that work to after the Started state. That way, if something is not present in the first phase, it has a chance to retry and wait for a longer timeout before exiting. The other benefit is that it will be easier to log in the second phase. It's really not good that the service fails to start or initialize without any error message.
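The two-phase pattern suggested here can be sketched as follows. This is an illustration in Python, not the exporter's code (which is written in Go). `report_started` and `init` are hypothetical callbacks standing in for "tell the service manager we are running" and "do the real initialization", and the retry count and delay are arbitrary.

```python
import time


def run_service(report_started, init, attempts=5, delay_s=2.0,
                sleep=time.sleep, log=print):
    """Report 'started' first, then retry initialization so failures get logged."""
    report_started()  # leave the time-limited Starting state right away
    for attempt in range(1, attempts + 1):
        try:
            init()
            log(f"initialized on attempt {attempt}")
            return True
        except Exception as exc:  # log the failure instead of exiting silently
            log(f"init attempt {attempt} failed: {exc}")
            if attempt < attempts:
                sleep(delay_s)
    return False
```

The point of the design is that anything slow or fragile happens after the service manager has been answered, so a failure leaves a log trail instead of a bare start timeout.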

carlpett commented 3 years ago

Thanks for the updates and suggestions, all. The thing that has us stumped here is that we haven't managed to find a way to extract any information at all from the startup phase when this happens. One possibility is that there is something in the connection to the Windows Event Log? If anyone who can reliably reproduce this could test if it works when removing the log settings, that'd be interesting.

To the direct question from @JDA88, we don't really do much at all before responding to the service manager apart from checking some configuration, but thinking on it, we probably don't really need to do that, we could probably do it immediately. I could push up a branch that does that if anyone would be willing to test?

tbiles commented 3 years ago

This is hard to reproduce quickly because Microsoft updates requiring a reboot arrive essentially once per month, so it takes a month between tests.

Does anyone have any experiences to share with setting the service to Automatic(Delayed Start) instead of Automatic which is the default? That isn't something I've wanted to have to change unless absolutely necessary.

Tim


majerus1223 commented 3 years ago

On 5 servers (3 running Windows Server 2012, 2 running 2016) I tried the delayed start with very limited success.

dcepulis commented 3 years ago

@carlpett I do not mind testing it out. My team is already a wee bit annoyed with me, so I just took over Windows alerting on-call. For now I have just set the windows_exporter dependency to "Windows Management Instrumentation".

And indeed it is a wee bit too hard to test unless you have a Windows Server VM template which has not been updated.

@tbiles For us, Automatic (Delayed Start) did not help at all. Servers running MSSQL, Passwordstate and VDI had the same issue of windows_exporter not starting. What is strange is that AD, CA and DC servers have so far been running smoothly.

bpickhardt commented 3 years ago

As people have said, it's a crapshoot whether a host reboots without issue when updates get installed. Changing to delayed start seemed to reduce the number of alerts about the exporter not starting, but it still doesn't start on every host after updates. I'd suggest snapshotting a host without updates and using that snapshot to test with. That way you can revert after the updates and test again.

carlpett commented 3 years ago

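Ok, so for anyone willing to give it a shot, here's a binary from CI: https://ci.appveyor.com/api/buildjobs/senfi8b95b4qok8p/artifacts/output%2Famd64%2Fwindows_exporter-0.15.1-immediate-service-run.1%2B50-amd64.exe

But I guess if you only have reboot windows around patch days, it'll be somewhat less useful...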

dcepulis commented 3 years ago

@carlpett I will test the binary with a Windows Server 2016 template and on one of our prod servers when patch evening swings by.

majerus1223 commented 3 years ago

We have some test machines I can run it on, and restart them a few times.

dcepulis commented 3 years ago

@majerus1223 could you test out possible scenarios where you have IIS or MSSQL installed on one of those machines and either nothing or just simple AD? I am asking because those types of services tend to cause issues for us. Cheers!

tbiles commented 3 years ago

So would this be the install process?

1) stop windows_exporter service
2) backup 0.15.0 exe
3) rename windows_exporter-0.15.1-immediate-service-run.1+50-amd64.exe to windows_exporter.exe
4) copy to C:\Program Files\windows_exporter folder
5) start windows_exporter service


majerus1223 commented 3 years ago

On 3 machines I tried stopping the service and moving the exe into place with the name windows_exporter.exe; each time, it would not start. Additionally, if it's run manually I see the error:

"failed to start service The service process could not connect to the service controller"

@dcepulis Two of the boxes are running MSSQL. We do not run IIS, so there is not much I can do there.

JDA88 commented 3 years ago

... One possibility is that there is something in the connection to the Windows Event Log? ...

I have never heard of any issue writing to the event logs from a service. Along with the registry, it's one of the first things available, pretty much before everything else. Maybe if you encounter an error in this phase, the exe should not crash but retry a few times or dump something in the %TEMP% folder for us to investigate. Crashing without any log is what makes it difficult to pinpoint.

majerus1223 commented 3 years ago

Quick note: one of my co-workers has 12 Windows Server 2019 and 2008 R2 boxes running MSSQL with the exporter on delayed start, without a problem for the last 6 months. Just a data point to throw out there.

carlpett commented 3 years ago

@tbiles Yes, that should work.

@JDA88 We've tried logging to file, and it doesn't work either, sadly. From what I remember, we can't get any sign it even reaches main().

@majerus1223 Hm, okay, I might of course have screwed something up in that quick patch. I'll have a look.

tbiles commented 3 years ago

New file won't start for me either:

A timeout was reached (30000 milliseconds) while waiting for the windows_exporter service to connect.

The windows_exporter service failed to start due to the following error: The service did not respond to the start or control request in a timely fashion.


carlpett commented 3 years ago

Yeah, I made a silly mistake. Pushing a new version.

carlpett commented 3 years ago

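New binary: https://ci.appveyor.com/api/buildjobs/5ukospmcrjnfyf47/artifacts/output%2Famd64%2Fwindows_exporter-0.15.1-immediate-service-run.1%2B51-amd64.exe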

tbiles commented 3 years ago

That one seems to be starting up fine, thanks.


JDA88 commented 3 years ago

@JDA88 We've tried logging to file, and it doesn't work either, sadly. From what I remember, we can't get any sign it even reaches main().

If you don't reach main(), it could be a DLL module dependency or loading issue that crashes on initialisation... I have no experience in Go. How does the embedded module loading sequence work? Can you do a kind of "late binding" where you control the loading sequence manually instead of relying on the framework to do so? Or can you bind to "events" in the loading sequence?

Sorry if part of this doesn't make sense in Go; I'm trying to use my experience of other frameworks.