Open jammiemil opened 2 years ago
Some supporting traces that show the startup time of the binary. For context, service initiation is done in main(), so main() MUST be reached within 30s to allow for a successful startup:
VM with minimal load and minimal CPU: ~15s
Same VM with high CPU load at startup: ~59s (this would trigger a service failure)
Thanks for the PR! Will take a look at this today.
This issue has been automatically marked as stale because it has not had any activity in the past 30 days. The next time this stale check runs, the stale label will be removed if there is new activity. The issue will be closed in 7 days if there is no new activity. Thank you for your contributions!
Reopening since there's a PR to fix this.
This is really annoying. On small Windows VMs the failure is pretty consistent. Anyway, as the fix is still not released, one can set the service to restart itself after failure and delay that restart by an arbitrary amount of time; that will eventually work.
I'm also affected by this. The service takes a long time to start while processing the WAL journal, and it eventually starts after a few attempts. Unfortunately, even with automatic restart and unlimited retries, I still sometimes find the service stopped. Additionally, at random, I am sometimes unable to control the service any more and get the error below, even though the agent is running:
Hi there :wave:
On April 9, 2024, Grafana Labs announced Grafana Alloy, the spiritual successor to Grafana Agent and the final form of Grafana Agent flow mode. As a result, Grafana Agent has been deprecated and will only be receiving bug and security fixes until its end-of-life around November 1, 2025.
To make things easier for maintainers, we're in the process of migrating all issues tagged variant/flow to the Grafana Alloy repository to have a single home for tracking issues. This issue is likely something we'll want to address in both Grafana Alloy and Grafana Agent, so just because it's being moved doesn't mean we won't address the issue in Grafana Agent :)
Similar to windows_exporter (and a number of other Go applications on Windows), Grafana Agent can fail to start as a service on Windows following a Windows update or other high-CPU event during service startup.
I wrote a fairly detailed analysis of the issue here that explains the cause. In brief, it comes down to the way Go initialises packages versus the time within which Windows expects a response from an application starting as a service (30s or 60s depending on your version of Windows), as per this diagram:
A workaround is to delay the start of the service (default 60s or 120s depending on your version of Windows), but there are still situations where the resource contention does not clear up in time and the service fails to start, with the following recorded in the event log:
I've submitted a PR against windows_exporter to move the service initiation code out of main and into its own package, so that the Windows service can be started as early in the startup as possible rather than in main(), and I intend to do the same for Grafana Agent.
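To illustrate the idea behind that change, here is a minimal, platform-independent sketch of Go's initialisation order. It is not the code from either PR. `registerService` is a hypothetical stand-in for whatever call tells the Windows Service Control Manager that the process is alive, and the two `init` functions in one file only approximate what would really be two separate packages (a small, dependency-free service package and the slower-initialising rest of the program). The 50ms sleep is an arbitrary placeholder for expensive package initialisation.

```go
package main

import (
	"fmt"
	"time"
)

// events records the order in which the startup phases run.
var events []string

// registerService is a HYPOTHETICAL placeholder for the real call that
// reports to the Windows Service Control Manager. It only logs here.
func registerService() {
	events = append(events, "service registered")
}

// First init: stands in for a small package whose only job is to register
// the service as early as possible, before any heavy initialisation.
func init() {
	registerService()
}

// Second init: stands in for slow initialisation elsewhere in the program.
// Under CPU contention this phase is what can push main() past the timeout.
func init() {
	time.Sleep(50 * time.Millisecond) // arbitrary placeholder delay
	events = append(events, "slow init done")
}

func main() {
	// main only runs after every init function has completed.
	events = append(events, "main reached")
	fmt.Println(events)
}
```

Running this shows the registration step first, then the slow initialisation, then main(). The design point is that anything done in an early `init` no longer waits on the slow phase, whereas anything done in main() always does.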