newrelic / nri-winservices

Windows services Integration for New Relic Infrastructure
Apache License 2.0

NRQL Alert incidents do not close after service is back up #106

Closed kimnjeru closed 2 years ago

kimnjeru commented 2 years ago

The alert policy fires as expected when the service is down and opens an incident, but the incident never closes when the service is back up, as reported on this internal link.

Description

The NRQL alert policy below is used to check when a specific service is down. In this case it is the 'Windows Time' service ('w32time'), but the same applies to any service.

FROM Metric SELECT count(*) WHERE start_mode='auto' AND state!='running' where service_name='w32time' FACET service_name

Expected Behavior

Currently, when the service is started after being down, the incident does not close. The expectation is that the incident closes shortly after the service is back up.

Steps to Reproduce

  1. Deploy New Relic Infrastructure agent 1.20.2 or above and enable this integration.
  2. Create an NRQL alert policy in New Relic with the NRQL shown above.
  3. Stop a service, wait a few minutes and start the service.

carlossscastro commented 2 years ago

@kimnjeru

This is related to how alerts evaluate COUNT when zero values are returned. COUNT never returns 0; it returns null instead. For that reason, you will need to edit your alert policy and add a signal loss threshold to close the violations after X minutes.

With signal loss configured, the alert will close after the service is back up and running and the count returns to 0 (or null).
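
To illustrate why the incident stays open without a signal loss setting, here is a minimal Python sketch of the behavior described above. It is a simplified model, not New Relic's actual evaluation engine: the function names `count_signal` and `incident_timeline` and the `expiration_windows` parameter are hypothetical, and it ignores details such as threshold durations. It assumes that a window with no matching rows yields no data point (null) rather than 0, so the threshold is never re-evaluated unless a signal loss expiration closes the incident.

```python
from typing import List, Optional, Sequence


def count_signal(matching_rows: Sequence[int]) -> List[Optional[int]]:
    """Model count(*) per window: no matching rows gives None (null), not 0."""
    return [n if n > 0 else None for n in matching_rows]


def incident_timeline(
    signal: Sequence[Optional[int]],
    threshold: int = 0,
    expiration_windows: Optional[int] = None,
) -> List[bool]:
    """Return whether an incident is open after each window.

    expiration_windows is a hypothetical stand-in for a signal loss
    threshold: after that many consecutive empty windows, an open
    incident is closed. None means signal loss is not configured.
    """
    incident_open = False
    silent = 0  # consecutive windows without a data point
    timeline = []
    for value in signal:
        if value is None:
            # Nothing to compare against the threshold, so the state is
            # unchanged unless the signal loss expiration is reached.
            silent += 1
            if (
                incident_open
                and expiration_windows is not None
                and silent >= expiration_windows
            ):
                incident_open = False
        else:
            silent = 0
            incident_open = value > threshold
        timeline.append(incident_open)
    return timeline


# Service running, down for two windows, then running again.
rows = [0, 2, 3, 0, 0, 0, 0]
signal = count_signal(rows)

print(incident_timeline(signal))
print(incident_timeline(signal, expiration_windows=3))
```

In this model the first call leaves the incident open through the last window, because the query stops producing data once the service is running again. The second call closes it after three consecutive empty windows, which mirrors the signal loss threshold suggested above.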

I'll go ahead and close this issue but feel free to reopen it if it still doesn't work for you or if you need any other help.

Rg,