IlliciteS commented 1 year ago

Hello,

We ran into something we cannot understand. All our alarms are set up manually through the Praeco UI (so not by editing YAML). Some alarms run fine, while others just stop working all of a sudden.

An example :

AlarmMainView

The data in the main view are fine.

QueryLog

Here is the Query Log view. You can see that it stopped working on 4/6/2023 at 9:00:00 AM; I don't know why. I could not get it working again, so I had to duplicate the alarm, edit the copy, change the limit execution by a small amount (for example 4 min instead of 5), save it, delete the old broken alarm, then go back to the duplicated alarm and rename it to the original name.

The Query Log worked again... and then, since 6/21/2023 2:34:52 PM, it has not run anymore.

Praeco Yaml:

```yaml
praeco_full_path: "FOTT/Services/High number error access Service TV HISENSE"
praeco_query_builder: "{\"query\":{\"logicalOperator\":\"all\",\"children\":[{\"type\":\"query-builder-rule\",\"query\":{\"rule\":\"actKey\",\"selectedOperator\":\"contains\",\"selectedOperand\":\"actKey\",\"value\":\"accessServiceError\"}},{\"type\":\"query-builder-rule\",\"query\":{\"rule\":\"media\",\"selectedOperator\":\"contains\",\"selectedOperand\":\"media\",\"value\":\"tvhisense\"}}]}}"
alert:
  - "ms_teams"
  - "pagerduty"
alert_subject: "High number error access Service TV HISENSE"
alert_text: ""
doc_type: "dynamic_templates"
filter:
  - query:
      query_string:
        query: "actKey:accessServiceError AND media:tvhisense"
generate_kibana_discover_url: true
import: "../../BaseRule.config"
index: "switchplus-ott-prod"
is_enabled: true
kibana_discover_app_url: ""
kibana_discover_columns:
  - "actkey.keyword"
kibana_discover_from_timedelta:
  minutes: 10
kibana_discover_index_pattern_id: "fb57fb40-67ed-11ec-9329-ff0008e8c62a"
kibana_discover_to_timedelta:
  minutes: 10
kibana_discover_version: "7.15"
limit_execution: "0 */1 * * *"
match_enhancements: []
ms_teams_alert_summary: "ElastAlert Message"
ms_teams_attach_kibana_discover_url: false
ms_teams_kibana_discover_title: "Discover in Kibana"
ms_teams_proxy: ""
ms_teams_theme_color: "#F912DE"
ms_teams_webhook_url: ""
name: "High number error access Service TV HISENSE"
num_events: 2500
pagerduty_api_version: "v2"
pagerduty_client_name: "elastalert"
pagerduty_event_type: "trigger"
pagerduty_incident_key: "nbErrorAccessTVHISENSE"
pagerduty_service_key: "R03DL1FOO4FU3BPVDWUPARRTGWOCH29Z"
pagerduty_v2_payload_component: "threshold"
pagerduty_v2_payload_group: "FOTT"
pagerduty_v2_payload_severity: "critical"
pagerduty_v2_payload_source: "ElastAlert"
priority: 2
realert:
  hours: 1
terms_size: 50
timeframe:
  hours: 1
timestamp_field: "dt"
timestamp_type: "iso"
type: "frequency"
use_count_query: true
use_strftime_index: false
```

With some alarms, the duplicate workaround does not help: they never update and show this Query Log tab (while the graph in the overview works perfectly): QueryLogNoData

Its overview:

Overview

Praeco Yaml for that one:

```yaml
praeco_full_path: "FOTT/Usage/Nb de lancement de player bas PLAYSTATION CPFRA"
praeco_query_builder: "{\"query\":{\"logicalOperator\":\"all\",\"children\":[{\"type\":\"query-builder-rule\",\"query\":{\"rule\":\"actKey\",\"selectedOperator\":\"contains\",\"selectedOperand\":\"actKey\",\"value\":\"launchOnePlayer\"}},{\"type\":\"query-builder-rule\",\"query\":{\"rule\":\"media\",\"selectedOperator\":\"contains\",\"selectedOperand\":\"media\",\"value\":\"playstation\"}},{\"type\":\"query-builder-rule\",\"query\":{\"rule\":\"zone\",\"selectedOperator\":\"contains\",\"selectedOperand\":\"zone\",\"value\":\"cpfra\"}}]}}"
alert:
  - "pagerduty"
alert_subject: "Very low launched player on PLAYSTATION on CPFRA"
alert_text: "Very low launched player on PLAYSTATION on CPFRA"
doc_type: "dynamic_templates"
filter:
  - query:
      query_string:
        query: "actKey:launchOnePlayer AND media:playstation AND zone:cpfra"
generate_kibana_discover_url: true
import: "../../BaseRule.config"
index: "switchplus-ott-prod"
is_enabled: true
kibana_discover_app_url: ""
kibana_discover_from_timedelta:
  minutes: 10
kibana_discover_index_pattern_id: "fb57fb40-67ed-11ec-9329-ff0008e8c62a"
kibana_discover_to_timedelta:
  minutes: 10
kibana_discover_version: "7.15"
limit_execution: "0 */1 * * *"
match_enhancements: []
name: "Nb de lancement de player bas PLAYSTATION CPFRA"
pagerduty_api_version: "v2"
pagerduty_client_name: "elastalert"
pagerduty_event_type: "trigger"
pagerduty_incident_key: "nbLaunchPlayerPlaystationCPFRA"
pagerduty_service_key: "R03DL1FOO4FU3BPVDWUPARRTGWOCH29Z"
pagerduty_v2_payload_component: "threshold"
pagerduty_v2_payload_group: "FOTT"
pagerduty_v2_payload_severity: "critical"
pagerduty_v2_payload_source: "ElastAlert"
priority: 2
realert:
  minutes: 10
terms_size: 50
threshold: 150
timeframe:
  hours: 1
timestamp_field: "dt"
timestamp_type: "iso"
type: "flatline"
use_count_query: true
use_strftime_index: false
```

And for some other alarms, after duplicating them and removing the original, the Query Log tab goes back to showing the original alarm's Query Log, as if the original had never been deleted, and keeps the old frozen history. I am wondering whether the old / first alarm has really been deleted (and if not, why it does not appear in Praeco).

And that's why I have 3 questions:

1 - Any idea why this happens (other than after a Docker container is destroyed and rebuilt; I have already noticed it in that case)?
2 - Is there a way to "reconnect" all the alarms after such an incident without duplicating them (when that even works)? I tried disabling and re-enabling an alarm; it does not help.
3 - Are there any logs about a specific alarm in the Praeco container and/or in the ElastAlert container? If so, where?

👀 Operating environment

your praeco / elastalert docker files
elasticsearch version : 7.15.2
version of praeco : praeco 1.8.13
praecoapp/elastalert-server:20230402

nsano-rururu commented 1 year ago

It may be related to the behavior described at the following link. In the next version, I would like to make these settings configurable from the Praeco UI. https://elastalert2.readthedocs.io/en/latest/elastalert.html

nsano-rururu commented 1 year ago

Try adding the following to BaseRule.config:

disable_rules_on_error: false

IlliciteS commented 1 year ago

I added that to BaseRule.config (only that line, nothing else), but it does not fix the problem. That being said, we found a curious workaround:

If we go to Praeco and, for a frozen alarm, choose Edit -> disable "Limit Execution" and wait a bit, the alarm starts running again. We can then re-enable Limit Execution and the alarm keeps working.

It also works through the yaml. So we plan to write a script that adds a # to comment out limit_execution in the yaml of every alarm and then, about 2 minutes later, removes that # to uncomment it.

Note: we updated to your latest versions of Praeco and ElastAlert, and the workaround still works.

nsano-rururu commented 1 year ago

There is a comment describing a feature that limits the execution of rules to specific times of the day, rather than disabling alerts:

https://github.com/Yelp/elastalert/issues/492#issuecomment-438024625

I've merged this feature into a new branch, beta, and released it as a new package version 0.2.0b1 available on pypi.

This includes a couple other changes as well, like threading support, but you can now limit rule execution to certain times of the day using limit_execution using cron syntax. For example

`limit_execution: "* 7-22 * * *"`

Would mean to only run the rule between 7 am and 10 pm every day.

This feature is still in beta, of course, but you're welcome to try.

IlliciteS commented 1 year ago

Yes, we are already using the limit_execution. To be clear, this:

Which equals, in yaml, to:

If I am not mistaken, right?

So it is this feature that causes the issue (for us, at least). To make the alarms work again, we do not disable the alarm itself, only this feature.

So the script we will implement will do the following. First, it will comment out limit_execution in the yaml:

`# limit_execution: "0 */1 * * *"`

And then it will delete this #, i.e. uncomment the setting so that the feature works again:

`limit_execution: "0 */1 * * *"`

The alarms stay enabled the whole time.
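The comment / uncomment step described above could look roughly like the following. This is only a minimal sketch, not an official Praeco or ElastAlert tool: the function names and the rules directory are hypothetical, and it assumes each rule file keeps `limit_execution` on a single line of its own.

```python
import re
from pathlib import Path

# Matches a limit_execution line, whether or not it is already commented out with '#'.
_LIMIT_RE = re.compile(r"^(?P<indent>\s*)(?P<hash>#\s*)?(?P<rest>limit_execution\s*:.*)$")


def set_limit_execution_commented(yaml_text: str, commented: bool) -> str:
    """Return yaml_text with every limit_execution line commented or uncommented."""
    out = []
    for line in yaml_text.splitlines(keepends=True):
        body = line.rstrip("\r\n")
        ending = line[len(body):]  # preserve the original line ending
        match = _LIMIT_RE.match(body)
        if match:
            prefix = "# " if commented else ""
            body = f"{match.group('indent')}{prefix}{match.group('rest')}"
        out.append(body + ending)
    return "".join(out)


def toggle_rules(rules_dir: str, commented: bool) -> list:
    """Rewrite every *.yaml file under rules_dir (hypothetical path); return changed files."""
    changed = []
    for path in sorted(Path(rules_dir).rglob("*.yaml")):
        original = path.read_text(encoding="utf-8")
        updated = set_limit_execution_commented(original, commented)
        if updated != original:
            path.write_text(updated, encoding="utf-8")
            changed.append(path)
    return changed
```

A wrapper would call `toggle_rules(rules_dir, True)`, wait about two minutes (for example with `time.sleep(120)`), then call `toggle_rules(rules_dir, False)`. Working line by line instead of parsing and re-dumping the YAML keeps the rest of each rule file byte-for-byte unchanged.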

Before, the alarms were "frozen": the Query Log showed either "no data" or an old date. After this workaround, the alarms are no longer frozen and work again: the Query Log tab displays all the queries made by the alarms.

By the way, maybe we are not using Limit Execution the proper way. We use it to run a query every 5 min, or every hour, for instance. If we want to run an alarm only between 10 am and 11 pm, we use the "Use Time Window" feature, as shown in the first screenshot.
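To see which minutes a given cron expression actually covers, a small standard-library check can help. This is a simplified sketch under the assumption of the usual five-field cron semantics (minute, hour, day of month, month, day of week); it is not how ElastAlert evaluates `limit_execution` internally, and `cron_matches` is a hypothetical helper name.

```python
from datetime import datetime

# (min, max) for the five standard cron fields:
# minute, hour, day of month, month, day of week (0 = Sunday).
_BOUNDS = [(0, 59), (0, 23), (1, 31), (1, 12), (0, 6)]


def _expand(field: str, lo: int, hi: int) -> set:
    """Expand one cron field ('*', 'a', 'a-b', with optional '/step', comma lists)."""
    values = set()
    for part in field.split(","):
        base, _, step = part.partition("/")
        step_n = int(step) if step else 1
        if base == "*":
            start, end = lo, hi
        elif "-" in base:
            a, b = base.split("-", 1)
            start, end = int(a), int(b)
        else:
            start = int(base)
            end = hi if step else start
        values.update(range(start, end + 1, step_n))
    return values


def cron_matches(expr: str, when: datetime) -> bool:
    """True if `when` falls on a minute selected by the five-field expression.

    Simplification: all five fields must match (real cron ORs day-of-month and
    day-of-week when both are restricted).
    """
    fields = expr.split()
    if len(fields) != 5:
        raise ValueError("expected 5 cron fields, got %d" % len(fields))
    minute, hour, dom, month, dow = (
        _expand(f, lo, hi) for f, (lo, hi) in zip(fields, _BOUNDS)
    )
    cron_dow = (when.weekday() + 1) % 7  # Python: Monday=0; cron: Sunday=0
    return (
        when.minute in minute
        and when.hour in hour
        and when.day in dom
        and when.month in month
        and cron_dow in dow
    )
```

Under these assumptions, `cron_matches("* 7-22 * * *", ...)` is true for every minute from 07:00 through 22:59, whereas `cron_matches("0 */1 * * *", ...)` is true only at minute 0 of each hour and `cron_matches("*/5 * * * *", ...)` only at minutes divisible by 5.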

nsano-rururu commented 1 year ago

https://elastalert2.readthedocs.io/en/latest/ruletypes.html#limit-execution

nsano-rururu commented 1 year ago

https://crontab.cronhub.io/

nsano-rururu commented 1 year ago

this

nsano-rururu commented 1 year ago

https://github.com/Yelp/elastalert/issues/2119

IlliciteS commented 1 year ago

Thanks for your answer. That's interesting; this issue has been known since 2019. So perhaps we should disable limit_execution for now and let the main cron run every alarm.

nsano-rururu commented 1 year ago

Regarding limit_execution, I think there may be a bug; I recall a related question in the elastalert2 discussions. https://github.com/jertel/elastalert2/discussions

johnsusek / praeco

Alarms suddently stop alerting (query log without data or not updating / being frozen) #540
