sensu / sensu-go

Simple. Scalable. Multi-cloud monitoring.
https://sensu.io
MIT License

Random checks are not getting executed #4330

Closed raihanchdy closed 1 year ago

raihanchdy commented 3 years ago

I have more than 200 checks running on a three-node Sensu cluster. Randomly, some of the checks are not getting executed.

image

This check is scheduled to run every 5 minutes, but the event shows that it last executed 2 hours ago, and that was a run I triggered manually from the UI.

Please help fix the issue.

mcatngena commented 3 years ago

How do you expect the Sensu guys to help if you don't provide enough details? You did not even follow the issue template, which explains what details to fill in...

Anyway, we experienced this several times as well (no issue submitted from our side yet).

calebhailey commented 3 years ago

@raihanchdy we will need more information to determine if this is a bug or a simple configuration issue. Can you first of all check to see if there are any agent entities configured with the roundrobin:worker subscription? The following command should help:

sensuctl entity list --field-selector='"roundrobin:worker" IN entity.subscriptions'

NOTE: in Sensu Go it is no longer necessary to prefix subscriptions with roundrobin:, so your roundrobin:worker subscription is not being parsed or having any other special handling applied to it; I'll assume this is a vestige remaining after migrating configs from Sensu Core.
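To illustrate that note: in Sensu Go, round-robin scheduling is requested with the check's round_robin attribute together with a plain subscription name. The fragment below is only a minimal sketch; the check name, command, and interval are placeholders and are not taken from this cluster.

```json
{
  "type": "CheckConfig",
  "api_version": "core/v2",
  "metadata": {
    "name": "example-check",
    "namespace": "default"
  },
  "spec": {
    "command": "example-plugin",
    "interval": 300,
    "subscriptions": ["worker"],
    "round_robin": true,
    "publish": true
  }
}
```

An agent receives this check only if its own subscriptions include worker.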

raihanchdy commented 3 years ago

@calebhailey Please find the output of the command sensuctl entity list --field-selector='"roundrobin:worker" IN entity.subscriptions' image

All of a sudden, most of the proxyclient checks stopped executing. The check I shared was running, but now it's not executing at all. image

image

raihanchdy commented 3 years ago

@calebhailey More info: a check is scheduled to execute every 2 hours, but all of a sudden it stopped executing. No error is logged when a check is not executing, and without an error it is very hard to know why a check is not being executed.

Workaround: I changed the interval to 2 minutes and recreated the check with sensuctl, and the check started executing again. My concern is that as long as the Sensu cluster is running, the check should keep running.

I have no idea how this is possible.

raihanchdy commented 3 years ago

@calebhailey I installed Sensu 6.4.0 and the issue persists: checks are not executing at their specified interval. Sensu is mostly used for executing checks, so if the checks themselves aren't executing, it's a huge issue. Could you please look into it and let me know if you need more details? image The above check is configured to execute every 15 minutes, but 32 minutes have already passed without it executing. image

Please let me know if you need additional details.
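With more than 200 checks, it can be easier to find stalled ones programmatically than from the UI. The following is a minimal sketch, not an official Sensu tool: it assumes the events were exported with sensuctl event list --format json and that each event carries check.interval and check.executed (a Unix timestamp). The function name find_overdue and the parameters grace and cron_period are hypothetical, chosen for illustration.

```python
def find_overdue(events, now, grace=2.0, cron_period=None):
    """Return (entity, check, age_seconds) for checks that look overdue.

    A check is flagged when the time since its last execution exceeds
    grace * period. Cron-scheduled checks are assumed to report no usable
    interval, so they are skipped unless cron_period (seconds) is given.
    """
    overdue = []
    for event in events:
        check = event.get("check") or {}
        # Fall back to the caller-supplied period when interval is 0 or absent.
        period = check.get("interval") or cron_period
        executed = check.get("executed")
        if not period or not executed:
            continue  # nothing to compare against
        age = now - executed
        if age > grace * period:
            entity = (event.get("entity") or {}).get("metadata", {}).get("name", "?")
            name = check.get("metadata", {}).get("name", "?")
            overdue.append((entity, name, int(age)))
    return overdue
```

To use it, load the exported JSON with json.load and pass time.time() as now. The grace factor avoids flagging checks that are merely a little late; for cron checks such as the one discussed here, cron_period=900 would approximate a 15-minute schedule.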

raihanchdy commented 3 years ago

@calebhailey The issue persists even in the latest release, 6.4.2. Could you please let me know if this is a bug and when a fix will come? With this issue, Sensu has become unusable.

The issue is with round-robin checks that have the proxyclient attribute.

mcatngena commented 3 years ago

@raihanchdy I can't see the entity attributes definition for the proxy check. You are using a proxy client, so I assume you have entity attributes defined in the check. Can you share that?

raihanchdy commented 3 years ago

@mcbsd please find the check definition:

{
  "api_version": "core/v2",
  "type": "Check",
  "metadata": {
    "namespace": "default",
    "name": "vm-diskusage-check-mnt-eslog-critical",
    "labels": {},
    "annotations": {
      "sensu.io.json_attributes": "{\"type\":\"standard\",\"refresh\":7200}",
      "fatigue_check/interval": "7200"
    }
  },
  "spec": {
    "command": "python3.6 /etc/sensu/plugins/vm-alerts.py disk disk_data_elk --critical 85 --check vm-diskusage-check-mnt-eslog-critical",
    "subscriptions": [
      "worker"
    ],
    "round_robin": true,
    "publish": true,
    "cron": "*/15 * * * *",
    "handlers": [
      "alert_handler_no_host",
      "resolve_handler_no_host",
      "ops_alert_handler_no_host",
      "tester_handler"
    ],
    "proxy_entity_name": "proxyclient",
    "timeout": 890
  }
}

raihanchdy commented 3 years ago

In the Sensu server log there is an entry for sending the check request, but the check is not executing on the agent.

mcatngena commented 2 years ago

@raihanchdy I don't think this is a Sensu issue. It seems your check definition is incomplete. A proxy check needs to have entity attributes, for instance (including splay, which is optional):

...
"entity_attributes": [
  "entity.entity_class == 'proxy'",
  "entity.name.indexOf('firewall') >= 0"
],
"splay": true,
"splay_coverage": 90
...

I suggest reading https://docs.sensu.io/sensu-go/latest/observability-pipeline/observe-schedule/checks/#use-a-proxy-check-to-monitor-multiple-proxy-entities and then considering whether what you have fits the concept.
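For context, in the Sensu Go check specification the entity_attributes, splay, and splay_coverage attributes sit inside a proxy_requests object in the check's spec. A sketch of how the fragment above might fit into a full definition follows; the check name, command, and interval are placeholders, and the subscription and round_robin values are simply carried over from the definition posted earlier in this thread.

```json
{
  "type": "CheckConfig",
  "api_version": "core/v2",
  "metadata": {
    "name": "example-proxy-check",
    "namespace": "default"
  },
  "spec": {
    "command": "example-plugin",
    "interval": 300,
    "subscriptions": ["worker"],
    "round_robin": true,
    "publish": true,
    "proxy_requests": {
      "entity_attributes": [
        "entity.entity_class == 'proxy'",
        "entity.name.indexOf('firewall') >= 0"
      ],
      "splay": true,
      "splay_coverage": 90
    }
  }
}
```

This differs from the posted definition, which uses proxy_entity_name to attach results to a single named proxy entity rather than matching several entities by attribute.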

What actually is a Sensu issue is that you can execute proxy checks on demand; that is the real failure here. An issue is open for it, hopefully one day...

asachs01 commented 1 year ago

@raihanchdy following up here since @mcbsd replied on the issue. If this is still happening on a recent Sensu release, let's move the conversation over to Sensu Community Slack or to the Sensu Discourse Forums