sensu / sensu-go

Simple. Scalable. Multi-cloud monitoring.
https://sensu.io
MIT License

Proxy checks not scheduled when executed manually from check level (not device level) #3905

Open mcatngena opened 4 years ago

mcatngena commented 4 years ago

Hi,

we are experiencing a problem with proxy checks: it is not possible to schedule them ad hoc. Disclaimer: we know it is not possible to schedule them from the device layer (the device home page) for one specific device. But, as with the old Sensu Core, scheduling from the check layer (the check home page), which runs the check across all matching entities, should work.

We have a feeling it might have the same root cause as https://github.com/sensu/sensu-go/issues/3857

Expected Behavior

When a check is executed from the check home page, it should be executed across all matching entities, in the same way as when it is scheduled automatically on its time interval.

Current Behavior

When a check is executed from the check home page, nothing happens: the check result for the matching device is not updated. It is updated only during automatic scheduling based on the time interval. At the time of check execution from the portal (or sensuctl), this log message appears:

{"check":"cisco_ipsla","component":"schedulerd","level":"warning","msg":"no matching entities, check will not be published","namespace":"default","time":"2020-07-15T10:45:46+02:00"}

This happens despite the fact that there are devices which match entity_attributes in the proxy check definition, and the match works fine when the check is scheduled automatically at the defined interval.
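
To spot this condition in backend logs, the warning can be matched on its JSON fields. The following is a minimal Python sketch, not part of Sensu; the function name `is_unpublished_proxy_warning` is hypothetical, and it assumes each log line ends with the JSON record shown above (optionally preceded by a syslog prefix).

```python
import json

# Message text copied from the schedulerd warning quoted in this issue.
NO_MATCH_MSG = "no matching entities, check will not be published"


def is_unpublished_proxy_warning(line, check_name):
    """Return True if the log line is the schedulerd warning for check_name."""
    start = line.find("{")  # skip any syslog prefix before the JSON record
    if start == -1:
        return False
    try:
        record = json.loads(line[start:])
    except json.JSONDecodeError:
        return False
    return (
        record.get("component") == "schedulerd"
        and record.get("check") == check_name
        and record.get("msg") == NO_MATCH_MSG
    )


sample = (
    '{"check":"cisco_ipsla","component":"schedulerd","level":"warning",'
    '"msg":"no matching entities, check will not be published",'
    '"namespace":"default","time":"2020-07-15T10:45:46+02:00"}'
)
print(is_unpublished_proxy_warning(sample, "cisco_ipsla"))
```

Matching on `component`, `check` and `msg` together avoids false positives from other schedulerd warnings for the same check.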

Possible Solution

N/A. A very dirty workaround is to redefine the scheduling interval to a shorter period for a moment and then change it back. But we have to delete and recreate the check in order to do that (due to https://github.com/sensu/sensu-go/issues/3857).
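
The order of operations in that workaround can be written down as a plan. This is only a Python sketch of the sequence described above: it builds the list of steps and does not call Sensu. The helper name `workaround_steps` and the step labels are hypothetical, and the dictionary is a trimmed-down stand-in for the check definition.

```python
import copy


def workaround_steps(check, temporary_interval):
    """Return the ordered (action, payload) steps of the interval workaround."""
    temporary = copy.deepcopy(check)
    temporary["spec"]["interval"] = temporary_interval
    name = check["metadata"]["name"]
    return [
        ("delete", name),                       # the interval cannot be edited in place
        ("create", temporary),                  # same check with a short interval
        ("wait_for_scheduled_run", temporary_interval),
        ("delete", name),
        ("create", copy.deepcopy(check)),       # restore the original definition
    ]


check = {
    "metadata": {"name": "cisco_ipsla", "namespace": "default"},
    "spec": {"interval": 10800},
}
steps = workaround_steps(check, 60)
for action, payload in steps:
    print(action, payload)
```

The original definition is deep-copied so that the temporary interval never leaks into the definition that is restored at the end.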

Steps to Reproduce (for bugs)

1. Ideally, simulate this on a cluster setup.
2. Create a proxy check which matches at least one device.
3. Observe the automatic scheduling.
4. Execute the check from the check page and observe whether the result for the respective device was updated.

Check definition:

type: CheckConfig
api_version: core/v2
metadata:
  name: cisco_ipsla
  namespace: default
spec:
  check_hooks: null
  command: <command>
  env_vars: null
  handlers:
    - incident
    - logger
  high_flap_threshold: 0
  interval: 10800
  low_flap_threshold: 0
  output_metric_format: ""
  output_metric_handlers: null
  proxy_entity_name: ""
  proxy_requests:
    entity_attributes:
    - entity.entity_class == 'proxy'
    - entity.subscriptions.indexOf('cisco_ipsla') >= 0
    splay: true
    splay_coverage: 90
  publish: true
  round_robin: true
  runtime_assets: null
  stdin: false
  subdue: null
  subscriptions:
  - proxy_agent
  timeout: 0
  ttl: 0

Proxy entity definition:

type: Entity
api_version: core/v2
metadata:
  #annotations:
  labels:
    model: asr9k
  name: router01
  namespace: default
spec:
  deregister: false
  deregistration: {}
  entity_class: proxy
  last_seen: 0
  sensu_agent_version: ""
  subscriptions:
  - cisco_ipsla
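
To make the expected match explicit, the two `entity_attributes` expressions from the check definition can be mirrored in Python. This is only an illustration of their logic: Sensu evaluates the expressions itself, and the function name `matches_proxy_request` and the `agent01` entity are hypothetical.

```python
def matches_proxy_request(entity):
    """Python stand-in for the two entity_attributes expressions of cisco_ipsla."""
    return (
        entity["entity_class"] == "proxy"              # entity.entity_class == 'proxy'
        and "cisco_ipsla" in entity["subscriptions"]   # indexOf('cisco_ipsla') >= 0
    )


# Fields taken from the proxy entity definition above.
router01 = {
    "name": "router01",
    "entity_class": "proxy",
    "subscriptions": ["cisco_ipsla"],
}
# Hypothetical agent entity, included only as a counterexample.
agent01 = {
    "name": "agent01",
    "entity_class": "agent",
    "subscriptions": ["proxy_agent"],
}

print(matches_proxy_request(router01))  # True
print(matches_proxy_request(agent01))   # False
```

Under these expressions router01 matches, which is consistent with the interval-based scheduling publishing the check for it.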

Context

It breaks the ability to schedule the check ad hoc. We know it is not possible on a per-device basis, but it should be possible from the global check perspective.

Your Environment

- 5.21.0, compiled from source
- Cluster with 3 nodes (for sensuctl, agents and backend), embedded etcd
- RedHat 7.8 virtual machine on a RedHat Virtualization cluster
- Proxy checks are configured to be scheduled round robin with splay enabled

mcatngena commented 4 years ago

Description has been updated with object definitions

mcatngena commented 4 years ago

Hi, an important note here: we have tested the same on a "one node" Sensu Go deployment with the 5.21.0 enterprise version and we see the same unexpected behavior.

mcatngena commented 4 years ago

We also ran the test for https://github.com/sensu/sensu-go/issues/3857 and the result was the same. We think these issues are related.

mcatngena commented 4 years ago

Hi,

has anyone been able to encounter or reproduce the same behavior?

calebhailey commented 4 years ago

@mcbsd thanks for the detailed issue!

Are you making the ad hoc request from the dashboard or the /checks/:check/execute API?

mcatngena commented 4 years ago

Hi @calebhailey, we were executing the ad hoc request via the dashboard. I have just tested the same via the API directly and we get the same unexpected behavior; it is also the same when executed via sensuctl. I expected the same result, as I believe all of these end up executing the same internal mechanism. Here are the details from the API request:

curl -k -v -X POST -H "Authorization: Key $SENSU_API_KEY" -H 'Content-Type: application/json' -d '{
  "check": "cisco_ipsla"
}' https://127.0.0.1:8080/api/core/v2/namespaces/default/checks/cisco_ipsla/execute
* About to connect() to 127.0.0.1 port 8080 (#0)
*   Trying 127.0.0.1...
* Connected to 127.0.0.1 (127.0.0.1) port 8080 (#0)
* Initializing NSS with certpath: sql:/etc/pki/nssdb
* skipping SSL peer certificate verification
* SSL connection using TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384
* Server certificate:
...
> POST /api/core/v2/namespaces/default/checks/cisco_ipsla/execute HTTP/1.1
> User-Agent: curl/7.29.0
> Host: 127.0.0.1:8080
> Accept: */*
> Authorization: Key ***
> Content-Type: application/json
> Content-Length: 50
> 
* upload completely sent off: 50 out of 50 bytes
< HTTP/1.1 202 Accepted
< Date: Tue, 04 Aug 2020 13:16:58 GMT
< Content-Length: 21
< Content-Type: text/plain; charset=utf-8
< 
* Connection #0 to host 127.0.0.1 left intact
{"issued":1596547018}

And the message from the Sensu log:

Aug 04 15:16:58 server01 sensu-backend[105083]: {"check":"cisco_ipsla","component":"schedulerd","level":"warning","msg":"no matching entities, check will not be published","namespace":"default","time":"2020-08-04T15:16:58+02:00"}
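
For scripting the same ad hoc request, the curl call above can be expressed in Python. This is a minimal sketch using only the standard library: the URL path, the JSON body and the `Authorization: Key` header are taken from the curl example, while the function name `build_execute_request` and the placeholder key are hypothetical. The request is only built here, not sent.

```python
import json
import urllib.request


def build_execute_request(base_url, namespace, check, api_key):
    """Build the POST request for the check execute endpoint."""
    url = f"{base_url}/api/core/v2/namespaces/{namespace}/checks/{check}/execute"
    body = json.dumps({"check": check}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={
            "Authorization": f"Key {api_key}",
            "Content-Type": "application/json",
        },
    )


# "example-key" is a placeholder, not a real API key.
request = build_execute_request(
    "https://127.0.0.1:8080", "default", "cisco_ipsla", "example-key"
)
print(request.get_method(), request.full_url)
```

Sending it would be done with `urllib.request.urlopen(request)`, with a suitable TLS context for a self-signed backend certificate (the curl call used `-k`). As the output above shows, a 202 response only confirms that the request was issued; the schedulerd warning still has to be checked in the backend log.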

mcatngena commented 4 years ago

Hi @calebhailey, is there something more we can collect? Do you already know when this could be fixed? Thank you.

calebhailey commented 4 years ago

@mcbsd thanks for the confirmation. I don't believe we need any additional information at this time. We'll update this card again when we have an ETA for a fix.

Cheers

portertech commented 4 years ago

We would need to implement whole new functionality to enable this. In Sensu classic, this was accomplished by implementing a whole separate proxy check request publisher within the API itself. I removed the bug label and added feature to indicate a significant amount of work necessary.

mcatngena commented 4 years ago

Thank you for the validation, @portertech. Really no bad blood here, but I wonder how this could slip through, because I believe Sensu is used in environments where at least some portion of the devices are proxy entities. Manual scheduling from the check level is a key activity when working with checks that are scheduled over longer intervals and that you want to re-trigger after bug fixes in check plugins and so on. So, in short, I see this as a very essential feature. I am not glad to see that it requires a significant amount of work :/ but whenever you can bring this up, we will be very thankful. Thanks for a great product anyway :)

Looking forward to progress on this.

mcatngena commented 3 years ago

Hi, any plan with this? 😇

mcatngena commented 3 years ago

To bump this one a little :) It would be really appreciated if you could start looking into this. Proxy entities should have as much feature parity with standard agent entities as possible, and this is definitely one such feature. All the environments I've worked with had a noticeable number of network devices/appliances. Thank you.

mcatngena commented 3 years ago

@calebhailey @portertech Could you please share some news regarding this issue/feature? How do other customers use proxy devices if this functionality is missing? Is there any estimate of when this could be included in your roadmap?

Thank You

calebhailey commented 3 years ago

@mcbsd we do not have an estimated delivery date for this feature at this time. Proxy requests are a very popular feature for many of our paying customers, but so far not many (or any?) are reporting issues with ad hoc executions.

We will leave this issue open as I think it makes sense for us to improve the scheduler to support ad hoc requests.

acrawly commented 3 years ago

> We would need to implement whole new functionality to enable this. In Sensu classic, this was accomplished by implementing a whole separate proxy check request publisher within the API itself. I removed the bug label and added feature to indicate a significant amount of work necessary.

I don't mean to sound blunt, but can you elaborate on why/how this is not a bug vs. a feature request?

mcatngena commented 3 years ago

@calebhailey I understand... I just can't believe customers don't use ad hoc executions for proxy checks the same way they definitely do for standard agent entities. If you have checks scheduled every hour or two and you need to "reload" one, whether to see current values or after fixing some underlying issue (e.g. in the script in the command), there is simply no way to do it, and I think that should be key functionality of a monitoring system. If that were at least possible somehow via the API or CLI, it would make life easier.

mcatngena commented 3 years ago

Any rough update on where this sits in the roadmap? Or does anyone at least have a hint for a workaround?