sensu / sensu-go


Single instances of wrong proxy check running against entity #3150

Open adammcdonagh opened 5 years ago

adammcdonagh commented 5 years ago

As reported in Slack: I’ve been getting very strange issues with proxy entities since upgrading to 5.11.

Some entities get a single run of a check against them when none of the rules match at all.

I cannot reproduce it on demand, but it has happened twice this week on 2 completely different agents.

Expected Behavior

Checks should only run against nodes that match the proxy entity attributes exactly

Current Behavior

Random occurrences of a check being executed where at least one of the entity attributes does not match. In my case it's the "proxy_via" label.

Event:

"check": {
      "command": "/opt/sensu/plugins/ssh-wrapper.sh bXXXX109 \"~/plugins/metrics-db2.pl --metricType dbstats --clientConfigFile ~/conf.d/db2-instances.json --statsConfigFile ~/conf.d/db2-stats.json\"",
      "handlers": [],
      "high_flap_threshold": 0,
      "interval": 60,
      "low_flap_threshold": 0,
      "publish": true,
      "runtime_assets": null,
      "subscriptions": [
        "sensu-proxy-server-db2-vXXXX003"
      ],
      "proxy_entity_name": "bXXXX109",
      "check_hooks": null,
      "stdin": false,
      "subdue": null,
      "ttl": 0,
      "timeout": 30,
      "proxy_requests": {
        "entity_attributes": [
          "entity.entity_class == 'proxy'",
          "entity.labels.proxy_type.indexOf('db2-server') >= 0",
          "!('excludes' in entity.labels) || (('excludes' in entity.labels) && entity.labels.excludes.indexOf('db2-metrics-database') < 0)",
          "entity.labels.proxy_via == 'vXXXX003'"
        ],
        "splay": false,
        "splay_coverage": 0
      },
      "round_robin": false,
      "duration": 0.580670429,
      "executed": 1563343412,
      "history": null,
      "issued": 1563343412,
      "output": "",
      "status": 0,
      "total_state_change": 0,
      "last_ok": 0,
      "occurrences": 1,
      "occurrences_watermark": 1,
      "output_metric_format": "graphite_plaintext",
      "output_metric_handlers": [
        "graphite_tcp"
      ],
      "env_vars": null,
      "metadata": {
        "name": "metrics-db2-database-proxy-vXXXX003",
        "namespace": "default"
      }
    },
    "metadata": {
      "namespace": "default"
    }
  }

As you can see, there's no history. It ran once and hasn't run again since.

Actual entity definition, showing that the proxy_via label is not vXXXX003:

{
    "entity_class": "proxy",
    "system": {
      "network": {
        "interfaces": null
      }
    },
    "subscriptions": null,
    "last_seen": 0,
    "deregister": false,
    "deregistration": {},
    "metadata": {
      "name": "bXXXX109",
      "namespace": "default",
      "labels": {
        "db2_type": "",
        "proxy_type": "aix-server,hacmp-server,db2-server",
        "proxy_via": "vXXXX004"
      }
    }
  }
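
For reference, here is one way to spot-check the proxy_via label that mismatched here, i.e. which entities the expressions above should actually select, so a stray execution like the one above stands out. This is only a sketch; it assumes sensuctl and jq are installed and that the default namespace is in use:

# Sketch: list the proxy entities whose proxy_via label is vXXXX003, i.e. the set
# this check is supposed to target. bXXXX109 (proxy_via == vXXXX004) should not appear.
sensuctl entity list --format json \
  | jq -r '.[]
      | select(.entity_class == "proxy")
      | select(.metadata.labels.proxy_via == "vXXXX003")
      | .metadata.name'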

Steps to Reproduce (for bugs)

I don't have a way to reproduce it. It seems to be pretty random. Only happened twice in the past 3 days.

Context

This is causing false alerts. We have checks that tell us if a check has not been executed recently (based on the interval that's defined). Since we only get one execution of the anomalous check, we then get alerts telling us that it has not been executed recently.
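
For reference, a rough equivalent of that kind of staleness check might look like the following (a sketch assuming sensuctl and jq are available; the two-interval threshold is arbitrary):

# Sketch: list events whose last execution is older than twice the check interval.
now=$(date +%s)
sensuctl event list --format json \
  | jq -r --argjson now "$now" '.[]
      | select(($now - .check.executed) > (.check.interval * 2))
      | "\(.entity.metadata.name)/\(.check.metadata.name) last ran \($now - .check.executed)s ago"'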

Your Environment

RHEL 7, running latest v5.11 server and v5.10 agent

palourde commented 5 years ago

Hi @adammcdonagh,

I simply wanted to check in with you and determine if you are still facing this issue. If so, could you try to grab some of the logs around the time the check got scheduled for the wrong entity?
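
If the backend runs under systemd, something along these lines might help capture the relevant window (the times below are placeholders derived from the "executed" timestamp in the event above; adjust the service name and filter for your setup):

# Placeholder example: convert the event's "executed" timestamp to a time window,
# then pull backend logs for that window and filter for the names involved.
date -d @1563343412   # roughly 2019-07-17 06:03 UTC for the event above
journalctl -u sensu-backend --since "2019-07-17 05:55" --until "2019-07-17 06:15" \
  | grep -Ei "metrics-db2-database-proxy|bXXXX109"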

Thanks!

adammcdonagh commented 5 years ago

Yep, it's still happening; it seems to happen a few times a week. Next time I see it I'll get some logs (or hopefully there's still a log on the server that hasn't been housekept yet).

adammcdonagh commented 5 years ago

@palourde Just had a look. The last time it happened was Friday, but there's nothing logged around that time 😞

palourde commented 5 years ago

The fact that it's intermittent makes me believe there's possibly something wrong with the entity itself. We actually fixed a relatively similar bug in 5.10.1 (see https://github.com/sensu/sensu-go/issues/3086), but I don't think it's related: I don't see how that patch could lead to the behavior you saw, since it would always schedule the check, not just from time to time.

Do you have the full raw event for the entity that's not supposed to run?
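
For example, either of these should dump it, assuming the default namespace and API port, using the entity and check names from your paste above:

# Sketch: fetch the full event, including the entity object, for the affected pair.
sensuctl event info bXXXX109 metrics-db2-database-proxy-vXXXX003 --format json

# Or straight from the events API (default port 8080; access token obtained beforehand):
curl -s -H "Authorization: Bearer $SENSU_ACCESS_TOKEN" \
  "http://localhost:8080/api/core/v2/namespaces/default/events/bXXXX109/metrics-db2-database-proxy-vXXXX003"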

Also, can you confirm that you are running a single sensu-backend instance, on version 5.11.0? And just to make sure, did you restart your backend since you upgraded?

Thanks

adammcdonagh commented 5 years ago

I thought what I put above was the full event; I think I'm just missing the surrounding brackets. I don't have any other examples at the moment, unfortunately.

It's not just one entity that it happens to. It mainly seems to happen to proxy-type entities, but we have also seen it on entities that are running the actual agent itself.

We're running a multi-node backend with an external etcd cluster, so it's not a simple setup, unfortunately. I was looking at the logs across all hosts, though.

# rpm -qa | grep sensu
sensu-go-agent-5.10.2-4539.x86_64
sensu-go-backend-5.11.0-4888.x86_64
sensu-go-cli-5.10.2-4539.x86_64

All backends are at the same version.

It is v5.11, and it has been restarted multiple times since it was upgraded, mainly due to the other issues I have raised (checks not getting scheduled), which need restarts to fix.

I haven't upgraded the agents to 5.11 yet, as I didn't see anything in 5.11 that would resolve the issues we were having (the high CPU after loss of communication was the main reason we upgraded the backend). Do you think it's worth upgrading all the agents to v5.11 too?

palourde commented 5 years ago

All the proxy check logic lives within sensu-backend, so upgrading sensu-agent to 5.11.0, while recommended, should not change the behavior you saw.

I believe the full event might provide us more information, especially the entity object.
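
If it helps, the stored entity can also be dumped on its own for comparison with what's embedded in the event (a sketch, default namespace assumed):

# Sketch: dump the stored entity definition for the node that shouldn't have matched.
sensuctl entity info bXXXX109 --format json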

Also, https://github.com/sensu/sensu-go/issues/3107 & https://github.com/sensu/sensu-go/issues/3069 could possibly help us debug this issue once implemented.

adammcdonagh commented 5 years ago

Hi, it has just occurred again. I got this from the events API and found the actual event itself:

{
    "timestamp": 1565781653,
    "entity": {
      "entity_class": "proxy",
      "system": {
        "network": {
          "interfaces": null
        }
      },
      "subscriptions": [
        "entity:db-server-1"
      ],
      "last_seen": 0,
      "deregister": false,
      "deregistration": {},
      "metadata": {
        "name": "db-server-1",
        "namespace": "default",
        "labels": {
          "db2_type": "hadr",
          "proxy_type": "aix-server,hacmp-server,db2-server",
          "proxy_via": "proxy-server-1",
          "ssh_hostname": "db-server-1-mgmt"
        }
      }
    },
    "check": {
      "command": "/opt/sensu/plugins/ssh-wrapper.sh db-server-1-mgmt \"~/plugins/metrics-wrapper.pl --frequency 30s\"",
      "handlers": [],
      "high_flap_threshold": 0,
      "interval": 30,
      "low_flap_threshold": 0,
      "publish": true,
      "runtime_assets": null,
      "subscriptions": [
        "sensu-proxy-server-aix-proxy-server-2"
      ],
      "proxy_entity_name": "db-server-1",
      "check_hooks": null,
      "stdin": false,
      "subdue": null,
      "ttl": 0,
      "timeout": 25,
      "proxy_requests": {
        "entity_attributes": [
          "entity.entity_class == 'proxy'",
          "entity.labels.proxy_type.indexOf('aix-server') >= 0",
          "entity.labels.proxy_via == 'proxy-server-2'"
        ],
        "splay": false,
        "splay_coverage": 0
      },
      "round_robin": false,
      "duration": 2.80868138,
      "executed": 1565781650,
      "history": null,
      "issued": 1565781650,
      "output": "REMOVED",
      "status": 0,
      "total_state_change": 0,
      "last_ok": 0,
      "occurrences": 1,
      "occurrences_watermark": 1,
      "output_metric_format": "graphite_plaintext",
      "output_metric_handlers": [
        "graphite_tcp"
      ],
      "env_vars": null,
      "metadata": {
        "name": "metrics-wrapper-os-only-30s-unix-proxy-proxy-server-2",
        "namespace": "default",
        "annotations": {
          "source_role": "sensu-on-prem-proxy"
        }
      }
    },
    "metadata": {
      "namespace": "default"
    }
  }

I've switched out the server names and the output; the rest is unchanged. But you can see the entity definition has "proxy_via" set to proxy-server-1, yet proxy-server-2 appears to have executed the check, and only once (the interval is 30 seconds, so if the match were correct there'd be some history rather than just one instance).

Here are the full entity definitions of db-server-1, proxy-server-1, and proxy-server-2:

DB server:

{
    "entity_class": "proxy",
    "system": {
      "network": {
        "interfaces": null
      }
    },
    "subscriptions": null,
    "last_seen": 0,
    "deregister": false,
    "deregistration": {},
    "metadata": {
      "name": "db-server-1",
      "namespace": "default",
      "labels": {
        "db2_type": "hadr",
        "proxy_type": "aix-server,hacmp-server,db2-server",
        "proxy_via": "proxy-server-1",
        "ssh_hostname": "db-server-1-mgmt"
      }
    }
  }

Proxy Server 1:

{
    "entity_class": "agent",
    "system": {
      "hostname": "proxy-server-1",
      "os": "linux",
      "platform": "redhat",
      "platform_family": "rhel",
      "platform_version": "7.6",
      "network": {
        "interfaces": [
          {
            "name": "lo",
            "addresses": [
              "127.0.0.1/8"
            ]
          },
          {
            "name": "eth0",
            "mac": "XXXXXXXX",
            "addresses": [
              "10.XXX.0.XXX/24"
            ]
          }
        ]
      },
      "arch": "amd64"
    },
    "subscriptions": [
      "unix-hosts",
      "rhel-hosts",
      "sensu-proxy-server-db2-proxy-server-1",
      "sensu-proxy",
      "sensu-proxy-server-aix-proxy-server-1",
      "entity:proxy-server-1"
    ],
    "last_seen": 1565796089,
    "deregister": false,
    "deregistration": {},
    "user": "agent",
    "redact": [
      "password",
      "passwd",
      "pass",
      "api_key",
      "api_token",
      "access_key",
      "secret_key",
      "private_key",
      "secret"
    ],
    "metadata": {
      "name": "proxy-server-1",
      "namespace": "default"
    }
  }

Proxy Server 2:

{
    "entity_class": "agent",
    "system": {
      "hostname": "proxy-server-2",
      "os": "linux",
      "platform": "redhat",
      "platform_family": "rhel",
      "platform_version": "7.6",
      "network": {
        "interfaces": [
          {
            "name": "lo",
            "addresses": [
              "127.0.0.1/8"
            ]
          },
          {
            "name": "eth0",
            "mac": "XXXXXXXXX",
            "addresses": [
              "10.XXX.0.XXX/24"
            ]
          }
        ]
      },
      "arch": "amd64"
    },
    "subscriptions": [
      "unix-hosts",
      "rhel-hosts",
      "sensu-proxy-server-db2-proxy-server-2",
      "sensu-proxy",
      "sensu-proxy-server-aix-proxy-server-2",
      "entity:proxy-server-2"
    ],
    "last_seen": 1565796083,
    "deregister": false,
    "deregistration": {},
    "user": "agent",
    "redact": [
      "password",
      "passwd",
      "pass",
      "api_key",
      "api_token",
      "access_key",
      "secret_key",
      "private_key",
      "secret"
    ],
    "metadata": {
      "name": "proxy-server-2",
      "namespace": "default"
    }
  }

Hope this helps

adammcdonagh commented 5 years ago

I was hoping this would have been fixed by the latest release, but it's still happening, and quite frequently too.

We've got 70 proxy entities, each with about 20 checks.

jkoelndorfer commented 5 years ago

I am also seeing this issue on Sensu 5.14.1. I am happy to provide any additional information to help resolve it, but as in @adammcdonagh's case, I cannot reproduce this on demand.